Abstract: We introduce the second edition of the CheckThat! Lab, part of the 2019 Cross-Language Evaluation Forum (CLEF). CheckThat! proposes two complementary tasks. Task 1: predict which claims in a political debate should be prioritized for fact-checking. Task 2: rank Web-retrieved pages against a check-worthy claim based on their usefulness for fact-checking, extract useful passages from those pages, and then use them all to decide whether the claim is factually true or false. CheckThat! provides a full evaluation framework, consisting of data in English (derived from fact-checking sources) and Arabic (gathered and annotated from scratch) and evaluation based on mean average precision (MAP) for ranking and F\(_1\) for classification tasks.
TL;DR: An overview of the 2nd edition of the CheckThat! Lab.
Abstract: We present an overview of the 2nd edition of the CheckThat! Lab, part of CLEF 2019, with a focus on Task 1: Check-Worthiness in political debates. The task asks participants to predict which claims in a political debate should be prioritized for fact-checking. In particular, given a debate or a political speech, the goal is to produce a ranked list of its sentences based on their worthiness for fact-checking. This year, we extended the 2018 dataset with 16 more debates and speeches. A total of 47 teams registered to participate in the lab, and eleven of them actually submitted runs for Task 1 (compared to seven last year). The evaluation results show that the most successful approaches to Task 1 used various neural networks and logistic regression. The best system achieved mean average precision of 0.166 (0.250 on the speeches, and 0.054 on the debates). This leaves large room for improvement, and thus we release all datasets and scoring scripts, which should enable further research in check-worthiness estimation.
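To make the reported metric concrete, the following is a minimal Python sketch of mean average precision over ranked sentence lists. It is an illustration only, not the lab's released scoring script: the function names and the 0/1 label encoding are assumptions, and average precision is normalized by the number of check-worthy sentences that appear in the ranking (which equals the total when every sentence of a debate is ranked).

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision for one ranked list.

    ranked_labels holds 1 for a check-worthy sentence and 0 otherwise,
    in the order the system ranked the sentences (best first).
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[int]]) -> float:
    """Mean of the per-debate (or per-speech) average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)
```

For example, a ranking with labels [1, 0, 1, 0] has precision 1/1 at the first hit and 2/3 at the second, so its average precision is about 0.83.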
TL;DR: This paper approaches Task 1 of the CLEF eHealth challenge 2019, the automatic annotation of German non-technical summaries of animal experiments with ICD-10 codes, as a multi-label classification problem and leverages the multi-lingual version of the BERT text encoding model to represent the summaries.
Abstract: In this paper we present our contribution to the CLEF eHealth challenge 2019, Task 1. The task involves the automatic annotation of German non-technical summaries of animal experiments with ICD-10 codes. We approach the task as a multi-label classification problem and leverage the multi-lingual version of the BERT text encoding model [6] to represent the summaries. The model is extended by a single output layer to produce probabilities for individual ICD-10 codes. In addition, we make use of extra training data from the German Clinical Trials Register and ensemble several model instances to improve the overall performance of our approach. We compare our model with five baseline systems including a dictionary matching approach and single-label SVM and BERT classification models. Experiments on the development set highlight the advantage of our approach compared to the baselines with an improvement of 3.6%. Our model achieves the overall best performance in the challenge, reaching an F1 score of 0.80 in the final evaluation.
TL;DR: The NTSs of planned animal experiments in Germany are publicly available and have been manually assigned to ICD-10 codes and used in the scope of organizing the Multilingual Information Extraction Task in the CLEF eHealth challenge.
Abstract: Non-technical summaries (NTSs) of animal experimentation can be valuable resources to foster more transparency of research made with animals and to better inform the community about this topic. The NTSs of planned animal experiments in Germany are publicly available and have been manually assigned to ICD-10 codes. We used this data in the scope of organizing the Multilingual Information Extraction Task (Task 1) in the CLEF eHealth challenge. For the development phase, we released a training dataset containing more than 8,000 NTSs and their corresponding codes (if any assigned). For the test phase, we released 407 unseen NTSs for which the participants should submit the predictions made by their systems. The best performing system obtained a P, R, and FM of 0.83, 0.77, and 0.80, respectively.
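As a quick consistency check on the reported scores, and assuming that FM denotes the balanced F-measure (the harmonic mean of precision P and recall R), the three numbers agree with each other:

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.83 \times 0.77}{0.83 + 0.77}
  = \frac{1.2782}{1.60}
  \approx 0.80
\]
```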
TL;DR: A summary of the motivations which led to the establishment of CLEF, and a description of how it has evolved over the years, the major achievements, and what the authors see as the next challenges are provided.
Abstract: 2019 marks the 20th birthday of CLEF, an evaluation campaign activity which has applied the Cranfield evaluation paradigm to the testing of multilingual and multimodal information access systems in Europe. This paper provides a summary of the motivations which led to the establishment of CLEF, and a description of how it has evolved over the years, the major achievements, and what we see as the next challenges.
TL;DR: This report details the investigations in applying state-of-the-art pre-trained Deep Learning models to the problems of Automated Claim Detection and Fact Checking, as part of the CLEF’19 Lab: CheckThat!
Abstract: This report details our investigations in applying state-of-the-art pre-trained Deep Learning models to the problems of Automated Claim Detection and Fact Checking, as part of the CLEF’19 Lab: CheckThat!: Automatic Identification and Verification of Claims. The report provides an overview of the experiments performed on these tasks, which continue to be extremely challenging for current technology. The research focuses mainly on the use of pre-trained deep neural text embeddings that through transfer learning can allow for improved classification performance on small and unbalanced text datasets. We also investigate the effectiveness of external data sources for improving prediction accuracy on the claim detection and fact checking tasks. Our team submitted runs for every task/subtask of the challenge. The results appeared satisfactory for task 1 and promising but less satisfactory for task 2. A detailed explanation of the steps performed to obtain the submitted results is provided, including comparison tables between our submissions and other techniques investigated.
TL;DR: For the “bots and gender” subtask, a feature selection procedure is applied to reduce the number of terms, and a Zeta model is proposed to reduce the number of decisions taken by the k-NN classifier.
Abstract: When participating in the “bots and gender” subtask (in both English and Spanish), our aim is to automatically detect the source of a text (a sequence of tweets sent by a bot or by a human). When a text is identified as being sent by a human, the system must determine the author’s gender (author profiling). To solve these questions, we focus on a simple classifier (k-NN, k = 5) that is usually able to produce a correct answer, but not efficiently. Thus, we apply a feature selection procedure to reduce the number of terms (to around 200 to 500). We also propose applying a Zeta model to reduce the number of decisions taken by the k-NN classifier. In this case, we focus on terms used in one category and ignored or rarely used by the other. In addition, the Type-Token Ratio (TTR) or the lexical density (LD) shows some merit in discriminating between tweets sent by a bot (TTR < 0.2, LD ≥ 0.8) and by humans (TTR ≥ 0.2, LD < 0.8).
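The TTR rule quoted in the abstract can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the whitespace tokenizer, the function names, and the caller-supplied function-word set used for lexical density are all assumptions; only the 0.2 TTR threshold comes from the abstract.

```python
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def type_token_ratio(tokens: List[str]) -> float:
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)


def lexical_density(tokens: List[str], function_words: Set[str]) -> float:
    """Share of tokens that are not in the supplied function-word set."""
    if not tokens:
        raise ValueError("cannot compute lexical density of an empty token list")
    content = [tok for tok in tokens if tok not in function_words]
    return len(content) / len(tokens)


def looks_like_bot_by_ttr(tokens: List[str], threshold: float = 0.2) -> bool:
    """Apply the TTR < 0.2 rule from the abstract to one token sequence."""
    return type_token_ratio(tokens) < threshold
```

A highly repetitive feed such as "Buy now" repeated six times has 2 types over 12 tokens, so its TTR falls below 0.2 and the rule flags it, whereas a sentence with no repeated words has a TTR of 1.0.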
TL;DR: The authors used text information without any hand-crafted features or dictionaries to model the temporal mood variation detected from users’ posts and used two learning phases through exploration of state-of-the-art text vectorization.
Abstract: Two tasks were proposed at CLEF eRisk-2018 on predicting mental disorders using users’ posts on Reddit. Depression and anorexia are the disorders to be detected as early as possible. In this paper we present the participation of LIRMM (Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier) in both tasks. The proposed architectures and models use only text information without any hand-crafted features or dictionaries to model the temporal mood variation detected from users’ posts. The proposed models use two learning phases through exploration of state-of-the-art text vectorization. The proposed models perform comparably to other contributions, while experiments show that document-level vectorizations outperformed word-level ones.
TL;DR: This introductory chapter begins by explaining briefly what is intended by experimental evaluation in information retrieval in order to provide the necessary background for the rest of this volume.
Abstract: This introductory chapter begins by explaining briefly what is intended by experimental evaluation in information retrieval in order to provide the necessary background for the rest of this volume. The major international evaluation initiatives that have adopted and implemented in various ways this common framework are then presented and their relationship to CLEF indicated. The second part of the chapter details how the experimental evaluation paradigm has been implemented in CLEF by providing a brief overview of the main activities and results obtained over the last two decades. The aim has been to build a strong multidisciplinary research community and to create a sustainable technical framework that would not simply support but would also empower both research and development and evaluation activities, while meeting and at times anticipating the demands of a rapidly evolving information society.
TL;DR: This paper proposes an effective approach for retrieving check-worthy sentences within American political debates, which relates to the first task of the CLEF CheckThat! 2019 Lab.
Abstract: Since information can be created and shared online by anyone, a lot of time and effort are required to manually fact-check all the information encountered by users every day. Hence, an automatic fact-checking process is needed to effectively fact-check the vast information available online. However, gathering information related to every single claim can also be redundant, as not all sentences or articles are check-worthy. In this paper, we propose an effective approach for retrieving check-worthy sentences within American political debates, which relates to the first task of the CLEF CheckThat! 2019 Lab. To rank sentences based on their check-worthiness, we propose to represent each sentence by its mentioned entities using a TF-IDF representation. We use an SVM classifier to predict the check-worthiness of each sentence. Our approach ranked 4th out of 12 submissions. Our experiments show that the pronoun and coreference resolution pre-processing procedure we use as part of our approach does improve the effectiveness of sentence check-worthiness prediction. Furthermore, our results show that entity analysis features provide valuable evidence for this task.
TL;DR: This paper presents a hybrid approach which combines rule-based and supervised methods for the Check-Worthiness task of the CLEF-2019 Check That! Lab.
Abstract: In recent years, we have witnessed an incredible amount of misinformation spread over the Internet. However, it is extremely time consuming to analyze the veracity of every claim made on the Internet. Thus, we urgently need automated systems that can prioritize claims based on their check-worthiness, helping fact-checkers to focus on important claims. In this paper, we present our hybrid approach which combines rule-based and supervised methods for CLEF-2019 Check That! Lab’s Check-Worthiness task. Our primary model ranked 9th based on MAP, and 6th based on the R-P, P@5, and P@20 metrics in the official evaluation of primary submissions.
TL;DR: This paper describes the University of Sheffield’s approach to the CLEF 2019 eHealth Task 2: Technologically Assisted Reviews in Empirical Medicine, which focuses on identifying relevant studies for systematic reviews.
Abstract: This paper describes the University of Sheffield’s approach to the CLEF 2019 eHealth Task 2: Technologically Assisted Reviews in Empirical Medicine. This task focuses on identifying relevant studies for systematic reviews. The University of Sheffield participated in subtask 2 (Abstract and Title Screening). Our approach used lexical statistics (LogLikelihood, Chi-Squared and Odds-Ratio) to identify terms that retrieve specific types of evidence. A total of 12 official runs were submitted.
TL;DR: This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009, and describes the most important challenges when designing an IR system for a new language.
Abstract: This chapter describes the lessons learnt from the ad hoc track at CLEF in the years 2000 to 2009. This contribution focuses on Information Retrieval (IR) for languages other than English (monolingual IR), as well as bilingual IR (also termed “cross-lingual”; the request is written in one language and the searched collection in another), and multilingual IR (the information items are written in many different languages). During these years the ad hoc track has used mainly newspaper test collections, covering more than 15 languages. The authors themselves have designed, implemented and evaluated IR tools for all these languages during those CLEF campaigns. Based on our own experience and the lessons reported by other participants in these years, we are able to describe the most important challenges when designing an IR system for a new language. When dealing with bilingual IR, our experiments indicate that the critical point is the translation process. However, current online translation systems tend to offer rather effective translation from one language to another, especially when one of these languages is English. In order to solve the multilingual IR question, different IR architectures are possible. For the simplest approach based on query translation of individual language pairs, the crucial component is the merging of the intermediate bilingual results. When considering both document and query translation, the complexity of the whole system clearly represents a main issue.
TL;DR: This chapter details the efforts of the INEX lab in CLEF (2012–2014), as well as the ongoing activities as separate labs, under the labels Social Book Search (2015–2016), and Microblog Contextualization (2016–2018).
Abstract: INEX ran as an independent evaluation forum for 10 years before it teamed up with CLEF in 2012. Even before 2012 there was considerable collaboration between INEX and CLEF, and these collaborations increased in intensity when CLEF moved beyond its traditional cross-lingual focus in 2009/2010 shifting to include all experimental IR. This led to the merger of CLEF and INEX, and effectively to the inclusion of INEX as a large track or lab into CLEF in 2012. This chapter details the efforts of the INEX lab in CLEF (2012–2014), as well as the ongoing activities as separate labs, under the labels Social Book Search (2015–2016), and Microblog Contextualization (2016–2018).
TL;DR: The CLEF eHealth evaluation series to-date is described and then the 2019 tasks, evaluation methodology, and resources are presented.
Abstract: Since 2012, CLEF eHealth has focused on evaluation resource building efforts around the easing and support of patients, their next of kin, clinical staff, and health scientists in understanding, accessing, and authoring eHealth information in a multilingual setting. This year’s lab offers three tasks: Task 1 on multilingual information extraction; Task 2 on technology assisted reviews in empirical medicine; and Task 3 on consumer health search in mono- and multilingual settings. Herein, we describe the CLEF eHealth evaluation series to date and then present the 2019 tasks, evaluation methodology, and resources.
TL;DR: The analysis of the productivity and citation impact of CLEF in the period 2010–2017 shows that CLEF is a very strong and vibrant initiative that has managed a major change of format between 2009/2010 and continues to produce relevant research, datasets and tools.
Abstract: This chapter assesses the scholarly impact of the CLEF evaluation campaign by performing a bibliometric analysis of the citations of the CLEF 2010–2017 papers collected through Google Scholar. The analysis extends an earlier 2013 study by Tsikrika et al. of the CLEF Proceedings for the period 2000–2009 and compares the impact of the first half of CLEF to the second. It also extends the analysis by including the CLEF Working notes, a less formal but important part of the CLEF oeuvre. Results show that, despite the different nature of the peer-reviewed CLEF Proceedings papers and the less formal and much more numerous Working note papers, both types of publications have high citation impact. In particular, overview papers from the various labs and tasks in CLEF attract large amounts of citations in both Proceedings and Working Notes. A significant proportion of the total number of citations appear to be from outside CLEF—there are simply not enough CLEF papers every year to explain that many citations. In conclusion, the analysis of the productivity and citation impact of CLEF in the period 2010–2017 shows that CLEF is a very strong and vibrant initiative that has managed a major change of format between 2009/2010 and continues to produce relevant research, datasets and tools.
TL;DR: This chapter reviews tasks examining speech and video retrieval carried out within CLEF during its first 10 years, and overviews related work reported at other information retrieval benchmarks.
Abstract: CLEF was initiated with the intention of providing a catalyst to research in Cross-Language Information Retrieval (CLIR) and Multilingual Information Retrieval (MIR). Focusing principally on European languages, it initially provided CLIR benchmark tasks to the research community within an annual cycle of task design, conduct and reporting. While the early focus was on textual data, the emergence of technologies to enable collection, archiving and content processing of multimedia content led to several initiatives which sought to address search for spoken and visual content. Similar to the interest in multilingual search for text, interest arose in working multilingually with multimedia content. To support research in these areas CLEF introduced a number of tasks in multilingual search for multimedia content. While investigation of image retrieval has formed the focus of the ImageCLEF task over many years, this chapter reviews tasks examining speech and video retrieval carried out within CLEF during its first 10 years, and overviews related work reported at other information retrieval benchmarks.
TL;DR: PIR-CLEF 2019 provided registered participants with two tracks: the Web Search Task and the Medical Search Task, which focuses on personalisation within an ad hoc search task introduced in previous editions of the CLEF eHealth Lab.
Abstract: The Personalised Information Retrieval Lab (PIR-CLEF 2019) is an initiative aimed at both providing and critically analysing the evaluation of Personalization in Information Retrieval (PIR) applications. PIR-CLEF 2019 is the second edition of the Lab after the successful Pilot lab organised at CLEF 2017 and the first edition of the Lab at CLEF 2018. PIR-CLEF 2019 provided registered participants with two tracks: the Web Search Task and the Medical Search Task. The Web Search Task continues the activities introduced in the previous editions of the PIR-CLEF Lab, while the Medical Search Track focuses on personalisation within an ad hoc search task introduced in previous editions of the CLEF eHealth Lab.
TL;DR: In this paper, the authors describe the properties of their multilingual classification model submitted for PAN2019, which is able to distinguish bots from humans and females from males; it extracts 18 features from the user's posts and, by applying a machine learning algorithm, obtains good performance results.
Abstract: Unfortunately, in social networks, software bots (or just bots) are becoming more and more common because malicious people have seen their usefulness to spread false messages, spread rumors and even manipulate public opinion. Even though the text generated by users in social networks is a rich source of information that can be used to identify different aspects of its authors, not being able to recognize which users are truly humans and which are not is a big drawback. In this work, we describe the properties of our multilingual classification model submitted for PAN2019, which is able to distinguish bots from humans and females from males. This solution extracts 18 features from the user’s posts and, by applying a machine learning algorithm, obtains good performance results.
TL;DR: An approach for the PAN 2019 Author Profiling challenge is presented to detect Twitter bots and also to classify the gender of human Twitter users as male or female.
Abstract: In this paper we present an approach for the PAN 2019 Author Profiling challenge. The task here is to detect Twitter bots and also to classify the gender of human Twitter users as male or female, b ...
TL;DR: This paper describes the experiences with the interactive IR tracks organized at CLEF from 2013-2016 and aggregates the lessons learned with each consecutive instance of the lab, to provide practical insights and lessons for future collaborative interactive IR evaluation exercises and for potential re-use scenarios.
Abstract: This paper describes our experiences with the interactive IR tracks organized at CLEF from 2013-2016 and aggregates the lessons learned with each consecutive instance of the lab. We end with a summary of practical insights and lessons for future collaborative interactive IR evaluation exercises and for potential re-use scenarios.
TL;DR: This paper describes an experiment to evaluate the performance of three different types of semantic vectors or word embeddings (random indexing, GloVe, and ELMo) and two different classification arch models.
Abstract: This paper describes an experiment to evaluate the performance of three different types of semantic vectors or word embeddings (random indexing, GloVe, and ELMo) and two different classification arch ...
TL;DR: This work presents a variation of the system presented last year, in particular, not only is the maximum amount of documents that the physician is willing to read set, but the effort is distributed proportionally to the number of documents in the pool.
Abstract: This is the third participation of the Information Management Systems (IMS) group in the CLEF eHealth task on Technologically Assisted Reviews in Empirical Medicine. This task focuses on the problem of medical systematic reviews, a problem which requires a recall close (if not equal) to 100%. Semi-automated approaches are essential to support these types of searches when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present a variation of the system we presented last year; in particular, we not only set the maximum number of documents that the physician is willing to read, but also distribute the effort across the topics proportionally to the number of documents in the pool. We compare the results of this approach with the “frozen” system we used in 2018 and a BM25 baseline.
TL;DR: Contextualized string embeddings are used for event sentence detection and event extraction; the models were trained on a data corpus collected from Indian news sources, but evaluated on data obtained from news sources from other countries as well, such as China.
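The idea of spreading a fixed reading budget across topics in proportion to pool size can be sketched as follows. This is a minimal illustration under stated assumptions, not the IMS system: the function name, the dictionary input, and the largest-remainder rounding used to keep the allocations integral are all hypothetical choices made for this example.

```python
from typing import Dict


def allocate_reading_budget(max_docs: int, pool_sizes: Dict[str, int]) -> Dict[str, int]:
    """Split a total reading budget across topics proportionally to pool size.

    max_docs is the maximum number of documents the reviewer will read in
    total; pool_sizes maps each topic to the number of documents in its pool.
    """
    total = sum(pool_sizes.values())
    if total == 0 or max_docs <= 0:
        return {topic: 0 for topic in pool_sizes}
    # Never plan to read more documents than exist.
    budget = min(max_docs, total)
    shares: Dict[str, int] = {}
    remainders = []
    for topic, size in pool_sizes.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        quotient, remainder = divmod(budget * size, total)
        shares[topic] = quotient
        remainders.append((remainder, topic))
    leftover = budget - sum(shares.values())
    # Largest remainder first; the stable sort keeps input order on ties.
    remainders.sort(key=lambda pair: -pair[0])
    for _, topic in remainders[:leftover]:
        shares[topic] += 1
    return shares
```

With a budget of 100 documents and two topics whose pools hold 1000 and 3000 documents, the sketch assigns 25 and 75 documents respectively.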
Abstract: In this work we describe our results achieved in the ProtestNews Lab at CLEF 2019. To tackle the problems of event sentence detection and event extraction we decided to use contextualized string embeddings. The models were trained on a data corpus collected from Indian news sources, but evaluated on data obtained from news sources from other countries as well, such as China. Our models have obtained competitive results and have scored 3rd in the event sentence detection task and 1st in the event extraction task based on average F1-scores for different test datasets.
TL;DR: The DenseNet architecture with competitive performance and relatively low number of parameters is augmented with a non-local block in an attempt to tackle the data deficient challenge in PlantCLEF 2019.
Abstract: Image-based plant identification is a promising tool for the automation of agriculture and environmental conservation. In an attempt to tackle the data-deficiency challenge in PlantCLEF 2019, the DenseNet architecture, which offers competitive performance with a relatively low number of parameters, is augmented with a non-local block. A variety of data sampling schemes are also evaluated as part of the work. The evaluation of the model and the methods is detailed in the paper.
TL;DR: The third participation of the Information Management Systems (IMS) group at CLEF eHealth 2019 Task 1.1 is described, in which participants are required to label with ICD-10 codes health-related documents with the focus on the German language and on non-technical summaries of animal experiments.
Abstract: In this paper, we describe the third participation of the Information Management Systems (IMS) group at CLEF eHealth 2019 Task 1. In this task, participants are required to label with ICD-10 codes health-related documents with the focus on the German language and on non-technical summaries (NTPs) of animal experiments. We tackled this task by focusing on reproducibility aspects, as we did in previous years. This time, we tried three different probabilistic Naïve Bayes classifiers that use different hypotheses on the distribution of terms in the documents and the collection. The experimental evaluation showed a significantly different behavior of the classifiers during the training phase and the test phase. We are currently investigating possible sources of biases introduced in the training phase as well as out-of-vocabulary issues and change in the terminology from the training set to the test set.
TL;DR: The Question Answering track at CLEF ran for 13 years, from 2003 until 2015, and is divided into four eras; the description and the main results for each of these eras are provided, together with the pilot exercises and other Question Answering tasks that ran in CLEF.
Abstract: The Question Answering track at CLEF ran for 13 years, from 2003 until 2015. Along these years, many different tasks, resources and evaluation methodologies were developed. We divide the CLEF Question Answering campaigns into four eras: (1) Ungrouped mainly factoid questions asked against monolingual newspapers (2003–2006), (2) Grouped questions asked against newspapers and Wikipedias (2007–2008), (3) Ungrouped questions against multilingual parallel-aligned EU legislative documents (2009–2010), and (4) Questions about a single document using a related document collection as background information (2011–2015). We provide the description and the main results for each of these eras, together with the pilot exercises and other Question Answering tasks that ran in CLEF. Finally, we conclude with some of the lessons learnt along these years.
TL;DR: This study proposes k-nearest neighbor (k-NN) and threshold (t-NN) approaches to classify animal experiment summaries into their correct ICD-10 codes; two further methods are proposed to control and adjust the labels retrieved from the result documents in order to assign ICD-10 codes to the query document.
Abstract: Information retrieval search systems have recently become powerful at retrieving full-text results for a particular query (or a query document). Elasticsearch is an open-source information retrieval search system that is built on Apache Lucene and works as a distributed search and analytics engine at the same time. Therefore, this engine can also be used as a machine learning approach to solve challenges such as document classification. This study is published as a working-notes paper for CLEF eHealth 2019 Task 1 on Multilingual Information Extraction, and it proposes k-nearest neighbor (k-NN) and threshold (t-NN) approaches to classify animal experiment summaries into their correct ICD-10 codes. In addition, two further methods are proposed to control and adjust the labels retrieved from the result documents in order to assign ICD-10 codes to the query document. These approaches achieve high precision, recall and F-measure in our experiments on the development dataset.
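One way to picture label assignment from retrieved neighbours is the sketch below. It is an assumption-laden illustration, not the method of the paper: the abstract does not define the t-NN variant or the label-adjustment methods, so the score cut-off, the `min_votes` rule, and all function names here are hypothetical. The input is a list of (retrieval score, ICD-10 codes) pairs for documents returned by a search engine, sorted by descending score.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Neighbour = Tuple[float, Sequence[str]]


def knn_labels(neighbours: List[Neighbour], k: int, min_votes: int) -> List[str]:
    """Assign every code that at least min_votes of the top-k neighbours carry."""
    votes = Counter(
        code for _, codes in neighbours[:k] for code in set(codes)
    )
    return sorted(code for code, count in votes.items() if count >= min_votes)


def threshold_labels(neighbours: List[Neighbour], min_score: float, min_votes: int) -> List[str]:
    """Same voting, but over all neighbours whose score reaches min_score.

    The score cut-off is an assumed reading of the threshold variant.
    """
    kept = [(score, codes) for score, codes in neighbours if score >= min_score]
    return knn_labels(kept, len(kept), min_votes)
```

In this sketch the k-NN variant fixes how many neighbours vote, while the threshold variant lets the retrieval scores decide how many neighbours are trusted.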