TL;DR: What Happened in CLEF 2004?
Abstract: What Happened in CLEF 2004?.- What Happened in CLEF 2004?.- I. Ad Hoc Text Retrieval Tracks.- CLEF 2004: Ad Hoc Track Overview and Results Analysis.- Selection and Merging Strategies for Multilingual Information Retrieval.- Using Surface-Syntactic Parser and Deviation from Randomness.- Cross-Language Retrieval Using HAIRCUT at CLEF 2004.- Experiments on Statistical Approaches to Compensate for Limited Linguistic Resources.- Application of Variable Length N-Gram Vectors to Monolingual and Bilingual Information Retrieval.- Integrating New Languages in a Multilingual Search System Based on a Deep Linguistic Analysis.- IR-n r2: Using Normalized Passages.- Using COTS Search Engines and Custom Query Strategies at CLEF.- Report on Thomson Legal and Regulatory Experiments at CLEF-2004.- Effective Translation, Tokenization and Combination for Cross-Lingual Retrieval.- Two-Stage Refinement of Transitive Query Translation with English Disambiguation for Cross-Language Information Retrieval: An Experiment at CLEF 2004.- Dictionary-Based Amharic - English Information Retrieval.- Dynamic Lexica for Query Translation.- SINAI at CLEF 2004: Using Machine Translation Resources with a Mixed 2-Step RSV Merging Algorithm.- Mono- and Crosslingual Retrieval Experiments at the University of Hildesheim.- University of Chicago at CLEF2004: Cross-Language Text and Spoken Document Retrieval.- UB at CLEF2004: Cross Language Information Retrieval Using Statistical Language Models.- MIRACLE's Hybrid Approach to Bilingual and Monolingual Information Retrieval.- Searching a Russian Document Collection Using English, Chinese and Japanese Queries.- Dublin City University at CLEF 2004: Experiments in Monolingual, Bilingual and Multilingual Retrieval.- Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServerTM at CLEF 2004.- Data Fusion for Effective European Monolingual Information Retrieval.- The XLDB Group at CLEF 2004.- The University of Glasgow at CLEF 2004: French 
Monolingual Information Retrieval with Terrier.- II. Domain-Specific Document Retrieval.- The Domain-Specific Track in CLEF 2004: Overview of the Results and Remarks on the Assessment Process.- University of Hagen at CLEF 2004: Indexing and Translating Concepts for the GIRT Task.- IRIT at CLEF 2004: The English GIRT Task.- Ricoh at CLEF 2004.- GIRT and the Use of Subject Metadata for Retrieval.- III. Interactive Cross-Language Information Retrieval.- iCLEF 2004 Track Overview: Pilot Experiments in Interactive Cross-Language Question Answering.- Interactive Cross-Language Question Answering: Searching Passages Versus Searching Documents.- Improving Interaction with the User in Cross-Language Question Answering Through Relevant Domains and Syntactic Semantic Patterns.- Cooperation, Bookmarking, and Thesaurus in Interactive Bilingual Question Answering.- Summarization Design for Interactive Cross-Language Question Answering.- Interactive and Bilingual Question Answering Using Term Suggestion and Passage Retrieval.- IV. 
Multiple Language Question Answering.- Overview of the CLEF 2004 Multilingual Question Answering Track.- A Question Answering System for French.- Cross-Language French-English Question Answering Using the DLT System at CLEF 2004.- Experiments on Robust NL Question Interpretation and Multi-layered Document Annotation for a Cross-Language Question/Answering System.- Making Stone Soup: Evaluating a Recall-Oriented Multi-stream Question Answering System for Dutch.- The DIOGENE Question Answering System at CLEF-2004.- Cross-Lingual Question Answering Using Off-the-Shelf Machine Translation.- Bulgarian-English Question Answering: Adaptation of Language Resources.- Answering French Questions in English by Exploiting Results from Several Sources of Information.- Finnish as Source Language in Bilingual Question Answering.- miraQA: Experiments with Learning Answer Context Patterns from the Web.- Question Answering for Spanish Supported by Lexical Context Annotation.- Question Answering Using Sentence Parsing and Semantic Network Matching.- First Evaluation of Esfinge - A Question Answering System for Portuguese.- University of Evora in QA@CLEF-2004.- COLE Experiments at QA@CLEF 2004 Spanish Monolingual Track.- Does English Help Question Answering in Spanish?.- The TALP-QA System for Spanish at CLEF 2004: Structural and Hierarchical Relaxing of Semantic Constraints.- ILC-UniPI Italian QA.- Question Answering Pilot Task at CLEF 2004.- Evaluation of Complex Temporal Questions in CLEF-QA.- V. Cross-Language Retrieval in Image Collections.- The CLEF 2004 Cross-Language Image Retrieval Track.- Caption and Query Translation for Cross-Language Image Retrieval.- Pattern-Based Image Retrieval with Constraints and Preferences on ImageCLEF 2004.- How to Visually Retrieve Images from the St. 
Andrews Collection Using GIFT.- UNED at ImageCLEF 2004: Detecting Named Entities and Noun Phrases for Automatic Query Expansion and Structuring.- Dublin City University at CLEF 2004: Experiments with the ImageCLEF St. Andrew's Collection.- From Text to Image: Generating Visual Query for Image Retrieval.- Toward Cross-Language and Cross-Media Image Retrieval.- FIRE - Flexible Image Retrieval Engine: ImageCLEF 2004 Evaluation.- MIRACLE Approach to ImageCLEF 2004: Merging Textual and Content-Based Image Retrieval.- Cross-Media Feedback Strategies: Merging Text and Image Information to Improve Image Retrieval.- ImageCLEF 2004: Combining Image and Multi-lingual Search for Medical Image Retrieval.- Multi-modal Information Retrieval Using FINT.- Medical Image Retrieval Using Texture, Locality and Colour.- SMIRE: Similar Medical Image Retrieval Engine.- A Probabilistic Approach to Medical Image Retrieval.- UB at CLEF2004 Cross Language Medical Image Retrieval.- Content-Based Queries on the CasImage Database Within the IRMA Framework.- Comparison and Combination of Textual and Visual Features for Interactive Cross-Language Image Retrieval.- MSU at ImageCLEF: Cross Language and Interactive Image Retrieval.- VI. Cross-Language Spoken Document Retrieval.- CLEF 2004 Cross-Language Spoken Document Retrieval Track.- VII. Issues in CLIR and in Evaluation.- The Key to the First CLEF with Portuguese: Topics, Questions and Answers in CHAVE.- How Do Named Entities Contribute to Retrieval Effectiveness?.
TL;DR: An overview of the 2005 QA track is provided, the procedure followed to build the test sets and the results are presented, showing that the best systems did not always provide the most reliable confidence score.
Abstract: The general aim of the third CLEF Multilingual Question Answering Track was to set up a common and replicable evaluation framework to test both monolingual and cross-language Question Answering (QA) systems that process queries and documents in several European languages. Nine target languages and ten source languages were exploited to enact 8 monolingual and 73 cross-language tasks. Twenty-four groups participated in the exercise. Overall results showed a general increase in performance in comparison to last year. The best performing monolingual system irrespective of target language answered 64.5% of the questions correctly (in the monolingual Portuguese task), while the average of the best performances for each target language was 42.6%. The cross-language step instead entailed a considerable drop in performance. In addition to accuracy, the organisers also measured the relation between the correctness of an answer and a system’s stated confidence in it, showing that the best systems did not always provide the most reliable confidence score. We provide an overview of the 2005 QA track, detail the procedure followed to build the test sets and present a general analysis of the results.
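The abstract does not say which measure was used to relate answer correctness to stated confidence. Purely as an illustration, the sketch below computes a confidence-weighted score of the kind popularised by the TREC 2002 QA track: answers are sorted by decreasing confidence, and the precision over the first i answers is averaged over all i. The function name and the input layout (a list of `(confidence, is_correct)` pairs) are assumptions made for this example, not details taken from the track.

```python
def confidence_weighted_score(judged):
    """Average of running precision over answers sorted by decreasing confidence.

    `judged` is a list of (confidence, is_correct) pairs, one per question.
    This layout is a hypothetical choice for the sketch.
    """
    if not judged:
        return 0.0
    # Most confident answers first; Python's sort is stable for ties.
    ranked = sorted(judged, key=lambda pair: pair[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct_so_far += 1
        total += correct_so_far / i  # precision over the first i answers
    return total / len(ranked)
```

A system that is accurate but places its wrong answers at high confidence scores lower under such a measure than one with the same accuracy and well-ordered confidences, which is the kind of gap the abstract reports.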
TL;DR: The aim of this CLEF track is to explore the use of both text and content-based retrieval methods for cross-language image retrieval.
Abstract: This paper outlines efforts from the 2005 CLEF cross-language image retrieval campaign (ImageCLEF). The aim of this CLEF track is to explore the use of both text and content-based retrieval methods for cross-language image retrieval. Four tasks were offered in ImageCLEF: ad-hoc retrieval from an historic photographic collection, ad-hoc retrieval from a medical collection, an automatic image annotation task, and a user-centered (interactive) evaluation task. 24 research groups from a variety of backgrounds and nationalities (14 countries) participated in ImageCLEF. This paper presents the ImageCLEF tasks, submissions from participating groups and a summary of the main findings.
TL;DR: QRISTAL is a question answering system that makes intensive use of natural language processing both for indexing documents and for extracting answers; it recently ranked first in the EQueR evaluation campaign (Evalda, Technolangue [3]).
Abstract: QRISTAL [8] is a question answering system making intensive use of natural language processing both for indexing documents and extracting answers. It recently ranked first in the EQueR evaluation campaign (Evalda, Technolangue [3]). This article proposes a functional description of the system. Then, it presents our results for the CLEF 2005 campaign and a critical description of the system. QRISTAL is possibly the first Question Answering system available on the consumer market. That fact generates drastic constraints and explains the technical choices we detail here.
TL;DR: The objectives were to attempt to measure progress in multilingual information retrieval by comparing the results for CLEF 2005 submissions with those of participants in earlier workshops, and also to encourage participants to explore multilingual list merging techniques.
Abstract: We describe the objectives and organization of the CLEF 2005 ad hoc track and discuss the main characteristics of the tasks offered to test monolingual, bilingual, and multilingual textual document retrieval. The performance achieved for each task is presented and a statistical analysis of results is given. The mono- and bilingual tasks followed the pattern of previous years but included target collections for two new-to-CLEF languages: Bulgarian and Hungarian. The multilingual tasks concentrated on exploring the reuse of existing test collections from an earlier CLEF campaign. The objectives were to attempt to measure progress in multilingual information retrieval by comparing the results for CLEF 2005 submissions with those of participants in earlier workshops, and also to encourage participants to explore multilingual list merging techniques.
TL;DR: A system to perform the task of generating a diverse range of textual and graphical summaries of a patient’s clinical history from a data-encoded model, a chronicle, representing the record of the patient's medical history is described.
Abstract: At the centre of the Clinical e-Science Framework (CLEF) project is a repository of well organised, detailed clinical histories, encoded as data that will be available for use in clinical care and in-silico medical experiments. We describe a system that we have developed as part of the CLEF project, to perform the task of generating a diverse range of textual and graphical summaries of a patient’s clinical history from a data-encoded model, a chronicle, representing the record of the patient’s medical history. Although the focus of our current work is on cancer patients, the approach we describe is generalisable to a wide range of medical areas.
TL;DR: This paper describes the team's official runs for QA@CLEF 2003, in which they took part in the monolingual Dutch Question Answering task.
Abstract: This paper describes the official runs of our team for QA@CLEF 2003. We took part in the monolingual Dutch Question Answering task.
TL;DR: The task design, experimental methodology, and the results obtained by the research groups participating in the iCLEF Interactive Track are summarized.
Abstract: The CLEF Interactive Track (iCLEF) is devoted to the comparative study of user-inclusive cross-language search strategies. In 2005, we studied two cross-language search tasks: retrieval of answers and retrieval of annotated images. In both tasks, no further translation or post-processing is needed after the task is performed to fulfill the information need.
In the interactive Question Answering task, users are asked to find the answer to a number of questions in a foreign-language document collection, and write the answers in their own native language. In the interactive image retrieval task, a picture is shown to the user, and then the user is asked to find the picture in the collection.
This paper summarizes the task design, experimental methodology, and the results obtained by the research groups participating in the track.
TL;DR: The overall approach adopted by CLEF to meet data protection requirements is summarized, including the data flows, pseudonymisation measures and additional monitoring policies that are currently being developed.
Abstract: Objectives: CLEF is an MRC sponsored project in the E-Science programme that aims to establish methodologies and a technical infrastructure for the next generation of integrated clinical and bioscience research. Methods: The heart of the CLEF approach to this challenge is to design and develop a pseudonymised repository of histories of cancer patients that can be accessed by researchers. Robust mechanisms and policies have been developed to ensure that patient privacy and confidentiality are preserved while delivering a repository of such medically rich information for the purposes of scientific research. Results: This paper summarises the overall approach adopted by CLEF to meet data protection requirements, including the data flows, pseudonymisation measures and additional monitoring policies that are currently being developed. Conclusion: Once evaluated, it is hoped that the CLEF approach can serve as a model for other distributed electronic health record repositories to be accessed for research.
TL;DR: The extensions made to the 2004 QA@CLEF German/English QA‐system are described, especially the question‐type driven selection of answer strategies.
Abstract: This report describes the work done by the QA group of the Language Technology Lab at DFKI for the 2005 edition of the Cross-Language Evaluation Forum (CLEF). We describe the extensions made to our 2004 QA@CLEF German/English QA-system, especially the question-type driven selection of answer strategies. Furthermore, we describe details concerning the processing of definition and temporal questions, and we present and discuss the results obtained in the monolingual German, bilingual English/German, and bilingual German/English tasks.
TL;DR: One outcome is the considerable improvement in results when the retrieval systems make use of the thesauri provided or the intellectually assigned descriptors; other findings for IR in a domain-specific context are also given.
Abstract: The challenge of the CLEF domain-specific track is to map user queries in one language to documents in different languages adapting the systems used to the vocabulary and wording of the social science domain. In addition to a general overview of this track and its tasks, some details on the approaches of the participating groups and their results are reported. One of the outcomes is the considerable improvement in results if the retrieval systems make use of the thesauri provided or the intellectually assigned descriptors. Other findings for IR in a domain-specific context are also given. Finally, considerations on the topic creation and assessment processes are made on the basis of empirical data mainly from the GIRT corpus.
TL;DR: Hummingbird participated in the 4 monolingual information retrieval tasks (Bulgarian, French, Hungarian and Portuguese) of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2005.
Abstract: Hummingbird participated in the 4 monolingual information retrieval tasks (Bulgarian, French, Hungarian and Portuguese) of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2005. In the ad hoc retrieval tasks, the system was given 50 natural language queries, and the goal was to find all of the relevant documents (with high precision) in a particular document set. We conducted diagnostic experiments with different techniques for matching word variations and handling stopwords. We found that the experimental stemmers significantly increased mean average precision for the 4 languages. Analysis of individual topics found that the algorithmic Bulgarian and Hungarian stemmers encountered some unanticipated stopword collisions. A comparison to an experimental 4-gram technique suggested that Hungarian stemming would further benefit from decompounding. A blind feedback technique which significantly increased mean average precision for some languages was also significantly detrimental to the rank of the first relevant retrieved for one language.
TL;DR: This paper proposes an effective and general stopword list as well as a light stemming procedure for the Hungarian, Bulgarian and Portuguese (Brazilian) languages and investigates the variations in retrieval effectiveness that resulted when the authors included or excluded manually assigned keywords attached to the bibliographic records.
Abstract: For our fifth participation in the CLEF evaluation campaigns, our first objective was to propose an effective and general stopword list as well as a light stemming procedure for the Hungarian, Bulgarian and Portuguese (Brazilian) languages. Our second objective was to obtain a better picture of the relative merit of various search engines when processing documents in those languages. To do so we evaluated our scheme using two probabilistic models and five vector-processing approaches. In the bilingual track, we evaluated both the machine translation and bilingual dictionary approaches applied to automatically translate a query submitted in English into various target languages. Finally, using the GIRT corpora (available in English, German and Russian), we investigated the variations in retrieval effectiveness that resulted when we included or excluded manually assigned keywords attached to the bibliographic records (mainly comprising a title and an abstract).
TL;DR: Several new approaches based on linguistic structures and statistics in the B, I/E, and B/E tasks are experimented with.
Abstract: We participated in four Question Answering tasks at CLEF 2005: the Italian monolingual (I), Italian-English bilingual (I/E), Bulgarian monolingual (B), and Bulgarian-English bilingual (B/E) tasks. While we did not change the approach in the Italian task (I), we experimented with several new approaches based on linguistic structures and statistics in the B, I/E, and B/E tasks.
TL;DR: For the 2005 Cross-Language Evaluation Forum, Thomson Legal and Regulatory participated in the Hungarian, French, and Portuguese monolingual search tasks as well as French-to-Portuguese bilingual retrieval.
Abstract: For the 2005 Cross-Language Evaluation Forum, Thomson Legal and Regulatory participated in the Hungarian, French, and Portuguese monolingual search tasks as well as French-to-Portuguese bilingual retrieval. Our Hungarian participation focused on comparing the effectiveness of different approaches toward morphological stemming. Our French and Portuguese monolingual efforts focused on different approaches to Pseudo-Relevance Feedback (PRF), in particular the evaluation of a scheme for selectively applying PRF only in the cases most likely to produce positive results. Our French-to-Portuguese bilingual effort applies our previous work in query translation to a new pair of languages and uses corpus-based language modeling to support term-by-term translation. We compare our approach to an off-the-shelf machine translation system that translates the query as a whole and find the latter approach to be more performant. All experiments were performed using our proprietary search engine. We remain encouraged by the overall success of our efforts, with our main submissions for each of the four tasks performing above the overall CLEF median. However, none of the specific enhancement techniques we attempted in this year’s forum showed significant improvements over our initial result.
TL;DR: This paper describes the main components of the system built by the DLT Group at Limerick for participation in the QA Task at CLEF and reports experiments with retrieval query expansion using Local Context Analysis.
Abstract: This paper describes the main components of the system built by the DLT Group at Limerick for participation in the QA Task at CLEF. The document indexing we used was again sentence-by-sentence but this year the Lucene Engine was adopted. We also experimented with retrieval query expansion using Local Context Analysis. Results were broadly similar to last year.
TL;DR: The TALP-QA system is a multilingual open-domain QA system that processes both factoid and definition questions and the approach to factoid questions is based on in-depth NLP tools and resources to create semantic information representation.
Abstract: This paper describes the TALP-QA system in the context of the CLEF 2005 Spanish Monolingual Question Answering (QA) evaluation task. TALP-QA is a multilingual open-domain QA system that processes both factoid (normal and temporally restricted) and definition questions. The approach to factoid questions is based on in-depth NLP tools and resources to create semantic information representation. Answers to definition questions are selected from the phrases that match a pattern from a manually constructed set of definitional patterns.
TL;DR: The organization of the CLEF 2005 evaluation campaign is described and details are provided concerning the tracks, test collections, evaluation infrastructure and participation.
Abstract: The organization of the CLEF 2005 evaluation campaign is described and details are provided concerning the tracks, test collections, evaluation infrastructure and participation.
TL;DR: In most cases, Title-only topic statements performed better than combined Title and Description topics, and absolute performance improved when the automatic document transcriptions were combined with additional metadata fields.
Abstract: The Dublin City University participation in the CLEF 2005 CL-SR task concentrated on exploring the application of our existing information retrieval methods based on the Okapi model to the conversational speech data set. This required an approach to determining approximate sentence boundaries within the free-flowing automatic transcription provided to enable us to use our summary-based pseudo relevance feedback (PRF). We also performed exploratory experiments on the use of the metadata provided with the document transcriptions for indexing and relevance feedback. Topics were translated into English using Systran V3.0 machine translation. In most cases, Title-only topic statements performed better than combined Title and Description topics. PRF using our adapted method is shown to be effective, and absolute performance is improved by combining the automatic document transcriptions with additional metadata fields.
TL;DR: Algorithmic stemmers for Hungarian were developed and used in the ad-hoc monolingual task for CLEF 2005; stemming over a broad range of suffixes, especially on nouns, was found to be highly useful.
Abstract: We developed algorithmic stemmers for Hungarian and used them for the ad-hoc monolingual task for CLEF 2005. Our goal was to determine what degree of stemming is the most effective. Although on average the stemmers did not perform as well as the best n-gram, we found that stemming over a broad range of suffixes, especially on nouns, is highly useful.
TL;DR: New freely available translation sources are explored together with a combined query translation approach in order to obtain a better translation of the user's information need and variations in retrieval effectiveness when including or excluding manually assigned keywords attached to bibliographic records are investigated.
Abstract: For our fifth participation in the CLEF evaluation campaigns, the first objective was to propose an effective and general stopword list along with a light stemming procedure for the Hungarian, Bulgarian and Portuguese (Brazilian) languages. Our second objective was to obtain a better picture of the relative merit of various search engines when processing documents in those languages. To do so we evaluated our scheme using two probabilistic models and nine vector-processing approaches. In the bilingual track, we evaluated both the machine translation and bilingual dictionary approaches to automatically translate a query submitted in English into various target languages. This year we explored new freely available translation sources, together with a combined query translation approach in order to obtain a better translation of the user's information need. Finally, using the GIRT corpora (available in English, German and Russian), we investigated variations in retrieval effectiveness when including or excluding manually assigned keywords attached to bibliographic records (mainly comprising a title and an abstract).
TL;DR: The CLEF query editing interface makes use of natural language generation techniques in order to alleviate some of the problems generally faced by natural language and graphical query interfaces.
Abstract: At the centre of the Clinical e-Science Framework (CLEF) project is a repository of well organised, detailed clinical histories, encoded as data that will be available for use in clinical care and in-silico medical experiments. An integral part of the CLEF workbench is a tool to allow biomedical researchers and clinicians to query – in an intuitive way – the repository of patient data. This paper describes the CLEF query editing interface, which makes use of natural language generation techniques in order to alleviate some of the problems generally faced by natural language and graphical query interfaces. The query interface also incorporates an answer renderer that dynamically generates responses in both natural language text and graphics.
TL;DR: This work presents the participation of the University of Ottawa in the Cross-Language Spoken Document Retrieval task at CLEF 2005, and the results on phonetic transcripts of the collection and queries and on the combination of text and phonetic transcript.
Abstract: We present the participation of the University of Ottawa in the Cross-Language Spoken Document Retrieval task at CLEF 2005. In order to translate the queries, we combined the results of several online Machine Translation tools. For the Information Retrieval component we used the SMART system [1], with several weighting schemes for indexing the documents and the queries. One scheme in particular led to better results than other combinations. We present the results of the submitted runs and of many unofficial runs. We compare the effect of several translations from each language. We present results on phonetic transcripts of the collection and queries and on the combination of text and phonetic transcripts. We also include the results when the manual summaries and keywords are indexed.
TL;DR: Experimental results show that while standard text-based PRF improves upon a no feedback text-only baseline, at present the new approach to combining evidence from text-based and content-based retrieval does not give further improvement.
Abstract: The aim of Dublin City University’s participation in the CLEF 2005 ImageCLEF St Andrew’s Collection task was to explore an alternative approach to exploiting text annotation and content-based retrieval in a novel combined way for pseudo relevance feedback (PRF). This method combines evidence from retrieved lists generated using text-based and content-based retrieval to determine which documents will be assumed relevant for the PRF process. Unfortunately, the experimental results show that while standard text-based PRF improves upon a no feedback text-only baseline, at present our new approach to combining evidence from text-based and content-based retrieval does not give further improvement.
TL;DR: The ImageCLEF tasks, the submissions from participating groups and the main findings are described; the track explores the use of both text and content-based retrieval methods for cross-language image retrieval.
Abstract: The purpose of this paper is to outline efforts from the 2005 CLEF cross-language image retrieval campaign (ImageCLEF). The aim of this CLEF track is to explore the use of both text and content-based retrieval methods for cross-language image retrieval. Four tasks were offered in the ImageCLEF track: ad-hoc retrieval from an historic photographic collection, ad-hoc retrieval from a medical collection, an automatic image annotation task, and a user-centered (interactive) evaluation task that is explained in the iCLEF summary. 24 research groups from a variety of backgrounds and nationalities (14 countries) participated in ImageCLEF. In this paper we describe the ImageCLEF tasks, submissions from participating groups and summarise the main findings.
TL;DR: In this article, the main technique is the fusion of multiple probabilistic searches against different XML components using both Logistic Regression (LR) algorithms and a version of the Okapi BM-25 algorithm.
Abstract: In this paper I will describe the Berkeley (group 1) approach to the GeoCLEF task for CLEF 2005. The main technique we are testing is the fusion of multiple probabilistic searches against different XML components using both Logistic Regression (LR) algorithms and a version of the Okapi BM-25 algorithm. We also combine multiple translations of queries in cross-language searching. Since this is the first time that the Cheshire system has been used for CLEF, this approach can, at best, be considered a very preliminary base testing of some retrieval algorithms and approaches. The primary geographically based approaches taken for GeoCLEF were to georeference proper nouns in the text using a gazetteer derived from the World Gazetteer with both English and German names for each place, and to expand place names for regions or countries in the queries by the names of the countries or cities in those regions or countries.
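The abstract does not specify how the result lists from the different searches are fused. As a generic illustration only, the sketch below applies CombSUM with min-max normalisation, a common fusion baseline swapped in here for exposition; it is not claimed to be the method used in the paper. The function names and the dict-per-run input layout are hypothetical.

```python
def min_max(scores):
    """Rescale one run's scores to [0, 1] so runs become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents tied: give each the same full weight.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse several runs (dicts of doc id -> score) by summing normalised scores."""
    fused = {}
    for run in runs:
        if not run:
            continue  # an empty run contributes nothing
        for doc, score in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Normalising before summing matters when the component searches produce scores on different scales, as would be the case for logistic-regression probabilities and BM25 weights.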
TL;DR: Character n-grams remain an attractive option for representing documents and queries in the newly added Bulgarian and Hungarian languages; in monolingual tests they were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%).
Abstract: JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several non-traditional CLEF query languages such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs made use of subword translation, statistical translation of character n-grams using aligned corpora, when parallel data were available, and web-based machine translation, when no suitable data could be found.
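To make the character n-gram representation concrete, here is a minimal sketch of n-gram tokenisation. The details are assumptions for illustration, not taken from the abstract: text is lowercased, whitespace runs are collapsed to a single `_` boundary marker, the string is padded with `_` at both ends, and n defaults to 4. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of a text.

    Assumed preprocessing (not from the source): lowercase, collapse
    whitespace to '_', and pad both ends with '_' so that word
    boundaries are visible inside the n-grams.
    """
    normalised = "_".join(text.lower().split())
    padded = f"_{normalised}_"
    if len(padded) < n:
        # Text shorter than one n-gram: keep it as a single token.
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

Because the tokens are produced without any stemmer, stopword list or dictionary, the same indexing code can be applied unchanged to a language that is new to the collection, which is the language-neutral property the abstract emphasises.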
TL;DR: Experiments with four stemmers of varying impact indicate that stemmers focusing on noun inflection are as effective as more broadly oriented stemmers, and that extensive stemming is especially beneficial for Hungarian monolingual retrieval.
Abstract: We describe our official runs for the ad hoc monolingual task in Hungarian for CLEF 2005. We conducted experiments with four stemmers of varying impact. The experiments indicate that stemmers focusing on noun inflection are as effective as more broadly oriented stemmers, and that extensive stemming is especially beneficial for Hungarian monolingual retrieval.
TL;DR: This paper combines passages of different sizes to improve the information retrieval process at CLEF-2005 for English, French, Portuguese, Bulgarian and Hungarian, and reports considerably improved scores.
Abstract: This paper describes our participation in monolingual tasks at CLEF-2005. In this research we worked with the following languages: English, French, Portuguese, Bulgarian and Hungarian. Our work focused on combining passages of different sizes to improve the Information Retrieval process. Having studied the experiments carried out and the official results at CLEF, we found that this combined model improves the scores achieved considerably.
TL;DR: Analysis of individual topics found that the algorithmic Bulgarian and Hungarian stemmers encountered some unanticipated stopword collisions and a comparison to an experimental 4-gram technique suggested that Hungarian stemming would further benefit from decompounding.
Abstract: Hummingbird participated in the 4 monolingual information retrieval tasks (Bulgarian, French, Hungarian and Portuguese) of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2005. In the ad hoc retrieval tasks, the system was given 50 natural language queries, and the goal was to find all of the relevant documents (with high precision) in a particular document set. We conducted diagnostic experiments with different techniques for matching word variations and handling stopwords. We found that the experimental stemmers significantly increased mean average precision for the 4 languages. Analysis of individual topics found that the algorithmic Bulgarian and Hungarian stemmers encountered some unanticipated stopword collisions. A comparison to an experimental 4-gram technique suggested that Hungarian stemming would further benefit from decompounding. A blind feedback technique which significantly increased mean average precision for some languages was also significantly detrimental to the rank of the first relevant retrieved for one language.