Open Access, Posted Content
Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format
Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen P. de Vries
TL;DR: This paper recommends the Common Index File Format (CIFF), a data exchange specification for sharing inverted indexes across systems, as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
Abstract: There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and builds wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.
Citations
•Posted Content
Pretrained Transformers for Text Ranking: BERT and Beyond
TL;DR: This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.
474
CC-News-En: A Large English News Corpus
Joel Mackenzie, Rodger Benham, Matthias Petri, Johanne R. Trippas, J. Shane Culpepper, Alistair Moffat
- 19 Oct 2020
TL;DR: A static, open-access news corpus is described using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages, to support offline effectiveness experiments and hence batch evaluation campaigns.
64
Efficiency Implications of Term Weighting for Passage Retrieval
Joel Mackenzie, Zhuyun Dai, Luke Gallagher, Jamie Callan
- 25 Jul 2020
TL;DR: This work conducts an investigation of query processing efficiency over DeepCT indexes, revealing how term re-weighting can impact query processing latency, and exploring how DeepCT can be used as a static index pruning technique to accelerate query processing without harming search effectiveness.
30
Feature Extraction for Large-Scale Text Collections
Luke Gallagher, Antonio Mallia, J. Shane Culpepper, Torsten Suel, B. Barla Cambazoglu
- 19 Oct 2020
TL;DR: Fxt is introduced, an open-source framework to perform efficient and scalable feature extraction that can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.
References
Anserini: Reproducible Ranking Baselines Using Lucene
Peilin Yang, Hui Fang, Jimmy Lin
TL;DR: Anserini is described, an information retrieval toolkit built on Lucene that allows researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections and demonstrates that Lucene provides a suitable framework for supporting information retrieval research.
275
Improvements to BM25 and Language Models Examined
Andrew Trotman, Antti Puurula, Blake Burgess
- 26 Nov 2014
TL;DR: This investigation finds that, once trained (using particle swarm optimization), there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.
Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge
Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, Sebastiano Vigna
- 20 Mar 2016
TL;DR: The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs.
•Proceedings Article
PISA: Performant indexes and search for academia
Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, Torsten Suel
- 01 Jan 2019
TL;DR: This paper outlines the effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure.
A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation
Matt Crane, J. Shane Culpepper, Jimmy Lin, Joel Mackenzie, Andrew Trotman
- 02 Feb 2017
TL;DR: Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects, making it an appealing solution for performance-sensitive applications where bounds on query latency are desirable.
58