Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

Open Access•Posted Content

- 18 Mar 2020

4

TL;DR: The Common Index File Format (CIFF), described in this paper, is a low-effort approach to supporting independent innovation while enabling the kinds of fair evaluations that are critical for driving the field forward.

Citations

•Posted Content

Pretrained Transformers for Text Ranking: BERT and Beyond

Jimmy Lin, +2 more

- 13 Oct 2020

- arXiv: Information Retrieval

TL;DR: This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.

474

Proceedings Article•10.1145/3340531.3412762

CC-News-En: A Large English News Corpus

Joel Mackenzie, +5 more

- 19 Oct 2020

TL;DR: A static, open-access news corpus is described using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages, to support offline effectiveness experiments and hence batch evaluation campaigns.

64

Proceedings Article•10.1145/3397271.3401263

Efficiency Implications of Term Weighting for Passage Retrieval

Joel Mackenzie, +3 more

- 25 Jul 2020

TL;DR: This work conducts an investigation of query processing efficiency over DeepCT indexes, revealing how term re-weighting can impact query processing latency, and exploring how DeepCT can be used as a static index pruning technique to accelerate query processing without harming search effectiveness.

30

Proceedings Article•10.1145/3340531.3412773

Feature Extraction for Large-Scale Text Collections

Luke Gallagher, +4 more

- 19 Oct 2020

TL;DR: Fxt is introduced, an open-source framework to perform efficient and scalable feature extraction that can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems.

References

•Journal Article•10.1145/3239571

Anserini: Reproducible Ranking Baselines Using Lucene

Peilin Yang, +2 more

- 29 Oct 2018

- Journal of Data and Information Quality

TL;DR: This paper describes Anserini, an information retrieval toolkit built on Lucene that allows researchers to easily reproduce results with modern bag-of-words ranking models on diverse test collections, and demonstrates that Lucene provides a suitable framework for supporting information retrieval research.

275

Proceedings Article•10.1145/2682862.2682863

Improvements to BM25 and Language Models Examined

Andrew Trotman, +2 more

- 26 Nov 2014

TL;DR: This investigation finds that once trained (using particle swarm optimization) there is very little difference in performance between these functions, that relevance feedback is effective, that stemming is effective, and that it remains unclear which function is best overall.

214

•Book Chapter•10.1007/978-3-319-30671-1_30

Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge

Jimmy Lin, +8 more

- 20 Mar 2016

TL;DR: The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2, and the product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines such that with a single script, anyone with a copy of the collection can reproduce the submitted runs.

112

•Proceedings Article

PISA: Performant indexes and search for academia

Antonio Mallia, +3 more

- 01 Jan 2019

TL;DR: This paper outlines the effort to create a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure.

73

Proceedings Article•10.1145/3018661.3018726

A Comparison of Document-at-a-Time and Score-at-a-Time Query Evaluation

Matt Crane, +4 more

- 02 Feb 2017

TL;DR: Overall, JASS is slightly slower than either WAND or BMW, but exhibits much lower variance in query latencies and is much less susceptible to tail query effects, making it an appealing solution for performance-sensitive applications where bounds on query latency are desirable.

58