TL;DR: This special issue is devoted to the TREC Genomics Track, which ran from 2003 to 2007, a period in which TREC expanded with the growth of new information needs.
Abstract: Recent advances in biotechnology have changed the fundamental nature of biological research. Whereas scientists used to be able to manage their modest amount of experimental data in paper notebooks or simple spreadsheets, new tools such as gene chips for measuring gene expression (Mobasheri et al. 2004) or sequence variation (Pennisi 2007) have fundamentally altered their work. Not only do these gene chips generate massive amounts of data (as much as tens of thousands of data points per biological sample), they uncover potential associations and interactions with a wide variety of genes, diseases, and other biological entities. The field devoted to managing, utilizing, and evaluating this data is called bioinformatics (Baxevanis and Ouellette 2005), which is sometimes described as the intersection of biology (or biomedicine) and computer science. The growth of biological data has resulted in a correspondingly large increase in scientific knowledge in what biologists sometimes call the bibliome or literature of biology. This requires new approaches to dealing with the biomedical literature, which is the main point of intersection between this field and that of information retrieval (IR) and related disciplines such as text mining. In the early part of this decade, it became apparent that this situation was ripe for a track at the Text REtrieval Conference (TREC, www.trec.nist.gov), a challenge evaluation for IR organized by the U.S. National Institute of Standards and Technology (NIST, http://www.nist.gov/) (Voorhees and Harman 2005). Started in 1992, TREC has provided a series of challenge evaluations and a forum for presentation of their results. TREC is organized as an annual event at which the tasks are specified and queries and documents are provided to participants.
While TREC has historically focused most of its research on textual documents, the field has expanded in recent years with the growth of new information needs (e.g., question-answering, cross-lingual), data types (e.g., sequence data, video) and platforms (e.g., the Web) (Hersh 2003). This special issue is devoted to the TREC Genomics Track, which ran from 2003 to 2007.
TL;DR: Approaches to ad hoc retrieval vary widely in efficacy, and the TREC Genomics Track and its test collection resources provide tools that allow improvement in information retrieval systems.
Abstract: The goal of the TREC Genomics Track is to improve information retrieval in the area of genomics by creating test collections that will allow researchers to improve and better understand failures of their systems. The 2004 track included an ad hoc retrieval task, simulating use of a search engine to obtain documents about biomedical topics. This paper describes the Genomics Track of the Text Retrieval Conference (TREC) 2004, a forum for evaluation of IR research systems, where retrieval in the genomics domain has recently begun to be assessed. A total of 27 research groups submitted 47 different runs. The most effective runs, as measured by the primary evaluation measure of mean average precision (MAP), used a combination of domain-specific and general techniques. The best MAP obtained by any run was 0.4075. Techniques that expanded queries with gene name lists as well as words from related articles had the best efficacy. However, many runs performed more poorly than a simple baseline run, indicating that careful selection of system features is essential. Various approaches to ad hoc retrieval provide a diversity of efficacy. The TREC Genomics Track and its test collection resources provide tools that allow improvement in information retrieval systems.
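The primary evaluation measure named in this abstract, mean average precision (MAP), can be illustrated with a minimal Python sketch. It assumes binary relevance judgments and uses the standard textbook definition of average precision; the function names, topic IDs, and document IDs are hypothetical and are not taken from the track's own tooling or data.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one topic, given a ranked list of document IDs.

    Precision is sampled at each rank where a relevant document appears and
    divided by the total number of relevant documents, so relevant documents
    that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-topic average precision over every judged topic.

    A topic with judgments but no submitted ranking scores zero.
    """
    if not qrels:
        return 0.0
    total = sum(
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical toy data: two topics with made-up document IDs.
example_run = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5"]}
example_qrels = {"t1": {"d1", "d3"}, "t2": {"d5"}}
print(mean_average_precision(example_run, example_qrels))
```

Because every judged topic contributes equally to the mean, a run that does very well on a few topics and poorly on many can still score below a simple baseline.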
TL;DR: Not only was research hypothesis-driven, but the presentation of its results was as well, in the sense that data were generally offered in the context of some hypothesis, with considerable ‘value-added’ explanation, interpretation, and speculation.
Abstract: Consider the natural history of the communication of biological data. In the pre-genomic era, the scientific literature was essentially the only means by which experimental results were conveyed to the community at large. Data laboriously acquired by what we now view as ‘low-throughput’ methods were filtered through the experimentalist’s expository and interpretive medium, i.e., the standard journal article. Not only was research hypothesis-driven, but the presentation of its results was as well, in the sense that data were generally offered in the context of some hypothesis, with considerable ‘value-added’ explanation, interpretation, and speculation.
TL;DR: Protein family prediction is found to be far superior with keywords extracted from PubMed abstracts than with GO annotations or MeSH keyterms, suggesting that the text itself is superior to GO and MeSH as a literature mining resource, at least for detecting protein family membership.
Abstract: Literature mining is expected to help not only with automatically sifting through huge biomedical literature and annotation databases, but also with linking biochemical entities to appropriate functional hypotheses. However, there has been very limited success in testing literature mining methods due to the lack of large, objectively validated test sets or "gold standards". To improve this situation we created a large-scale test of literature mining methods and resources. We report on a specific implementation of this test: how well can the Pfam protein family classification be replicated from independently mining different literature/annotation resources? We test and compare different keyterm sets as well as different algorithms for issuing protein family predictions. We find that protein families can indeed be automatically predicted from the literature. Using words from PubMed abstracts, of 3663 proteins tested, over 75% were correctly assigned to one of 618 Pfam families. For 90% of proteins the correct Pfam family was among the top 5 ranked families. We found that protein family prediction is far superior with keywords extracted from PubMed abstracts than with GO annotations or MeSH keyterms, suggesting that the text itself (in combination with the vector space model) is superior to GO and MeSH as a literature mining resource, at least for detecting protein family membership. Finally, we show that Shannon's entropy can be exploited to improve prediction by facilitating the integration of the different literature sources tested.
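The abstract does not say how Shannon's entropy is used to integrate the literature sources, so the following Python sketch shows only one plausible reading and should not be taken as the authors' method. It assumes each source yields non-negative scores over candidate families, treats the normalized scores as a probability distribution, and trusts the source whose distribution has the lowest entropy. All names, scores, and family labels are hypothetical.

```python
import math
from typing import Dict, Tuple


def shannon_entropy(scores: Dict[str, float]) -> float:
    """Shannon entropy, in bits, of non-negative scores normalized to sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    entropy = 0.0
    for score in scores.values():
        if score > 0:
            p = score / total
            entropy -= p * math.log2(p)
    return entropy


def pick_most_confident_source(
    per_source: Dict[str, Dict[str, float]]
) -> Tuple[str, str]:
    """Return (source, predicted family) from the lowest-entropy source.

    Assumption for illustration: a peaked (low-entropy) score distribution
    signals a more confident prediction than a flat one.
    """
    best_source = min(per_source, key=lambda name: shannon_entropy(per_source[name]))
    best_scores = per_source[best_source]
    predicted_family = max(best_scores, key=lambda family: best_scores[family])
    return best_source, predicted_family


# Hypothetical scores for one protein from two sources.
example_scores = {
    "abstracts": {"family_A": 0.9, "family_B": 0.1},
    "go": {"family_A": 0.5, "family_B": 0.5},
}
print(pick_most_confident_source(example_scores))
```

Other integration schemes, such as weighting each source's scores by the inverse of its entropy, would fit the abstract's description equally well.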
TL;DR: In this paper, a semi-automatic machine learning workflow and a NER+Ontology boosting technique for the automatic classification of biomedical literature are proposed; the approach was validated on the curation of 4,115 gluten-related documents extracted from PubMed and compared against a word-embedding alternative.