TL;DR: The Harmonizome is a comprehensive resource of knowledge about genes and proteins that enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation.
Abstract: Genomics, epigenomics, transcriptomics, proteomics and metabolomics efforts rapidly generate a plethora of data on the activity and levels of biomolecules within mammalian cells. At the same time, curation projects that organize knowledge from the biomedical literature into online databases are expanding. Hence, there is a wealth of information about genes, proteins and their associations, with an urgent need for data integration to achieve better knowledge extraction and data reuse. For this purpose, we developed the Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins from over 70 major online resources. We extracted, abstracted and organized data into ∼72 million functional associations between genes/proteins and their attributes. Such attributes could be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes, or changes in expression after drug treatment. We stored these associations in a relational database along with rich metadata for the genes/proteins, their attributes and the original resources. The freely available Harmonizome web portal provides a graphical user interface, a web service and a mobile app for querying, browsing and downloading all of the collected data. To demonstrate the utility of the Harmonizome, we computed and visualized gene-gene and attribute-attribute similarity networks, and through unsupervised clustering, identified many unexpected relationships by combining pairs of datasets such as the association between kinase perturbations and disease signatures. We also applied supervised machine learning methods to predict novel substrates for kinases, endogenous ligands for G-protein coupled receptors, mouse phenotypes for knockout genes, and classified unannotated transmembrane proteins for likelihood of being ion channels. 
The Harmonizome is a comprehensive resource of knowledge about genes and proteins, and as such, it enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation. Database URL: http://amp.pharm.mssm.edu/Harmonizome.
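The gene-gene similarity networks mentioned in the abstract can be illustrated with a minimal sketch. It assumes each gene is represented by the set of attributes it is associated with and uses cosine similarity between those binary sets; the abstract does not state which similarity measure was used, and the gene names, attribute names and the `similarity_edges` helper are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy associations (gene -> set of attributes); not Harmonizome data.
associations = {
    "GENE_A": {"attr1", "attr2", "attr3"},
    "GENE_B": {"attr2", "attr3", "attr4"},
    "GENE_C": {"attr5"},
}


def cosine_similarity(a, b):
    """Cosine similarity between two binary attribute sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_edges(assoc, threshold=0.0):
    """Return (gene1, gene2, score) for every gene pair scoring above threshold."""
    edges = []
    for g1, g2 in combinations(sorted(assoc), 2):
        score = cosine_similarity(assoc[g1], assoc[g2])
        if score > threshold:
            edges.append((g1, g2, score))
    return edges


edges = similarity_edges(associations)
for g1, g2, score in edges:
    print(f"{g1} -- {g2}: {score:.3f}")
```

Swapping the roles of genes and attributes in the input dictionary would give the attribute-attribute network in the same way.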
TL;DR: The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects and generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets.
Abstract: The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based ann ...
TL;DR: The BC5CDR corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
Abstract: Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation. The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for diseases and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community. Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/.
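As an illustration of the Jaccard-based inter-annotator agreement reported above, the following sketch compares two annotators' entity sets. It assumes each annotation is a (start, end, concept identifier) tuple and that only exact matches count as agreement; the abstract does not specify the matching criteria, and the example annotations are invented.

```python
def jaccard_agreement(annotations_a, annotations_b):
    """Jaccard similarity coefficient between two sets of annotations."""
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return 1.0  # two empty annotation sets agree trivially
    return len(a & b) / len(a | b)


# Hypothetical annotations: (start offset, end offset, concept identifier).
annotator_1 = {(0, 8, "ID_1"), (15, 27, "ID_2"), (40, 52, "ID_3")}
annotator_2 = {(0, 8, "ID_1"), (15, 27, "ID_2"), (60, 70, "ID_4")}

score = jaccard_agreement(annotator_1, annotator_2)
print(f"IAA (Jaccard): {score:.2%}")
```

Here two of the four distinct annotations are shared, so the agreement is 2/4.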
TL;DR: A new, open-source implementation of ScaleNet is presented, which normalized the data model, began the process of correcting invalid data, upgraded the user interface, and added online administrative tools to make the ScaleNet data more accurate and extendable.
Abstract: Scale insects (Hemiptera: Coccoidea) are small herbivorous insects found on all continents except Antarctica. They are extremely invasive, and many species are serious agricultural pests. They are also emerging models for studies of the evolution of genetic systems, endosymbiosis and plant-insect interactions. ScaleNet was launched in 1995 to provide insect identifiers, pest managers, insect systematists, evolutionary biologists and ecologists efficient access to information about scale insect biological diversity. It provides comprehensive information on scale insects taken directly from the primary literature. Currently, it draws from 23,477 articles and describes the systematics and biology of 8194 valid species. For 20 years, ScaleNet ran on the same software platform. That platform is no longer viable. Here, we present a new, open-source implementation of ScaleNet. We have normalized the data model, begun the process of correcting invalid data, upgraded the user interface, and added online administrative tools. These improvements make ScaleNet easier to use and maintain and make the ScaleNet data more accurate and extendable. Database URL: http://scalenet.info.
TL;DR: This task was found to be successful in engaging the text-mining research community, producing a large annotated corpus and improving the results of automatic disease recognition and CDR extraction.
Abstract: Manually curating chemicals, diseases and their relationships is significantly important to biomedical research, but it is plagued by its high cost and the rapid growth of the biomedical literature. In recent years, there has been a growing interest in developing computational approaches for automatic chemical-disease relation (CDR) extraction. Despite these attempts, the lack of a comprehensive benchmarking dataset has limited the comparison of different techniques in order to assess and advance the current state-of-the-art. To this end, we organized a challenge task through BioCreative V to automatically extract CDRs from the literature. We designed two challenge tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. To assist system development and assessment, we created a large annotated text corpus that consisted of human annotations of chemicals, diseases and their interactions from 1500 PubMed articles. In total, 34 teams worldwide participated in the CDR task: 16 (DNER) and 18 (CID). The best systems achieved an F-score of 86.46% for the DNER task, a result that approaches the human inter-annotator agreement (0.8875), and an F-score of 57.03% for the CID task, the highest results ever reported for such tasks. When combining team results via machine learning, the ensemble system was able to further improve over the best team results by achieving 88.89% and 62.80% in F-score for the DNER and CID task, respectively. Additionally, another novel aspect of our evaluation is to test each participating system’s ability to return real-time results: the average response times for each team’s DNER and CID web service systems were 5.6 and 9.3 s, respectively. Most teams used hybrid systems for their submissions based on machine learning.
Given the level of participation and results, we found our task to be successful in engaging the text-mining research community, producing a large annotated corpus and improving the results of automatic disease recognition and CDR extraction.
Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
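The F-scores quoted above combine precision and recall. A minimal sketch of that computation follows, assuming predictions and gold-standard annotations are compared as exact-match sets of (chemical, disease) pairs; the task's actual matching rules are not given in the abstract, and the identifiers below are invented.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and balanced F-score for exact-match sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (chemical, disease) relation pairs.
gold = {("C1", "D1"), ("C2", "D2"), ("C3", "D3"), ("C4", "D4")}
predicted = {("C1", "D1"), ("C2", "D2"), ("C9", "D9")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

With two of three predictions correct and two of four gold pairs found, precision is 2/3, recall is 1/2 and the F-score is 4/7.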
TL;DR: A database named ATtRACT is developed that compiles information on 370 RBPs and 1583 RBP consensus binding motifs, 192 of which are not present in any other database, and provides efficient algorithms to search a specific motif and scan one or more RNA sequences at a time.
Abstract: This work was supported by grants from the European Union's Seventh Framework Programme (CardioNext ITN-608027, CardioNeT ITN-289600), from the Spanish Ministry of Science and Innovation (SAF2012-31451) and from the Regional Government of Madrid (2010-BMD-2321 'Fibroteam') to E.L.-P. Funding for open access charge: EU FP7, grant CardioNext ITN-608027.
TL;DR: Compared to other interaction databases, the data collection approach ensures HPIDB 2.0 users access the most comprehensive HPI data from a wide range of pathogens and their hosts (594 pathogen and 70 host species, as of February 2016).
Abstract: Identification and analysis of host–pathogen interactions (HPI) is essential to study infectious diseases. However, HPI data are sparse in existing molecular interaction databases, especially for agricultural host–pathogen systems. Therefore, resources that annotate, predict and display the HPI that underpin infectious diseases are critical for developing novel intervention strategies. HPIDB 2.0 (http://www.agbase.msstate.edu/hpi/main.html) is a resource for HPI data, and contains 45,238 manually curated entries in the current release. Since the first description of the database in 2010, multiple enhancements to HPIDB data and interface services were made that are described here. Notably, HPIDB 2.0 now provides targeted biocuration of molecular interaction data. As a member of the International Molecular Exchange consortium, annotations provided by HPIDB 2.0 curators meet community standards to provide detailed contextual experimental information and facilitate data sharing. Moreover, HPIDB 2.0 provides access to rapidly available community annotations that capture minimum molecular interaction information to address immediate researcher needs for HPI network analysis. In addition to curation, HPIDB 2.0 integrates HPI from existing external sources and contains tools to infer additional HPI where annotated data are scarce. Compared to other interaction databases, our data collection approach ensures HPIDB 2.0 users access the most comprehensive HPI data from a wide range of pathogens and their hosts (594 pathogen and 70 host species, as of February 2016). Improvements also include enhanced search capacity, addition of Gene Ontology functional information, and implementation of network visualization. The changes made to HPIDB 2.0 content and interface ensure that users, especially agricultural researchers, are able to easily access and analyse high quality, comprehensive HPI data.
All HPIDB 2.0 data are updated regularly, are publicly available for direct download, and are disseminated to other molecular interaction resources.
Database URL: http://www.agbase.msstate.edu/hpi/main.html
TL;DR: The application of recent advances in proteomics to enhance gene annotation and classification in GeneCards is reported, and a new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite.
Abstract: GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publicly available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information.
The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL: http://www.genecards.org/.
TL;DR: The NPInter database is updated to version 3.0, which contains experimentally verified interactions between ncRNAs (excluding tRNAs and rRNAs), especially long noncoding RNAs (lncRNAs) and other biomolecules and new web services are added, including a local UCSC Genome Browser to visualize binding sites.
Abstract: Despite the fact that a large quantity of noncoding RNAs (ncRNAs) have been identified, their functions remain unclear. To enable researchers to have a better understanding of ncRNAs' functions, we updated the NPInter database to version 3.0, which contains experimentally verified interactions between ncRNAs (excluding tRNAs and rRNAs), especially long noncoding RNAs (lncRNAs) and other biomolecules (proteins, mRNAs, miRNAs and genomic DNAs). In NPInter v3.0, interactions pertaining to ncRNAs are not only manually curated from scientific literature but also curated from high-throughput technologies. In addition, we also curated lncRNA-miRNA interactions from in silico predictions supported by AGO CLIP-seq data. When compared with NPInter v2.0, the interactions are more informative (with additional information on tissues or cell lines, binding sites, conservation, co-expression values and other features) and more organized (with divisions on data sets by data sources, tissues or cell lines, experiments and other criteria). NPInter v3.0 expands the data set to 491,416 interactions in 188 tissues (or cell lines) from 68 kinds of experimental technologies. NPInter v3.0 also improves the user interface and adds new web services, including a local UCSC Genome Browser to visualize binding sites. Additionally, NPInter v3.0 defined a high-confidence set of interactions and predicted the functions of lncRNAs in human and mouse based on the interactions curated in the database. NPInter v3.0 is available at http://www.bioinfo.org/NPInter/. Database URL: http://www.bioinfo.org/NPInter/.
TL;DR: ‘HistoneDB 2.0 – with variants’ is a comprehensive database of histone protein sequences, classified by histone types and variants, supplemented by rich sequence and structural annotations with many interactive tools to explore and compare sequences of different variants from various organisms.
Abstract: Compaction of DNA into chromatin is a characteristic feature of eukaryotic organisms. The core (H2A, H2B, H3, H4) and linker (H1) histone proteins are responsible for this compaction through the formation of nucleosomes and higher order chromatin aggregates. Moreover, histones are intricately involved in chromatin functioning and provide a means for genome dynamic regulation through specific histone variants and histone post-translational modifications. ‘HistoneDB 2.0 – with variants’ is a comprehensive database of histone protein sequences, classified by histone types and variants. All entries in the database are supplemented by rich sequence and structural annotations with many interactive tools to explore and compare sequences of different variants from various organisms. The core of the database is a manually curated set of histone sequences grouped into 30 different variant subsets with variant-specific annotations. The curated set is supplemented by an automatically extracted set of histone sequences from the non-redundant protein database using algorithms trained on the curated set. The interactive web site supports various searching strategies in both datasets: browsing of phylogenetic trees; on-demand generation of multiple sequence alignments with feature annotations; classification of histone-like sequences and browsing of the taxonomic diversity for every histone variant. HistoneDB 2.0 is a resource for the interactive comparative analysis of histone protein sequences and their implications for chromatin function.
Database URL: http://www.ncbi.nlm.nih.gov/projects/HistoneDB2.0
TL;DR: This article describes BioSharing, a manually curated, searchable portal of three linked registries that harnesses community curation to collate and cross-reference resources across the life sciences from around the world, with a particular focus on community-led curation.
Abstract: BioSharing (http://www.biosharing.org) is a manually curated, searchable portal of three linked registries. These resources cover standards (terminologies, formats and models, and reporting guidelines), databases, and data policies in the life sciences, broadly encompassing the biological, environmental and biomedical sciences. Launched in 2011 and built by the same core team as the successful MIBBI portal, BioSharing harnesses community curation to collate and cross-reference resources across the life sciences from around the world. BioSharing makes these resources findable and accessible (the core of the FAIR principle). Every record is designed to be interlinked, providing a detailed description not only on the resource itself, but also on its relations with other life science infrastructures. Serving a variety of stakeholders, BioSharing cultivates a growing community, to which it offers diverse benefits. It is a resource for funding bodies and journal publishers to navigate the metadata landscape of the biological sciences; an educational resource for librarians and information advisors; a publicising platform for standard and database developers/curators; and a research tool for bench and computer scientists to plan their work. BioSharing is working with an increasing number of journals and other registries, for example linking standards and databases to training material and tools. Driven by an international Advisory Board, the BioSharing user-base has grown by over 40% (by unique IP address) in the last year, thanks to successful engagement with researchers, publishers, librarians, developers and other stakeholders via several routes, including a joint RDA/Force11 working group and a collaboration with the International Society for Biocuration. In this article, we describe BioSharing, with a particular focus on community-led curation. Database URL: https://www.biosharing.org.
TL;DR: The implementation of a global pharmacological heatmap is reported, supporting a user-friendly navigation of chemogenomics space and enabling secondary pharmacological profiling predictions from molecular structure.
Abstract: ChemProt is a publicly available compilation of chemical-protein-disease annotation resources that enables the study of systems pharmacology for a small molecule across multiple layers of complexity from molecular to clinical levels. In this third version, ChemProt has been updated to more than 1.7 million compounds with 7.8 million bioactivity measurements for 19,504 proteins. Here, we report the implementation of a global pharmacological heatmap, supporting a user-friendly navigation of chemogenomics space. This facilitates the visualization and selection of chemicals that share similar structural properties. In addition, the user has the possibility to search by compound, target, pathway, disease and clinical effect. Genetic variations associated with target proteins were integrated, making it possible to plan pharmacogenetic studies and to suggest human response variability to drugs. Finally, Quantitative Structure-Activity Relationship models for 850 proteins having sufficient data were implemented, enabling secondary pharmacological profiling predictions from molecular structure. Database URL: http://potentia.cbs.dtu.dk/ChemProt/.
TL;DR: A newly developed data submission tool was made available for the consortium that features a high level of automation to lower the barriers of data publication and enable PGP to fulfil the FAIR data principles: findable, accessible, interoperable, reusable.
Abstract: Plant genomics and phenomics represent the most promising tools for accelerating yield gains and overcoming emerging crop productivity bottlenecks. However, accessing this wealth of plant diversity requires the characterization of this material using state-of-the-art genomic, phenomic and molecular technologies and the release of subsequent research data via a long-term stable, open-access portal. Although several international consortia and public resource centres offer services for plant research data management, valuable digital assets remain unpublished and thus inaccessible to the scientific community. Recently, the Leibniz Institute of Plant Genetics and Crop Plant Research and the German Plant Phenotyping Network have jointly initiated the Plant Genomics and Phenomics Research Data Repository (PGP) as infrastructure to comprehensively publish plant research data. This covers in particular cross-domain datasets that are not being published in central repositories because of their volume or unsupported data scope, such as image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry as well as software and documents. The repository is hosted at the Leibniz Institute of Plant Genetics and Crop Plant Research using e!DAL as software infrastructure and a Hierarchical Storage Management System as data archival backend. A newly developed data submission tool was made available for the consortium that features a high level of automation to lower the barriers of data publication. After an internal review process, data are published as citable digital object identifiers and a core set of technical metadata is registered at DataCite. The e!DAL-embedded Web frontend generates a landing page for each dataset and supports interactive exploration.
PGP is registered as a research data repository at BioSharing.org, re3data.org and OpenAIRE as a valid EU Horizon 2020 open data archive. The above features, the programmatic interface and the support of standard metadata formats enable PGP to fulfil the FAIR data principles: findable, accessible, interoperable, reusable. Database URL: http://edal.ipk-gatersleben.de/repos/pgp/.
TL;DR: The development of the interactive task, from planning to execution, is described and major findings for the systems tested are discussed.
Abstract: Fully automated text mining (TM) systems promote efficient literature searching, retrieval, and review but are not sufficient to produce ready-to-consume curated documents. These systems are not meant to replace biocurators, but instead to assist them in one or more literature curation steps. To do so, the user interface is an important aspect that needs to be considered for tool adoption. The BioCreative Interactive task (IAT) is a track designed for exploring user-system interactions, promoting development of useful TM tools, and providing a communication channel between the biocuration and the TM communities. In BioCreative V, the IAT track followed a format similar to previous interactive tracks, where the utility and usability of TM tools, as well as the generation of use cases, have been the focal points. The proposed curation tasks are user-centric and formally evaluated by biocurators. In BioCreative V IAT, seven TM systems and 43 biocurators participated. Two levels of user participation were offered to broaden curator involvement and obtain more feedback on usability aspects. The full level participation involved training on the system, curation of a set of documents with and without TM assistance, tracking of time-on-task, and completion of a user survey. The partial level participation was designed to focus on usability aspects of the interface and not the performance per se. In this case, biocurators navigated the system by performing pre-designed tasks and then were asked whether they were able to achieve the task and the level of difficulty in completing the task. In this manuscript, we describe the development of the interactive task, from planning to execution and discuss major findings for the systems tested.
TL;DR: This article presents a system for disease named entity recognition (DNER) and normalization that achieves an F1-measure of 0.8428 at the mention level and 0.7804 at the concept level on the testing data of the chemical-disease relation task in BioCreative V.
Abstract: The recognition of disease and chemical named entities in scientific articles is a very important subtask in information extraction in the biomedical domain. Due to the diversity and complexity of disease names, the recognition of disease named entities is harder than that of chemical names. Although there are some remarkable chemical named entity recognition systems available online such as ChemSpot and tmChem, publicly available recognition systems for disease named entities are rare. This article presents a system for disease named entity recognition (DNER) and normalization. First, two separate DNER models are developed. One is based on a conditional random fields model with a rule-based post-processing module. The other is based on bidirectional recurrent neural networks. Then the named entities recognized by each of the DNER models are fed into a support vector machine classifier for combining results. Finally, each recognized disease named entity is normalized to a medical subject heading disease name by using a vector space model based method. Experimental results show that using 1000 PubMed abstracts for training, our proposed system achieves an F1-measure of 0.8428 at the mention level and 0.7804 at the concept level on the testing data of the chemical-disease relation task in BioCreative V. Database URL: http://219.223.252.210:8080/SS/cdr.html.
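The final normalization step, mapping a recognized mention to a controlled-vocabulary name with a vector space model, might be sketched as follows. This is an illustration only: it assumes whitespace tokenization, raw term-frequency vectors and cosine similarity, none of which are specified in the abstract, and the vocabulary entries and the `normalize` helper are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumed whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def normalize(mention, vocabulary):
    """Return the vocabulary name most similar to the mention."""
    mention_vector = vectorize(mention)
    return max(vocabulary, key=lambda name: cosine(mention_vector, vectorize(name)))


# Hypothetical controlled-vocabulary names (stand-ins for MeSH disease names).
vocabulary = ["kidney failure", "heart failure", "liver neoplasms"]
best = normalize("acute kidney failure", vocabulary)
print(best)
```

A production system would more likely weight terms (for example with TF-IDF) and map the chosen name to its concept identifier; those details are left out here.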
TL;DR: The new version of CoDNaS includes ∼70% of all available protein structures, and new tools have been added that run sequence searches, display structural flexibility profiles and allow users to browse the database for different structural classes.
Abstract: CoDNaS (conformational diversity of the native state) is a protein conformational diversity database. Conformational diversity describes structural differences between conformers that define the native state of proteins. It is a key concept to understand protein function and biological processes related to protein functions. CoDNaS offers a well curated database that is experimentally driven, thoroughly linked, and annotated. CoDNaS facilitates the extraction of key information on small structural differences based on protein movements. CoDNaS enables users to easily relate the degree of conformational diversity with physical, chemical and biological properties derived from experiments on protein structure and biological characteristics. The new version of CoDNaS includes ∼70% of all available protein structures, and new tools have been added that run sequence searches, display structural flexibility profiles and allow users to browse the database for different structural classes. These tools facilitate the exploration of protein conformational diversity and its role in protein function.
Database URL: http://ufq.unq.edu.ar/codnas
TL;DR: The current version of 2P2Idb contains 27 protein–protein complexes and 274 protein–inhibitor complexes corresponding to 242 unique small molecule inhibitors which represent almost a 5-fold increase compared to the previous version.
Abstract: 2P2Idb is a hand-curated structural database dedicated to protein-protein interactions with known small molecule orthosteric modulators. It compiles the structural information related to orthosteric inhibitors and their target [i.e. related 3D structures available in the RCSB Protein Data Bank (PDB)] and provides links to other useful databases. 2P2Idb includes all interactions for which both the protein-protein and protein-inhibitor complexes have been structurally characterized. Since its first release in 2010, the database has grown constantly and the current version contains 27 protein-protein complexes and 274 protein-inhibitor complexes corresponding to 242 unique small molecule inhibitors which represent almost a 5-fold increase compared to the previous version. New data have been added, including new protein-protein complexes, binding affinities, molecular descriptors, precalculated interface parameters and links to other webservers. A new query tool has been implemented to search for inhibitors within the database using standard molecular descriptors. A novel version of the 2P2I-inspector tool has been implemented to calculate a series of physical and chemical parameters of the protein interfaces. Several geometrical parameters including planarity, eccentricity and circularity have been added as well as customizable distance cutoffs. This tool has also been extended to protein-ligand interfaces. The 2P2I database thus represents a rich source of structural information for scientists interested in the properties of protein-protein interactions and the design of protein-protein interaction modulators. Database URL: http://2p2idb.cnrs-mrs.fr.
TL;DR: The new data standard presented here builds upon existing standards commonly used within the community, extending them with the capability to exchange data on tissue, environmental and DNA samples as well as sequences.
Abstract: Genomic samples of non-model organisms are becoming increasingly important in a broad range of studies from developmental biology, biodiversity analyses, to conservation. Genomic sample definition, description, quality, voucher information and metadata all need to be digitized and disseminated across scientific communities. This information needs to be concise and consistent in today's ever-increasing bioinformatic era, for complementary data aggregators to easily map databases to one another. In order to facilitate exchange of information on genomic samples and their derived data, the Global Genome Biodiversity Network (GGBN) Data Standard is intended to provide a platform based on a documented agreement to promote the efficient sharing and usage of genomic sample material and associated specimen information in a consistent way. The new data standard presented here builds upon existing standards commonly used within the community, extending them with the capability to exchange data on tissue, environmental and DNA samples as well as sequences. The GGBN Data Standard will reveal and democratize the hidden contents of biodiversity biobanks, for the convenience of everyone in the wider biobanking community. Technical tools exist for data providers to easily map their databases to the standard. Database URL: http://terms.tdwg.org/wiki/GGBN_Data_Standard.
TL;DR: The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data, and the honeycomb data model developed for HIVE integrates metadata into an object-oriented model.
Abstract: The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data. This multicomponent cloud infrastructure provides secure web access for authorized users to deposit, retrieve, annotate and compute on NGS data, and to analyse the outcomes using web interface visual environments appropriately built in collaboration with research and regulatory scientists and other end users. Unlike many massively parallel computing environments, HIVE uses a cloud control server which virtualizes services, not processes. It is both very robust and flexible due to the abstraction layer introduced between computational requests and operating system processes. The novel paradigm of moving computations to the data, instead of moving data to computational nodes, has proven to be significantly less taxing for both hardware and network infrastructure. The honeycomb data model developed for HIVE integrates metadata into an object-oriented model. Its distinction from other object-oriented databases is in the additional implementation of a unified application program interface to search, view and manipulate data of all types. This model simplifies the introduction of new data types, thereby minimizing the need for database restructuring and streamlining the development of new integrated information systems. The honeycomb model employs a highly secure hierarchical access control and permission system, allowing determination of data access privileges in a finely granular manner without flooding the security subsystem with a multiplicity of rules. HIVE infrastructure will allow engineers and scientists to perform NGS analysis in a manner that is both efficient and secure. HIVE is actively supported in public and private domains, and project collaborations are welcomed. Database URL: https://hive.biochemistry.gwu.edu.
TL;DR: A machine learning-based system that uses simple yet effective linguistic features with maximum entropy models to extract relations, and uses the hypernym relations between entity concepts derived from the Medical Subject Headings (MeSH) controlled vocabulary to obtain more accurate classification models and better extraction performance.
Abstract: Table 2 compares the performance of relation extraction at intra-sentence level and inter-sentence level, as well as the performance of the final results at document level on the development set using gold entity annotations. ‘LEX’, ‘DEP’ and ‘HF’ denote lexical features, dependency features and the hypernym filtering step mentioned in Section ‘Methods’. When comparing different levels of relations, the approach only using the lexical features is regarded as the baseline.
Table 2. Performance on the development dataset
Note that DEP was unavailable for inter-sentence level, while HF and LEX could be applied to both levels. Post-processing was executed at document level based on the optimal feature combination after the relation merging step, i.e. ‘HF+LEX + DEP’. The table indicates that:
Only using the lexical features, the final performance of F-score was able to reach as high as 55.3%, and the performance at intra-sentence level was much higher than that at inter-sentence level. This suggests that lexical features were simple yet more effective for intra-sentence level than for inter-sentence level. This is probably because the CID relations at inter-sentence level spanned several sentences and thus had much more complex structures that the traditional lexical features could not capture effectively.
Though the performance by dependency features was slightly lower than that by lexical features, its F-score still reached as high as 60.2%. This is probably because of its capability to represent the direct syntactic relationships between different entity mentions in a sentence.
On the basis of lexical or dependency features, hypernym filtering significantly improved the recall for both intra- and inter-sentence levels, leading to the F-scores of 66.1% and 42.3% for two levels, respectively. This indicates that filtering the more general negative instances from the training set caused more true relation instances to be recalled, justifying our hypothesis in the Section ‘Hypernym filtering for training instances’.
Combining HF, LEX and DEP, our system achieved the best performance for relation extraction. After merging the relations from mention level to document level, the F-score reached as high as 59.2%.
After post-processing, the F-score further reached as high as 60.4%. The minor decrease in the recall may be caused by the fact that there were some false annotations for the relations with more general entities.
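The merging and scoring steps described in the points above can be illustrated with a minimal sketch. It assumes that a document-level relation is kept whenever at least one mention-level pair in that document is classified as positive, and that precision, recall and F-score are computed over sets of (document, chemical, disease) triples. The function names and the identifiers in the usage note are hypothetical and are not taken from the system described here.

```python
from typing import Iterable, Set, Tuple

# A document-level relation: (document id, chemical id, disease id).
Relation = Tuple[str, str, str]


def merge_to_document_level(
    mention_predictions: Iterable[Tuple[str, str, str, bool]]
) -> Set[Relation]:
    """Collapse mention-level predictions into document-level relations.

    Assumption: a triple is kept if any mention pair was predicted positive.
    """
    return {
        (doc, chemical, disease)
        for doc, chemical, disease, positive in mention_predictions
        if positive
    }


def precision_recall_f1(
    predicted: Set[Relation], gold: Set[Relation]
) -> Tuple[float, float, float]:
    """Compute precision, recall and balanced F-score over relation sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

For example, with the hypothetical predictions `[("doc1", "CHEM_A", "DIS_X", True), ("doc1", "CHEM_A", "DIS_X", False), ("doc1", "CHEM_B", "DIS_X", True)]`, merging yields two document-level relations, which can then be scored against a gold set of the same triple type.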
To understand why the task is challenging, we have closely examined the errors and grouped the reasons as follows:
■ For intra-sentence level:
Lexical sparsity: Sentences that describe the CID relations using rarely occurring words may not be captured effectively. For instance, in the sentence ‘Fatal haemorrhagic myocarditis secondary to cyclophosphamide therapy.’ (PMID: 11271907), the key clue ‘… secondary to …’ occurs less frequently in the corpus.
The structure of sentence is complicated: If a sentence has a complicated structure, our method may not extract the CID relations correctly. For instance, in the sentence ‘The epidemiologic findings are most consistent with the hypothesis that chronic cocaine use disrupts dopaminergic function and, when coupled with recent cocaine use, may precipitate agitation, delirium, aberrant thermoregulation, rhabdomyolysis, and sudden death.’ (PMID: 8988571), though the relation between ‘cocaine’ (D003042) and ‘sudden death’ (D003645) is true, the token distance is too long and there are conjunction structures between mentions in the sentence.
True relations are neglected in annotation: A close-up analysis on the results shows that some of our false-positive predictions are actually true positives. For instance, in the sentence ‘This increase in aggressiveness was not secondary to METH-induced hyperactivity.’ (PMID: 16192988), the relation between ‘METH’ (D008694) and ‘hyperactivity’ (D006948) was extracted by our system. This relation is not annotated in the document; however, it is actually annotated in the documents of PMID: 15764424 and PMID: 10579464.
Inconsistent annotation: Correlated with the same entity, some relations are annotated while others are not. For instance, in the sentence ‘One patient group developed sinus tachycardias in the setting of a massive carbamazepine overdose.’ (PMID: 1728915), the relation between ‘carbamazepine’ (D002220) and ‘overdose’ (D062787) is not annotated; however, in the sentence ‘The possibility of choreoathetoid movements should be considered in patients presenting after pemoline overdose.’ (PMID: 9022662), the relation between ‘pemoline’ (D010389) and ‘overdose’ (D062787) is annotated.
■ For inter-sentence level:
Discourse inference is needed: This is the most common error type at inter-sentence level. The inter-sentence level relations are expressed spanning multiple sentences, thus discourse inference including co-reference resolution is needed for the relation extraction. For instance, in two sentences ‘Adverse events considered to be related to levofloxacin administration were reported by 29 patients (9%). The most common drug-related adverse events were diarrhea, flatulence, and nausea; most adverse events were mild to moderate in severity.’ The relation between ‘levofloxacin’ (D064704) and ‘flatulence’ (D005414) is true, while the phrase of ‘Adverse events’ is the anchor bridging the two entities.
Inconsistent annotation: Correlated with the same entity, some relationships are annotated while others are not. This problem is similar to that at intra-sentence level.
TL;DR: ‘RE:fine Drugs’ demonstrates the possibilities to identify and prioritize novelty of candidates for drug repurposing based on the theory of transitive Drug–Gene–Disease triads.
Abstract: The process of discovering new drugs has been extremely costly and slow in the last decades despite enormous investment in pharmaceutical research. Drug repurposing enables researchers to speed up the process of discovering other conditions that existing drugs can effectively treat, with low cost and fast FDA approval. Here, we introduce 'RE:fine Drugs', a freely available interactive website for integrated search and discovery of drug repurposing candidates from GWAS and PheWAS repurposing datasets constructed using previously reported methods in Nature Biotechnology. 'RE:fine Drugs' demonstrates the possibilities to identify and prioritize novelty of candidates for drug repurposing based on the theory of transitive Drug-Gene-Disease triads. This public website provides a starting point for research, industry, clinical and regulatory communities to accelerate the investigation and validation of new therapeutic use of old drugs. Database URL: http://drug-repurposing.nationwidechildrens.org.
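The idea of a transitive Drug-Gene-Disease triad can be illustrated with a small sketch: if a drug targets a gene and that gene is associated with a disease, the drug-disease pair becomes a candidate. The sketch below is only an illustration under stated assumptions and is not the method behind 'RE:fine Drugs'. In particular, the novelty flag (pair absent from a list of known indications), the ranking by number of linking genes, and all names and identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Candidate = Tuple[str, str, List[str], bool]  # (drug, disease, linking genes, is_novel)


def repurposing_candidates(
    drug_gene: Iterable[Tuple[str, str]],
    gene_disease: Iterable[Tuple[str, str]],
    known_indications: Set[Tuple[str, str]],
) -> List[Candidate]:
    """Join drug-gene and gene-disease pairs into drug-disease candidates."""
    diseases_by_gene: Dict[str, Set[str]] = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    # (drug, disease) -> genes that close the triad
    support: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for drug, gene in drug_gene:
        for disease in diseases_by_gene.get(gene, ()):
            support[(drug, disease)].add(gene)

    rows = [
        (drug, disease, sorted(genes), (drug, disease) not in known_indications)
        for (drug, disease), genes in support.items()
    ]
    # Assumed ranking: novel pairs first, then more linking genes, then by name.
    rows.sort(key=lambda row: (not row[3], -len(row[2]), row[0], row[1]))
    return rows
```

Calling the function with a handful of hypothetical pairs returns one row per drug-disease candidate, with candidates that are already known indications pushed to the end of the list.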
TL;DR: The human epigenetics drug database (HEDD) focuses on the storage and integration of epigenetic drug datasets obtained from laboratory experiments and manually curated information and incorporates five kinds of datasets.
Abstract: Epigenetic drugs are chemical compounds that target disordered post-translational modification of histone proteins and DNA through enzymes, and the recognition of these changes by adaptor proteins. Epigenetic drug-related experimental data such as gene expression probed by high-throughput sequencing, co-crystal structure probed by X-ray diffraction and binding constants probed by bio-assay have become widely available. The mining and integration of multiple kinds of data can be beneficial to drug discovery and drug repurposing. HEMD and other epigenetic databases store epigenetic data comprehensively, and users can obtain partial information on epigenetic drugs from them. However, some data types such as high-throughput datasets are not provided by these databases and they do not support flexible queries for epigenetic drug-related experimental data. Therefore, in reference to HEMD and other epigenetic databases, we developed a relatively comprehensive database for human epigenetic drugs. The human epigenetic drug database (HEDD) focuses on the storage and integration of epigenetic drug datasets obtained from laboratory experiments and manually curated information. The latest release of HEDD incorporates five kinds of datasets: (i) drug, (ii) target, (iii) disease, (iv) high-throughput and (v) complex. In order to facilitate data extraction, flexible search options were built in HEDD, which allow an unlimited condition query for specific kinds of datasets using drug names, diseases and experiment types. Database URL: http://hedds.org/.
TL;DR: An overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work and gives a special focus to resources that are interoperable.
Abstract: Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability.
TL;DR: An interactive annotation tool, EXTRACT, helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples; it speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.
Abstract: The microbial and molecular ecology research communities have made substantial progress on developing standards for annotating samples with environment metadata. However, manual sample annotation is a highly labor-intensive process and requires familiarity with the terminologies used. We have therefore developed an interactive annotation tool, EXTRACT, which helps curators identify and extract standard-compliant terms for annotation of metagenomic records and other samples. Behind its web-based user interface, the system combines published methods for named entity recognition of environment, organism, tissue and disease terms. The evaluators in the BioCreative V Interactive Annotation Task found the system to be intuitive, useful, well documented and sufficiently accurate to be helpful in spotting relevant text passages and extracting organism and environment terms. Comparison of fully manual and text-mining-assisted curation revealed that EXTRACT speeds up annotation by 15–25% and helps curators to detect terms that would otherwise have been missed.
Database URL: https://extract.hcmr.gr/
TL;DR: It is argued that text-mining technologies have become essential tools in real-world biomedical research, and the authors call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.
Abstract: Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions including (i) the 'scalability' issue due to the increasing need of mining information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators.
TL;DR: A Biomedical entity Relation ONcology COrpus (BRONCO) is introduced that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research and quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods.
Abstract: Comprehensive knowledge of genomic variants in a biological context is key for precision medicine. As next-generation sequencing technologies improve, the amount of literature containing genomic variant data, such as new functions or related phenotypes, rapidly increases. Because numerous articles are published every day, it is almost impossible to manually curate all the variant information from the literature. Many researchers focus on creating an improved automated biomedical natural language processing (BioNLP) method that extracts useful variants and their functional information from the literature. However, there is no gold-standard data set that contains texts annotated with variants and their related functions. To overcome these limitations, we introduce a Biomedical entity Relation ONcology COrpus (BRONCO) that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research. The variants and their relations were manually extracted from 108 full-text articles. BRONCO can be utilized to evaluate and train new methods used for extracting biomedical entity relations from full-text publications, and thus be a valuable resource to the biomedical text mining research community. Using BRONCO, we quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods. We also identified their shortcomings, and suggested remedies for each method. We implemented post-processing modules for the three BioNLP methods, which improved their performance.
Database URL: http://infos.korea.ac.kr/bronco
TL;DR: This work developed and deployed a suite of openly accessible software tools in order to collate a full set of trait descriptions and extract two key traits, body length and mass, from >18 million specimen records in VertNet, a global biodiversity data publisher and aggregator.
Abstract: For vast areas of the globe and large parts of the tree of life, data needed to inform trait diversity is incomplete. Such trait data, when fully assembled, however, form the link between the evolutionary history of organisms, their assembly into communities, and the nature and functioning of ecosystems. Recent efforts to close data gaps have focused on collating trait-by-species databases, which only provide species-level, aggregated value ranges for traits of interest and often lack the direct observations on which those ranges are based. Perhaps under-appreciated is that digitized biocollection records collectively contain a vast trove of trait data measured directly from individuals, but this content remains hidden and highly heterogeneous, impeding discoverability and use. We developed and deployed a suite of openly accessible software tools in order to collate a full set of trait descriptions and extract two key traits, body length and mass, from >18 million specimen records in VertNet, a global biodiversity data publisher and aggregator. We tested success rate of these tools against hand-checked validation data sets and characterized quality and quantity. A post-processing toolkit was developed to standardize and harmonize data sets, and to integrate this improved content into VertNet for broadest reuse. The result of this work was to add more than 1.5 million harmonized measurements on vertebrate body mass and length directly to specimen records. Rates of false positives and negatives for extracted data were extremely low. We also created new tools for filtering, querying, and assembling this research-ready vertebrate trait content for view and download. Our work has yielded a novel database and platform for harmonized trait content that will grow as tools introduced here become part of publication workflows. 
We close by noting how this effort extends to new communities already developing similar digitized content. Database URL: http://portal.vertnet.org/search?advanced=1.
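To make the extraction task concrete, the following is a minimal sketch of pulling body mass and length out of free-text specimen notes and harmonizing them to common units. It is not the software described in this abstract. The keyword list, the regular expressions, the unit table and the function name are all assumptions chosen for illustration, and real specimen records are far more heterogeneous than these patterns handle.

```python
import re
from typing import Dict, Optional

# Assumed conversion tables: harmonize mass to grams and length to millimetres.
TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}
TO_MM = {"mm": 1.0, "cm": 10.0, "m": 1000.0}

NUMBER = r"(\d+(?:\.\d+)?)"
# Hypothetical keyword patterns; up to three separator characters may follow the keyword.
MASS_RE = re.compile(
    r"\b(?:body\s+mass|mass|weight|wt)\b\W{0,3}" + NUMBER + r"\s*(kg|mg|g)\b",
    re.IGNORECASE,
)
LENGTH_RE = re.compile(
    r"\b(?:total\s+length|body\s+length|length)\b\W{0,3}" + NUMBER + r"\s*(mm|cm|m)\b",
    re.IGNORECASE,
)


def extract_traits(text: str) -> Dict[str, Optional[float]]:
    """Return the first mass (g) and length (mm) found in the text, or None."""
    result: Dict[str, Optional[float]] = {"mass_g": None, "length_mm": None}
    mass = MASS_RE.search(text)
    if mass:
        result["mass_g"] = float(mass.group(1)) * TO_GRAMS[mass.group(2).lower()]
    length = LENGTH_RE.search(text)
    if length:
        result["length_mm"] = float(length.group(1)) * TO_MM[length.group(2).lower()]
    return result
```

A note such as `"weight: 23.5 g; total length = 12 cm"` would be reduced to a mass in grams and a length in millimetres. Records that match neither pattern come back with both fields set to `None`, so they can be routed to manual checking.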
TL;DR: SlideBase is a web tool which offers a new way of selecting genes, promoters, enhancers and microRNAs that are preferentially expressed/used in a specified set of cells/tissues, based on the use of interactive sliders.
Abstract: Genomics consortia have produced large datasets profiling the expression of genes, microRNAs, enhancers and more across human tissues or cells. There is a need for intuitive tools to select the subsets of such data that are most relevant for specific studies. To this end, we present SlideBase, a web tool which offers a new way of selecting genes, promoters, enhancers and microRNAs that are preferentially expressed/used in a specified set of cells/tissues, based on the use of interactive sliders. With the help of sliders, SlideBase enables users to define custom expression thresholds for individual cell types/tissues, producing sets of genes, enhancers etc. which satisfy these constraints. Changes in slider settings result in simultaneous changes in the selected sets, updated in real time. SlideBase is linked to major databases from genomics consortia, including FANTOM, GTEx, The Human Protein Atlas and BioGPS. Database URL: http://slidebase.binf.ku.dk.
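The slider-based selection can be thought of as a set of per-tissue range constraints applied to an expression table. The sketch below illustrates that idea only and is not SlideBase's implementation. It assumes that each slider is a (minimum, maximum) pair on an expression value, that a feature missing from a tissue counts as zero, and that the function and gene names are hypothetical.

```python
from typing import Dict, List, Tuple

Expression = Dict[str, Dict[str, float]]   # feature -> tissue -> expression value
Sliders = Dict[str, Tuple[float, float]]   # tissue -> (minimum, maximum)


def select_features(expression: Expression, sliders: Sliders) -> List[str]:
    """Return features whose expression lies within every slider range."""
    selected = []
    for feature, by_tissue in expression.items():
        # Assumption: a tissue with no measurement is treated as 0.0.
        if all(
            low <= by_tissue.get(tissue, 0.0) <= high
            for tissue, (low, high) in sliders.items()
        ):
            selected.append(feature)
    return sorted(selected)
```

Re-running `select_features` after changing one slider range mimics the behaviour described above, where the selected set changes whenever a slider is moved. Tissues without a slider place no constraint on the result.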
TL;DR: DOMMINO 2.0 is a structural Database Of Macro-Molecular INteractiOns that covers interactions between proteins, DNA and RNA molecules.
Abstract: Macromolecular interactions are formed between proteins, DNA and RNA molecules. Being a principal building block in macromolecular assemblies and pathways, the interactions underlie most cellular functions. Malfunctioning of macromolecular interactions is also linked to a number of diseases. Structural knowledge of the macromolecular interaction allows one to understand the interaction’s mechanism, determine its functional implications and characterize the effects of genetic variations, such as single nucleotide polymorphisms, on the interaction. Unfortunately, until now the interactions mediated by different types of macromolecules, e.g. protein–protein interactions or protein–DNA interactions, have been collected into individual and unrelated structural databases. This presents a significant obstacle in the analysis of macromolecular interactions. For instance, the homogeneous structural interaction databases prevent scientists from studying structural interactions of different types but occurring in the same macromolecular complex. Here, we introduce DOMMINO 2.0, a structural Database Of Macro-Molecular INteractiOns. Compared to DOMMINO 1.0, a comprehensive database on protein-protein interactions, DOMMINO 2.0 includes the interactions between all three basic types of macromolecules extracted from PDB files. DOMMINO 2.0 is automatically updated on a weekly basis. It currently includes ∼1 040 000 interactions between two polypeptide subunits (e.g. domains, peptides, termini and interdomain linkers), ∼43 000 RNA-mediated interactions, and ∼12 000 DNA-mediated interactions. All protein structures in the database are annotated using SCOP and SUPERFAMILY family annotation. As a result, protein-mediated interactions involving protein domains, interdomain linkers, C- and N- termini, and peptides are identified.
Our database provides an intuitive web interface, allowing one to investigate interactions at three different resolution levels: whole subunit network, binary interaction and interaction interface.
Database URL: http://dommino.org
TL;DR: PIPE, an interaction pattern generation module used in the Collaborative Biocurator Assistant Task at BioCreative V to capture frequent protein-protein interaction (PPI) patterns within text, is proposed; the resulting method outperforms several well-known PPI extraction methods.
Abstract: Identifying the interactions between proteins mentioned in the biomedical literature is one of the frequently discussed topics of text mining in the life science field. In this article, we propose PIPE, an interaction pattern generation module used in the Collaborative Biocurator Assistant Task at BioCreative V (http://www.biocreative.org/) to capture frequent protein-protein interaction (PPI) patterns within text. We also present an interaction pattern tree (IPT) kernel method that integrates the PPI patterns with convolution tree kernel (CTK) to extract PPIs. Methods were evaluated on LLL, IEPA, HPRD50, AIMed and BioInfer corpora using cross-validation, cross-learning and cross-corpus evaluation. Empirical evaluations demonstrate that our method is effective and outperforms several well-known PPI extraction methods.
Database URL: