Top 205 papers published in the topic of Biological data in 2014

Showing papers on "Biological data" published in 2014

Journal Article•10.1177/0748730414553029•

Detecting rhythms in time series with RAIN.


Paul Florian Thaben¹, Pål O. Westermark¹•Institutions (1)

17 Oct 2014-Journal of Biological Rhythms

TL;DR: RAIN is a robust nonparametric method for detecting rhythms of prespecified periods in biological data that can detect arbitrary wave forms; its increased detection power significantly expanded the sets of transcripts and proteins found to have rhythmic abundances.


Abstract: A fundamental problem in research on biological rhythms is that of detecting and assessing the significance of rhythms in large sets of data. Classic methods based on Fourier theory are often hampered by the complex and unpredictable characteristics of experimental and biological noise. Robust nonparametric methods are available but are limited to specific wave forms. We present RAIN, a robust nonparametric method for the detection of rhythms of prespecified periods in biological data that can detect arbitrary wave forms. When applied to measurements of the circadian transcriptome and proteome of mouse liver, the sets of transcripts and proteins with rhythmic abundances were significantly expanded due to the increased detection power, when we controlled for false discovery. Validation against independent data confirmed the quality of these results. The large expansion of the circadian mouse liver transcriptomes and proteomes reflected the prevalence of nonsymmetric wave forms and led to new conclusions about function. RAIN was implemented as a freely available software package for R/Bioconductor and is presently also available as a web interface.


671 citations

Journal Article•10.1186/1471-2105-15-110•

On finding bicliques in bipartite graphs: a novel algorithm and its application to the integration of diverse biological data types


Yun Zhang¹, Charles A. Phillips², Gary L. Rogers², Erich J. Baker³, Elissa J. Chesler, Michael A. Langston²•Institutions (3)

DuPont Pioneer¹, University of Tennessee², Baylor University³

15 Apr 2014

TL;DR: A powerful new algorithm is described that produces all maximal bicliques in a bipartite graph; it is streamlined, efficient, and particularly well-suited to the study of huge and diverse biological data.


Abstract: Integrating and analyzing heterogeneous genome-scale data is a huge algorithmic challenge for modern systems biology. Bipartite graphs can be useful for representing relationships across pairs of disparate data types, with the interpretation of these relationships accomplished through an enumeration of maximal bicliques. Most previously-known techniques are generally ill-suited to this foundational task, because they are relatively inefficient and without effective scaling. In this paper, a powerful new algorithm is described that produces all maximal bicliques in a bipartite graph. Unlike most previous approaches, the new method neither places undue restrictions on its input nor inflates the problem size. Efficiency is achieved through an innovative exploitation of bipartite graph structure, and through computational reductions that rapidly eliminate non-maximal candidates from the search space. An iterative selection of vertices for consideration based on non-decreasing common neighborhood sizes boosts efficiency and leads to more balanced recursion trees. The new technique is implemented and compared to previously published approaches from graph theory and data mining. Formal time and space bounds are derived. Experiments are performed on both random graphs and graphs constructed from functional genomics data. It is shown that the new method substantially outperforms the best previous alternatives. The new method is streamlined, efficient, and particularly well-suited to the study of huge and diverse biological data. A robust implementation has been incorporated into GeneWeaver, an online tool for integrating and analyzing functional genomics experiments, available at http://geneweaver.org . The enormous increase in scalability it provides empowers users to study complex and previously unassailable gene-set associations between genes and their biological functions in a hierarchical fashion and on a genome-wide scale. 
This practical computational resource is adaptable to almost any applications environment in which bipartite graphs can be used to model relationships between pairs of heterogeneous entities.


208 citations
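
For readers new to the term used in the entry above: a maximal biclique is a pair of vertex sets, one from each side of a bipartite graph, in which every left vertex is joined to every right vertex and neither side can be enlarged. The following is a minimal brute-force sketch in Python. It is not the algorithm from the paper, it runs in exponential time, and the function name and the gene/phenotype labels are hypothetical illustrations.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a small bipartite graph by brute force.

    adj maps each left vertex to the set of right vertices it is joined to.
    Returns a set of (frozenset(left), frozenset(right)) pairs, both non-empty.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right side: neighbours shared by every vertex in the subset.
            common = set.intersection(*(set(adj[u]) for u in subset))
            if not common:
                continue
            # Left side: every left vertex joined to all of `common`.
            closure = frozenset(u for u in left if common <= set(adj[u]))
            found.add((closure, frozenset(common)))
    return found


# Hypothetical toy graph: genes on the left, phenotypes on the right.
toy = {
    "g1": {"p1", "p2"},
    "g2": {"p1", "p2", "p3"},
    "g3": {"p3"},
}
for genes, phenotypes in sorted(maximal_bicliques(toy), key=lambda b: sorted(b[0])):
    print(sorted(genes), sorted(phenotypes))
```

Each candidate is closed on both sides before it is recorded, so duplicates collapse in the result set and only maximal pairs remain. A practical enumerator would prune the search rather than visit every subset.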

Journal Article•10.1186/1471-2105-15-S2-S2•

On the selection of appropriate distances for gene expression data clustering


Pablo A. Jaskowiak¹, Ricardo J. G. B. Campello¹, Ivan G. Costa²,³•Institutions (3)

University of São Paulo¹, Federal University of Pernambuco², RWTH Aachen University³

24 Jan 2014-BMC Bioinformatics

TL;DR: This work analyzes how different distances and clustering methods interact regarding their ability to cluster gene expression data, i.e., microarray data, and supports that the selection of an appropriate distance depends on the scenario in hand.


Abstract: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario in hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.


166 citations
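
The point made in the entry above, that the choice of distance can change which expression profiles look similar, can be seen on a toy example. The sketch below compares two widely used measures, Euclidean distance and a Pearson-correlation-based distance. The abstract does not list the 15 distances studied, so treating these two as representative is an assumption, and the gene profiles are made-up values.

```python
import math


def euclidean(x, y):
    """Straight-line distance; sensitive to the magnitude of expression."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation; 0 for perfectly correlated profiles, 2 for anti-correlated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Made-up expression profiles over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar magnitude, opposite trend

for name, dist in (("euclidean", euclidean), ("pearson", pearson_distance)):
    nearest = "gene_b" if dist(gene_a, gene_b) < dist(gene_a, gene_c) else "gene_c"
    print(f"{name}: nearest profile to gene_a is {nearest}")
```

Under Euclidean distance gene_a sits closest to the anti-correlated gene_c, while the correlation-based distance pairs it with gene_b, so any clustering method fed these distances would group the genes differently.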

Journal Article•10.1371/JOURNAL.PONE.0097079•

Ensemble Positive Unlabeled Learning for Disease Gene Identification


Peng Yang¹, Xiaoli Li¹, Hon Nian Chua¹, Chee Keong Kwoh², See-Kiong Ng¹•Institutions (2)

Agency for Science, Technology and Research¹, Nanyang Technological University²

09 May 2014-PLOS ONE

TL;DR: An effective PU learning framework is proposed that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification; it achieves significantly better results than various state-of-the-art prediction methods as well as ensemble learning classifiers.


Abstract: An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data and a single machine learning predictor prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. 
In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.


127 citations

Journal Article•10.1111/COBI.12257•

Integrating biological and social values when prioritizing places for biodiversity conservation


Amy L. Whitehead¹, Heini Kujala¹, Christopher D. Ives², Ascelin Gordon², Pia E. Lentini¹, Brendan A. Wintle¹, Emily Nicholson¹, Christopher M. Raymond³,⁴•Institutions (4)

University of Melbourne¹, RMIT University², University of Stirling³, University of South Australia⁴

01 Aug 2014-Conservation Biology

TL;DR: An approach is presented that incorporates quantitative data on social values for conservation and social preferences for development into spatial conservation planning; it can identify areas of the landscape where synergies and conflicts between different value sets are likely to occur.


Abstract: The consideration of information on social values in conjunction with biological data is critical for achieving both socially acceptable and scientifically defensible conservation planning outcomes. However, the influence of social values on spatial conservation priorities has received limited attention and is poorly understood. We present an approach that incorporates quantitative data on social values for conservation and social preferences for development into spatial conservation planning. We undertook a public participation GIS survey to spatially represent social values and development preferences and used species distribution models for 7 threatened fauna species to represent biological values. These spatially explicit data were simultaneously included in the conservation planning software Zonation to examine how conservation priorities changed with the inclusion of social data. Integrating spatially explicit information about social values and development preferences with biological data produced prioritizations that differed spatially from the solution based on only biological data. However, the integrated solutions protected a similar proportion of the species' distributions, indicating that Zonation effectively combined the biological and social data to produce socially feasible conservation solutions of approximately equivalent biological value. We were able to identify areas of the landscape where synergies and conflicts between different value sets are likely to occur. Identification of these synergies and conflicts will allow decision makers to target communication strategies to specific areas and ensure effective community engagement and positive conservation outcomes.


126 citations

Journal Article•10.1016/J.GPB.2014.10.001•

Big Biological Data: Challenges and Opportunities


Yixue Li¹, Luonan Chen¹•Institutions (1)

Chinese Academy of Sciences¹

01 Oct 2014-Genomics, Proteomics & Bioinformatics

TL;DR: Several aspects of big biological data are described, pointing to the central roles of bioinformatics and bioinformaticians in future research in the biological and biomedical fields.


115 citations

Journal Article•10.1186/1471-2105-15-304•

Predicting disease associations via biological network analysis


Kai Sun¹, Joana P. Gonçalves¹, Chris Larminie², Nataša Pržulj¹•Institutions (2)

Imperial College London¹, GlaxoSmithKline²

17 Sep 2014-BMC Bioinformatics

TL;DR: Three similarity measures for predicting disease associations are presented and the strong correlation between these predictions and known disease associations demonstrates the ability of these measures to provide novel insights into disease relationships.


Abstract: Background: Understanding the relationship between diseases based on the underlying biological mechanisms is one of the greatest challenges in modern biology and medicine. Exploring disease-disease associations by using system-level biological data is expected to improve our current knowledge of disease relationships, which may lead to further improvements in disease diagnosis, prognosis and treatment.


111 citations

Journal Article•10.1093/NAR/GKU201•

Bi-Force: large-scale bicluster editing and its application to gene expression data biclustering


Peng Sun¹, Nora K Speicher¹, Richard Röttger¹,², Jiong Guo¹, Jan Baumbach²,³•Institutions (3)

Saarland University¹, Max Planck Society², University of Southern Denmark³

14 May 2014-Nucleic Acids Research

TL;DR: The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering, and Bi-Force outperformed existing tools, at least when following the evaluation protocols from Eren et al.


Abstract: The explosion of the biological data has dramatically reformed today’s biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as ‘simultaneous clustering’ or ‘co-clustering’, has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: ‘Bi-Force’. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279–292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.


107 citations
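
The "weighted bicluster editing model" named in the entry above can be illustrated by scoring a proposed grouping. In bicluster editing, a bipartite graph is turned into a disjoint union of bicliques by inserting and deleting edges at minimum total cost. The sketch below only evaluates the cost of a given grouping; it does not search for the best one and is not the Bi-Force heuristic. The cost rule used here (an edge exists when similarity exceeds a threshold, and each edit costs the absolute difference from the threshold) is an assumption about the weighted model, and all names and values are hypothetical.

```python
def editing_cost(sim, threshold, clusters):
    """Cost of turning the thresholded similarity graph into the given biclusters.

    sim: dict mapping (row, column) pairs to similarity scores.
    clusters: list of (row_set, column_set); every node appears in exactly one.
    Pairs absent from `sim` are ignored in this sketch.
    """
    cluster_of = {}
    for idx, (rows, cols) in enumerate(clusters):
        for node in rows | cols:
            cluster_of[node] = idx
    cost = 0.0
    for (row, col), score in sim.items():
        together = cluster_of[row] == cluster_of[col]
        has_edge = score > threshold
        # Pay for edges missing inside a bicluster and edges crossing biclusters.
        if together != has_edge:
            cost += abs(score - threshold)
    return cost


# Hypothetical similarities between three genes and two conditions.
sim = {
    ("g1", "c1"): 0.9, ("g1", "c2"): 0.8,
    ("g2", "c1"): 0.7, ("g2", "c2"): 0.2,
    ("g3", "c1"): 0.1, ("g3", "c2"): 0.6,
}
split = [({"g1", "g2"}, {"c1", "c2"}), ({"g3"}, set())]
merged = [({"g1", "g2", "g3"}, {"c1", "c2"})]
print(editing_cost(sim, 0.5, split), editing_cost(sim, 0.5, merged))
```

On this toy input the split grouping is cheaper than merging everything, because it needs one inserted edge (g2, c2) and one deleted edge (g3, c2), both close to the threshold.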

Journal Article•10.1093/NAR/GKU337•

DINIES: drug–target interaction network inference engine based on supervised analysis


Yoshihiro Yamanishi¹, Masaaki Kotera², Yuki Moriya³, Ryusuke Sawada¹, Minoru Kanehisa³, Susumu Goto³•Institutions (3)

Kyushu University¹, Tokyo Institute of Technology², Kyoto University³

01 Jul 2014-Nucleic Acids Research

TL;DR: The originality of DINIES lies in prediction with state-of-the-art machine learning methods, in the integration of heterogeneous biological data and in compatibility with the KEGG database.


Abstract: DINIES (drug-target interaction network inference engine based on supervised analysis) is a web server for predicting unknown drug-target interaction networks from various types of biological data (e.g. chemical structures, drug side effects, amino acid sequences and protein domains) in the framework of supervised network inference. The originality of DINIES lies in prediction with state-of-the-art machine learning methods, in the integration of heterogeneous biological data and in compatibility with the KEGG database. The DINIES server accepts any 'profiles' or precalculated similarity matrices (or 'kernels') of drugs and target proteins in tab-delimited file format. When a training data set is submitted to learn a predictive model, users can select either known interaction information in the KEGG DRUG database or their own interaction data. The user can also select an algorithm for supervised network inference, select various parameters in the method and specify weights for heterogeneous data integration. The server can provide integrative analyses with useful components in KEGG, such as biological pathways, functional hierarchy and human diseases. DINIES (http://www.genome.jp/tools/dinies/) is publicly available as one of the genome analysis tools in GenomeNet.


104 citations

Journal Article•10.15252/MSB.20145108•

An integrative, multi‐scale, genome‐wide model reveals the phenotypic landscape of Escherichia coli


Javier Carrera¹, Raissa Estrela², Jing Luo¹, Navneet Rai¹, Athanasios Tsoukalas¹, Ilias Tagkopoulos¹•Institutions (2)

University of California, Davis¹, University of California, Berkeley²

01 Jul 2014-Molecular Systems Biology

TL;DR: This work presents an integrative modeling methodology that unifies under a common framework the various biological processes and their interactions across multiple layers and paves the way toward integrative techniques that extract knowledge from a variety of biological data to achieve more than the sum of their parts in the context of prediction, analysis, and redesign of biological systems.


Abstract: Given the vast behavioral repertoire and biological complexity of even the simplest organisms, accurately predicting phenotypes in novel environments and unveiling their biological organization is a challenging endeavor. Here, we present an integrative modeling methodology that unifies under a common framework the various biological processes and their interactions across multiple layers. We trained this methodology on an extensive normalized compendium for the gram-negative bacterium Escherichia coli, which incorporates gene expression data for genetic and environmental perturbations, transcriptional regulation, signal transduction, and metabolic pathways, as well as growth measurements. Comparison with measured growth and high-throughput data demonstrates the enhanced ability of the integrative model to predict phenotypic outcomes in various environmental and genetic conditions, even in cases where their underlying functions are under-represented in the training set. This work paves the way toward integrative techniques that extract knowledge from a variety of biological data to achieve more than the sum of their parts in the context of prediction, analysis, and redesign of biological systems.


90 citations

Journal Article•10.4172/2153-0602.1000158•

Use of Bioinformatics Tools in Different Spheres of Life Sciences


Muhammad Aamer Mehmood, Ujala Sehar, Niaz Ahmad

01 Jan 2014-Journal of Data Mining in Genomics & Proteomics

TL;DR: This review focuses on areas of biological research that can be greatly assisted by bioinformatics tools, such as analysing DNA and protein sequences to identify various features, predicting the 3D structure of protein molecules, studying molecular interactions, and performing simulations that mimic a biological phenomenon, in order to extract useful information from biological data.


Abstract: Scientific knowledge has never been produced and shared as fast as it is today. Different areas of science are converging to give rise to new disciplines. Bioinformatics is one such newly emerging field, which uses computing, mathematics and statistics in molecular biology to archive, retrieve, and analyse biological data. Although still in its infancy, it has become one of the fastest growing fields and has quickly established itself as an integral component of any biological research activity. It is gaining popularity because of its ability to analyse huge amounts of biological data quickly and cost-effectively. Bioinformatics can help a biologist extract valuable information from biological data by providing various web- and/or computer-based tools, the majority of which are freely available. The present review gives a comprehensive summary of some of the tools available to a life scientist for analysing biological data. The review focuses exclusively on those areas of biological research that can be greatly assisted by such tools, such as analysing DNA and protein sequences to identify various features, predicting the 3D structure of protein molecules, studying molecular interactions, and performing simulations that mimic a biological phenomenon, in order to extract useful information from biological data.


Book Chapter•10.1007/978-1-62703-748-8_4•

Introduction to bioinformatics.


Tolga Can¹•Institutions (1)

Middle East Technical University¹

01 Jan 2014-Methods of Molecular Biology

TL;DR: This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources.


Abstract: Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: Collect statistics from biological data. Build a computational model. Solve a computational modeling problem. Test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data is usually represented as matrices and analysis of microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as construction and analysis of large-scale networks.


Book Chapter•10.1007/978-3-662-45006-2_9•

Exploratory Data Analysis


Janine Vierheller¹•Institutions (1)

University of Potsdam¹

01 Jan 2014

TL;DR: The goal of the workflow is to automate exploratory data analysis while still guaranteeing flexibility.


Abstract: In bioinformatics, the term exploratory data analysis refers to different methods for getting an overview of large biological data sets. Hence, it helps to create a framework for further analysis and hypothesis testing. The workflow facilitates this first important step in the analysis of data created by high-throughput technologies. The results are different plots showing the structure of the measurements. The goal of the workflow is to automate exploratory data analysis while still guaranteeing flexibility. The basic tool is the free software R.


Journal Article•10.1186/1471-2164-15-S8-S3•

High dimensional biological data retrieval optimization with NoSQL technology


Shicai Wang¹, Ioannis Pandis¹, Chao Wu¹, Sijin He¹, David Johnson¹, Ibrahim Emam¹, Florian Guitton¹, Yike Guo¹•Institutions (1)

Imperial College London¹

13 Nov 2014-BMC Genomics

TL;DR: A new key-value pair data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance is introduced and used as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.


Abstract: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. 
We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
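
To make the key-value idea in the abstract above concrete, the sketch below shows how a composite row key can keep one patient's expression values contiguous in a table sorted by key, so that they can be read with a single prefix scan. HBase does store rows sorted by row key, but everything else here is a hypothetical illustration: the class is a toy in-memory stand-in rather than the HBase API, and the key layout (patient first, then probe) is an assumption, not the schema from the paper.

```python
from bisect import bisect_left


class SortedKeyValueStore:
    """Toy in-memory table that keeps keys sorted (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            # Insert the new key at its sorted position.
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


def row_key(patient_id, probe_id):
    # Assumed layout: patient first, so one patient's probes sit next to each other.
    return f"{patient_id}|{probe_id}"


store = SortedKeyValueStore()
store.put(row_key("P002", "probeA"), 5.6)
store.put(row_key("P001", "probeB"), 3.4)
store.put(row_key("P001", "probeA"), 1.2)

# All expression values for patient P001 in one contiguous scan.
print(list(store.scan_prefix("P001|")))
```

Including the separator in the scan prefix keeps a patient such as `P0010` from matching a scan for `P001`. Which field leads the key decides which queries are cheap, so a probe-first layout would suit per-gene queries instead.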


Journal Article•10.1186/1471-2105-15-S13-S4•

Feature selection and classifier performance on diverse biological datasets


Edward Hemphill, James Lindsay, Chih Lee, Ion I. Mandoiu, Craig E. Nelson

13 Nov 2014-BMC Bioinformatics

TL;DR: It is observed that one class of feature selection and classifier are consistently top performers across data types and number of markers, suggesting that well performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.


Abstract: Background: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore the model’s performance across multiple biological data types. Results: This paper presents the results of a large scale empirical study whereby a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. Conclusions: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases, the best performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. It is observed that one class of feature selection and classifier are consistently top performers across data types and number of markers, suggesting that well performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.


Journal Article•10.1093/BIOINFORMATICS/BTU472•

The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective.


Yuxiang Jiang¹, Wyatt T. Clark¹, Iddo Friedberg¹, Predrag Radivojac¹•Institutions (1)

Miami University¹

01 Sep 2014-Bioinformatics

TL;DR: The results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments and suggest that current large-scale evaluations are meaningful and almost surprisingly reliable.


Abstract: Motivation: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. Results: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article•10.1093/BIOINFORMATICS/BTT643•

Bayesian network prior: network analysis of biological data using external knowledge

Senol Isci¹, Haluk Dogan¹, Cengizhan Ozturk¹, Hasan H. Otu¹•Institutions (1)

Istanbul Bilgi University¹

15 Mar 2014-Bioinformatics

TL;DR: This work proposes a framework where GI networks are learned from experimental data using Bayesian networks (BNs) and the incorporation of external knowledge is also done via a BN that is called Bayesian Network Prior (BNP).

Abstract: Motivation: Reverse engineering GI networks from experimental data is a challenging task due to the complex nature of the networks and the noise inherent in the data. One way to overcome these hurdles would be incorporating the vast amounts of external biological knowledge when building interaction networks. We propose a framework where GI networks are learned from experimental data using Bayesian networks (BNs) and the incorporation of external knowledge is also done via a BN that we call Bayesian Network Prior (BNP). BNP depicts the relation between various evidence types that contribute to the event ‘gene interaction’ and is used to calculate the probability of a candidate graph (G) in the structure learning process. Results: Our simulation results on synthetic, simulated and real biological data show that the proposed approach can identify the underlying interaction network with high accuracy even when the prior information is distorted and outperforms existing methods. Availability: Accompanying BNP software package is freely available for academic use at http://bioe.bilgi.edu.tr/BNP. Contact: hasan.otu@bilgi.edu.tr Supplementary Information: Supplementary data are available at Bioinformatics online.

Journal Article•10.1021/AC402869B•

Integrated analysis of seaweed components during seasonal fluctuation by data mining across heterogeneous chemical measurements with network visualization.

Kengo Ito¹, Kenji Sakata, Yasuhiro Date¹, Jun Kikuchi•Institutions (1)

Yokohama City University¹

08 Jan 2014-Analytical Chemistry

TL;DR: A method of data preprocessing that can perform comprehensively integrated analysis based on a variety of multimeasurement of organic and inorganic chemical data from Sargassum fusiforme is described to explore the concealed biological information by statistical analyses with integrated data.

Abstract: Biological information is intricately intertwined with several factors. Therefore, comprehensive analytical methods such as integrated data analysis, combining several data measurements, are required. In this study, we describe a method of data preprocessing that can perform comprehensively integrated analysis based on a variety of multimeasurement of organic and inorganic chemical data from Sargassum fusiforme and explore the concealed biological information by statistical analyses with integrated data. Chemical components including polar and semipolar metabolites, minerals, major elemental and isotopic ratio, and thermal decompositional data were measured as environmentally responsive biological data in the seasonal variation. The obtained spectral data of complex chemical components were preprocessed to isolate pure peaks by removing noise and separating overlapping signals using the multivariate curve resolution alternating least-squares method before integrated analyses. By the input of these preprocessed multimeasurement chemical data, principal component analysis and self-organizing maps of integrated data showed changes in the chemical compositions during the mature stage and identified trends in seasonal variation. Correlation network analysis revealed multiple relationships between organic and inorganic components. Moreover, in terms of the relationship between metal group and metabolites, the results of structural equation modeling suggest that the structure of alginic acid changes during the growth of S. fusiforme, which affects its metal binding ability. This integrated analytical approach using a variety of chemical data can be developed for practical applications to obtain new biochemical knowledge including genetic and environmental information.

Journal Article•10.1016/J.COMPBIOLCHEM.2016.08.002•

Recurrent Neural Network Based Hybrid Model of Gene Regulatory Network.

Khalid Raza, Mansaf Alam, Rafat Parveen

22 Aug 2014-arXiv: Neural and Evolutionary Computing

TL;DR: A recurrent neural network (RNN) based model of GRN is proposed, hybridized with a generalized extended Kalman filter for weight updates in the backpropagation-through-time training algorithm; a comparison with other state-of-the-art techniques shows the superiority of the proposed model.

Abstract: Systems biology is an emerging interdisciplinary area of research that focuses on the study of complex interactions in a biological system, such as gene regulatory networks. The discovery of gene regulatory networks leads to a wide range of applications, such as identifying disease-related pathways that reveal how the disease acts and suggest novel tentative drug targets. In addition, the development of biological models from discovered networks or pathways can help to predict responses to disease and can be very useful for novel drug development and treatments. The inference of regulatory networks from biological data is still in its infancy. This paper proposes a recurrent neural network (RNN) based gene regulatory network (GRN) model hybridized with a generalized extended Kalman filter for weight updates in the backpropagation-through-time training algorithm. The RNN is a complex neural network that offers a better compromise between biological closeness and mathematical flexibility for modeling GRNs. The RNN is able to capture complex, non-linear and dynamic relationships among variables. Gene expression data are inherently noisy, and the Kalman filter performs well for estimation even with noisy data. Hence, a non-linear version of the Kalman filter, i.e., the generalized extended Kalman filter, has been applied for weight updates during network training. The developed model has been applied to the DNA SOS repair network, the IRMA network, and two synthetic networks from the DREAM Challenge. A comparison of our results with other state-of-the-art techniques shows the superiority of our model. Further, when 5% Gaussian noise was added to the dataset, the proposed model showed a negligible effect of noise on the results.

Journal Article•10.1016/J.ARTMED.2014.01.005•

Multi-test decision tree and its application to microarray data classification

Marcin Czajkowski¹, Marek Grześ², Marek Kretowski¹•Institutions (2)

Bialystok University of Technology¹, University of Waterloo²

01 May 2014-Artificial Intelligence in Medicine

TL;DR: A new type of decision tree which is relatively easy to analyze and much more powerful in modeling high dimensional microarray data than their popular counterparts is introduced, which is more suitable for solving biological problems.

Journal Article•10.1007/S11427-014-4745-8•

Disease gene identification by using graph kernels and Markov random fields

Bolin Chen¹, Min Li², Jianxin Wang², Fang-Xiang Wu¹•Institutions (2)

University of Saskatchewan¹, Central South University²

14 Oct 2014-Science China-life Sciences

TL;DR: The proposed kernel-based MRF method is evaluated by the leave-one-out cross validation paradigm, achieving an AUC score of 0.771 when integrating all those biological data in the authors' experiments, which indicates that the proposed method is very promising compared with many existing methods.

Abstract: Genes associated with similar diseases are often functionally related. This principle is largely supported by many biological data sources, such as disease phenotype similarities, protein complexes, protein-protein interactions, pathways and gene expression profiles. Integrating multiple types of biological data is an effective method to identify disease genes for many genetic diseases. To capture the gene-disease associations based on biological networks, a kernel-based MRF method is proposed by combining graph kernels and the Markov random field (MRF) method. In the proposed method, three kinds of kernels are employed to describe the overall relationships of vertices in five biological networks, respectively, and a novel weighted MRF method is developed to integrate those data. In addition, an improved Gibbs sampling procedure and a novel parameter estimation method are proposed to generate predictions from the kernel-based MRF method. Numerical experiments are carried out by integrating known gene-disease associations, protein complexes, protein-protein interactions, pathways and gene expression profiles. The proposed kernel-based MRF method is evaluated by the leave-one-out cross validation paradigm, achieving an AUC score of 0.771 when integrating all those biological data in our experiments, which indicates that our proposed method is very promising compared with many existing methods.

Journal Article•10.1016/J.BBRC.2013.11.103•

supraHex: An R/Bioconductor package for tabular omics data analysis using a supra-hexagonal map

Hai Fang¹, Julian Gough¹•Institutions (1)

University of Bristol¹

03 Jan 2014-Biochemical and Biophysical Research Communications

TL;DR: The supraHex map can tell inherent relations between replication timing, CpG and expression and can be overlaid by additional data for multilayer omics data comparisons.

Journal Article•10.1155/2014/173869•

Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

Muhammad Iqbal¹, Ibrahima Faye¹, Brahim Belhaouari Samir², Abas Md Said¹•Institutions (2)

Universiti Teknologi Petronas¹, Alfaisal University²

19 Jun 2014-The Scientific World Journal

TL;DR: A statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector of protein classification, and shows significant improvement in terms of performance measure metrics.

Abstract: Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics were to store and manage the biological data, and develop and analyze computational tools to enhance their understanding. The size of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for the experimental methods. To reduce the gap between newly sequenced protein and proteins with known functions, many computational techniques involving classification and clustering algorithms were proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of large amount of newly discovered proteins. The existing classification results are unsatisfactory due to a huge size of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.

Journal Article•10.1111/2041-210X.12190•

Spatial targeting of infectious disease control: Identifying multiple, unknown sources

Robert Verity¹, Mark D. Stevenson¹, D. Kim Rossmo², Richard A. Nichols¹, Steven C. Le Comber¹•Institutions (2)

Queen Mary University of London¹, Texas State University²

01 Jul 2014-Methods in Ecology and Evolution

TL;DR: It is suggested that the Dirichlet process mixture (DPM) model provides a useful and practical tool for conservation biologists and epidemiologists that can be used to inform management decisions and public health policy.

Abstract: 1. Geographic profiling (GP) was originally developed as an analytical tool in criminology, where it uses the spatial locations of linked crimes (for example murder, rape or arson) to identify areas that are most likely to include the offender's residence. The technique has been extremely successful in this field and is now widely used by police forces and investigative agencies around the world. More recently, the same method has been applied to biological data, notably in spatial epidemiology, where it uses the locations of disease cases to identify infection sources: the identification of these sources is critical to control efforts of diseases such as malaria, since targeted intervention is more efficient and cost-effective than untargeted intervention. 2. Here, we solve the problem of identifying multiple sources, even when the number of sources is unknown - a requirement for many biological studies. We present a new, rigorous mathematical and computational method and show why previous Bayesian methods were often outperformed by the empirically developed criminal geographic targeting (CGT) algorithm used in criminology. 3. We use simulations and real-world examples to compare our model to both the CGT algorithm and to an existing Bayesian model. We demonstrate that our method combines the advantages of both previous methods, particularly in cases featuring large data sets and multiple sources. 4. Our approach provides an increase in search efficiency over other methods and is likely to lead to improved targeting of interventions and more efficient use of resources. We suggest that the Dirichlet process mixture (DPM) model provides a useful and practical tool for conservation biologists and epidemiologists that can be used to inform management decisions and public health policy.

Journal Article•10.1093/BIB/BBT041•

Integrative approaches for predicting protein function and prioritizing genes for complex phenotypes using protein interaction networks

Xiaotu Ma, Ting Chen, Fengzhu Sun

01 Sep 2014-Briefings in Bioinformatics

TL;DR: Applications of similarity measures over networks are reviewed with a special focus on predicting protein functions, prioritizing genes related to a phenotype given a set of seed genes that have been shown to be related to the phenotype, and identification of false positives and false negatives from RNAi experiments.

Abstract: With the rapid development of biotechnologies, many types of biological data including molecular networks are now available. However, to obtain a more complete understanding of a biological system, the integration of molecular networks with other data, such as molecular sequences, protein domains and gene expression profiles, is needed. A key to the use of networks in biological studies is the definition of similarity among proteins over the networks. Here, we review applications of similarity measures over networks with a special focus on the following four problems: (i) predicting protein functions, (ii) prioritizing genes related to a phenotype given a set of seed genes that have been shown to be related to the phenotype, (iii) prioritizing genes related to a phenotype by integrating gene expression profiles and networks and (iv) identification of false positives and false negatives from RNAi experiments. Diffusion kernels are demonstrated to give superior performance in all these tasks, leading to the suggestion that diffusion kernels should be the primary choice for a network similarity metric over other similarity measures such as direct neighbors and shortest path distance.
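
To make the idea of a diffusion kernel as a network similarity measure concrete, here is a minimal sketch, not the authors' implementation. It assumes the common definition K = exp(-beta * L), where L is the graph Laplacian of an unweighted, undirected network; the function names, the toy path graph, the value of `beta` and the Taylor-series evaluation are all illustrative assumptions.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adj, beta=0.5, terms=40):
    """Approximate K = exp(-beta * L) with a truncated Taylor series.

    Suitable only for small toy graphs; beta and terms are illustrative.
    """
    n = len(adj)
    lap = laplacian(adj)
    m = [[-beta * lap[i][j] for j in range(n)] for i in range(n)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # holds M^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, m)]
        for i in range(n):
            for j in range(n):
                kernel[i][j] += term[i][j]
    return kernel


# Toy network: a path of four proteins, 0 - 1 - 2 - 3.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
K = diffusion_kernel(path, beta=0.5)
for row in K:
    print(["%.3f" % v for v in row])
```

Unlike a direct-neighbor indicator, the kernel assigns a nonzero similarity to every pair of connected proteins, and on this toy path the similarity to protein 0 falls off with distance along the path.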

Journal Article•10.1016/J.MOLP.2014.11.012•

Designing Microarray and RNA-seq Experiments for Greater Systems Biology Discovery in Modern Plant Genomics.

Chuanping Yang¹, Hairong Wei¹,²•Institutions (2)

Northeast Forestry University¹, Michigan Technological University²

09 Nov 2014-Molecular Plant

TL;DR: The major issues that researchers commonly face when embarking on microarray or RNA-seq experiments are discussed and important aspects of experimental design are summarized to help researchers deliberate how to generate gene expression profiles with low background noise but with more interaction to facilitate novel biological discoveries in modern plant genomics.

Journal Article•10.1371/JOURNAL.PCBI.1003498•

Model-based analysis for qualitative data: an application in Drosophila germline stem cell regulation.

Michael Pargett¹, Ann E. Rundell¹, Gregery T. Buzzard¹, David M. Umulis¹•Institutions (1)

Purdue University¹

13 Mar 2014-PLOS Computational Biology

TL;DR: A general parameter estimation process to quantitatively optimize models with qualitative data is developed and recommendations for experiments to refine model parameters and discriminate increasingly complex hypotheses are provided.

Abstract: Discovery in developmental biology is often driven by intuition that relies on the integration of multiple types of data such as fluorescent images, phenotypes, and the outcomes of biochemical assays. Mathematical modeling helps elucidate the biological mechanisms at play as the networks become increasingly large and complex. However, the available data is frequently under-utilized due to incompatibility with quantitative model tuning techniques. This is the case for stem cell regulation mechanisms explored in the Drosophila germarium through fluorescent immunohistochemistry. To enable better integration of biological data with modeling in this and similar situations, we have developed a general parameter estimation process to quantitatively optimize models with qualitative data. The process employs a modified version of the Optimal Scaling method from social and behavioral sciences, and multi-objective optimization to evaluate the trade-off between fitting different datasets (e.g. wild type vs. mutant). Using only published imaging data in the germarium, we first evaluated support for a published intracellular regulatory network by considering alternative connections of the same regulatory players. Simply screening networks against wild type data identified hundreds of feasible alternatives. Of these, five parsimonious variants were found and compared by multi-objective analysis including mutant data and dynamic constraints. With these data, the current model is supported over the alternatives, but support for a biochemically observed feedback element is weak (i.e. these data do not measure the feedback effect well). When also comparing new hypothetical models, the available data do not discriminate. To begin addressing the limitations in data, we performed a model-based experiment design and provide recommendations for experiments to refine model parameters and discriminate increasingly complex hypotheses.

Journal Article•10.1016/J.BBRC.2014.02.146•

Robust and stable feature selection by integrating ranking methods and wrapper technique in genetic data classification.

Maryam Yassi¹, Mohammad Hossein Moattar¹•Institutions (1)

Islamic Azad University¹

18 Apr 2014-Biochemical and Biophysical Research Communications

TL;DR: The main goal of this paper is to provide a method for dimension reduction and classification of genetic data sets by combining Wrapper method with the proposed hybrid ranking method to embed the interaction between genes.

Journal Article•10.1371/JOURNAL.PONE.0091315•

Parallel clustering algorithm for large-scale biological data sets.

Minchao Wang¹, Wu Zhang¹, Wang Ding¹, Dongbo Dai¹, Huiran Zhang¹, Hao Xie², Luonan Chen³, Yike Guo⁴, Jiang Xie¹•Institutions (4)

Shanghai University¹, Wuhan University², Chinese Academy of Sciences³, Imperial College London⁴

04 Apr 2014-PLOS ONE

TL;DR: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction procedure and the affinity propagation algorithm, and the runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively.

Abstract: Background: The recent explosion of biological data poses a great challenge for traditional clustering algorithms. With the increasing scale of data sets, much more memory and longer runtimes are required for cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a great bottleneck when handling large-scale data sets. Moreover, the similarity matrix, whose construction takes a long time, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs. Methods: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is used for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes. Results: A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
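
As an illustration of the similarity matrix that affinity propagation takes as input, the following is a minimal serial sketch, not the parallel implementation described in the abstract. It assumes the commonly used negative squared Euclidean distance as the pairwise similarity and the median off-diagonal similarity as the diagonal "preference"; the paper's own similarity measure, the function name and the toy points are assumptions.

```python
from statistics import median


def similarity_matrix(points):
    """Build an affinity propagation input matrix for a list of vectors.

    s(i, j) = -||x_i - x_j||^2 for i != j (assumed similarity measure);
    s(i, i) is set to the median off-diagonal similarity (assumed preference).
    """
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Each row depends only on the input points, so in a shared-memory
        # setting rows could be assigned to separate workers.
        for j in range(i + 1, n):
            s = -sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = sim[j][i] = s
    off_diagonal = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    preference = median(off_diagonal)
    for i in range(n):
        sim[i][i] = preference
    return sim


# Three toy expression profiles: two close together, one far away.
S = similarity_matrix([(0, 0), (0, 1), (5, 5)])
for row in S:
    print(row)
```

The matrix has n² entries, which is why both its construction time and its memory footprint become the bottleneck the abstract describes as data sets grow.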

Patent•

System and method for real-time personalization utilizing an individual's genomic data

Brandon Colby, Ashwin Kotwaliwale

8 Dec 2014

TL;DR: Methods and systems are presented for processing personal biological data for real-time or near real-time application; an exemplary system includes a received reference genome and a received personal genome.

Abstract: The principles of the present invention provide methods and systems for processing personal biological data for real time or near real time application. An exemplary system includes a received reference genome and a received personal genome. The genomes are accessed over a network by one or more servers. Input from one or more sensors associated with an individual or remote from the individual is used in conjunction with the individual's genomic data or the results of the comparison of the individual's genetic data and the reference genome(s) to provide real-time or near real-time suggestions, recommendations, warnings and the like in view of the sensor data and genomic data.
