TL;DR: RAIN is a robust nonparametric method for detecting rhythms of prespecified periods in biological data that can detect arbitrary wave forms; applied to the circadian transcriptome and proteome of mouse liver, its increased detection power significantly expanded the sets of transcripts and proteins with rhythmic abundances.
Abstract: A fundamental problem in research on biological rhythms is that of detecting and assessing the significance of rhythms in large sets of data. Classic methods based on Fourier theory are often hampered by the complex and unpredictable characteristics of experimental and biological noise. Robust nonparametric methods are available but are limited to specific wave forms. We present RAIN, a robust nonparametric method for the detection of rhythms of prespecified periods in biological data that can detect arbitrary wave forms. When applied to measurements of the circadian transcriptome and proteome of mouse liver, the sets of transcripts and proteins with rhythmic abundances were significantly expanded due to the increased detection power, when we controlled for false discovery. Validation against independent data confirmed the quality of these results. The large expansion of the circadian mouse liver transcriptomes and proteomes reflected the prevalence of nonsymmetric wave forms and led to new conclusions about function. RAIN was implemented as a freely available software package for R/Bioconductor and is presently also available as a web interface.
TL;DR: A powerful new algorithm is described that produces all maximal bicliques in a bipartite graph; it is streamlined, efficient, and particularly well-suited to the study of huge and diverse biological data.
Abstract: Integrating and analyzing heterogeneous genome-scale data is a huge algorithmic challenge for modern systems biology. Bipartite graphs can be useful for representing relationships across pairs of disparate data types, with the interpretation of these relationships accomplished through an enumeration of maximal bicliques. Most previously known techniques are generally ill-suited to this foundational task, because they are relatively inefficient and without effective scaling. In this paper, a powerful new algorithm is described that produces all maximal bicliques in a bipartite graph. Unlike most previous approaches, the new method neither places undue restrictions on its input nor inflates the problem size. Efficiency is achieved through an innovative exploitation of bipartite graph structure, and through computational reductions that rapidly eliminate non-maximal candidates from the search space. An iterative selection of vertices for consideration based on non-decreasing common neighborhood sizes boosts efficiency and leads to more balanced recursion trees. The new technique is implemented and compared to previously published approaches from graph theory and data mining. Formal time and space bounds are derived. Experiments are performed on both random graphs and graphs constructed from functional genomics data. It is shown that the new method substantially outperforms the best previous alternatives. The new method is streamlined, efficient, and particularly well-suited to the study of huge and diverse biological data. A robust implementation has been incorporated into GeneWeaver, an online tool for integrating and analyzing functional genomics experiments, available at http://geneweaver.org. The enormous increase in scalability it provides empowers users to study complex and previously unassailable gene-set associations between genes and their biological functions in a hierarchical fashion and on a genome-wide scale. This practical computational resource is adaptable to almost any applications environment in which bipartite graphs can be used to model relationships between pairs of heterogeneous entities.
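To make the notion of a maximal biclique concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm described in the abstract (which relies on pruning and vertex ordering that are not detailed here); it simply tries every subset of one side, so it is only suitable for tiny graphs. The function name `maximal_bicliques` and the toy gene-set data are assumptions made for illustration.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a bipartite graph by brute force.

    adj maps each left vertex to the set of right vertices it is linked to.
    Exponential in the number of left vertices; for illustration only.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # Right side: vertices adjacent to every chosen left vertex.
            common = set.intersection(*(set(adj[u]) for u in subset))
            if not common:
                continue
            # Left side: add every left vertex adjacent to all of `common`,
            # so that neither side of the pair can be extended further.
            closed = frozenset(u for u in left if common <= set(adj[u]))
            found.add((closed, frozenset(common)))
    return found


# Hypothetical toy data: gene sets (left side) linked to genes (right side).
toy = {
    "set1": {"g1", "g2"},
    "set2": {"g1", "g2", "g3"},
    "set3": {"g3"},
}
bicliques = maximal_bicliques(toy)
for genesets, genes in sorted(bicliques, key=lambda p: (sorted(p[0]), sorted(p[1]))):
    print(sorted(genesets), sorted(genes))
```

Each reported pair is a group of gene sets together with all genes they share, which is the kind of relationship the abstract proposes to read off a bipartite graph.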
TL;DR: This work analyzes how different distances and clustering methods interact regarding their ability to cluster gene expression data, i.e., microarray data, and supports the view that the selection of an appropriate distance depends on the scenario at hand.
Abstract: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
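As a small illustration of why the choice of distance can change a clustering outcome, the Python sketch below compares two distances on made-up expression profiles. The abstract does not list its 15 distances here, so Euclidean distance and one minus the Pearson correlation are assumed as representative examples, and the profiles are hypothetical.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """One minus the Pearson correlation: 0 for identical shape, 2 for opposite."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical profiles over four time points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, larger scale
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar scale to gene_a, opposite trend

for name, dist in (("euclidean", euclidean), ("pearson", pearson_distance)):
    print(name, round(dist(gene_a, gene_b), 3), round(dist(gene_a, gene_c), 3))
```

Under Euclidean distance gene_a sits closer to gene_c, while under the correlation-based distance it sits closer to gene_b, so the same clustering method would group these genes differently depending on the distance supplied.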
TL;DR: An effective PU learning framework is proposed that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification, achieving significantly better results than various state-of-the-art prediction methods as well as ensemble learning classifiers.
Abstract: An increasing number of genes have been experimentally confirmed in recent years as causative genes to various human diseases. The newly available knowledge can be exploited by machine learning methods to discover additional unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (the unknown candidate genes) instead of a negative training set N, have been shown to be effective in uncovering new disease genes in the current scenario. Using only a single source of data for prediction can be susceptible to bias due to incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by inherent limitations of individual methods. In this paper, we propose an effective PU learning framework that integrates multiple biological data sources and an ensemble of powerful machine learning classifiers for disease gene identification. Our proposed method integrates data from multiple biological sources for training PU learning classifiers. A novel ensemble-based PU learning method EPU is then used to integrate multiple PU learning classifiers to achieve accurate and robust disease gene predictions. Our evaluation experiments across six disease groups showed that EPU achieved significantly better results compared with various state-of-the-art prediction methods as well as ensemble learning classifiers. Through integrating multiple biological data sources for training and the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors in individual data sources and machine learning algorithms to achieve more accurate and robust disease gene predictions. In the future, our EPU method provides an effective framework to integrate the additional biological and computational resources for better disease gene predictions.
TL;DR: An approach is presented that incorporates quantitative data on social values for conservation and social preferences for development into spatial conservation planning; it identifies areas of the landscape where synergies and conflicts between different value sets are likely to occur.
Abstract: The consideration of information on social values in conjunction with biological data is critical for achieving both socially acceptable and scientifically defensible conservation planning outcomes. However, the influence of social values on spatial conservation priorities has received limited attention and is poorly understood. We present an approach that incorporates quantitative data on social values for conservation and social preferences for development into spatial conservation planning. We undertook a public participation GIS survey to spatially represent social values and development preferences and used species distribution models for 7 threatened fauna species to represent biological values. These spatially explicit data were simultaneously included in the conservation planning software Zonation to examine how conservation priorities changed with the inclusion of social data. Integrating spatially explicit information about social values and development preferences with biological data produced prioritizations that differed spatially from the solution based on only biological data. However, the integrated solutions protected a similar proportion of the species' distributions, indicating that Zonation effectively combined the biological and social data to produce socially feasible conservation solutions of approximately equivalent biological value. We were able to identify areas of the landscape where synergies and conflicts between different value sets are likely to occur. Identification of these synergies and conflicts will allow decision makers to target communication strategies to specific areas and ensure effective community engagement and positive conservation outcomes.
TL;DR: Several aspects of big biological data are described, which implies the central roles of bioinformatics and bioinformaticians in the future research of the biological and biomedical fields.
TL;DR: Three similarity measures for predicting disease associations are presented and the strong correlation between these predictions and known disease associations demonstrates the ability of these measures to provide novel insights into disease relationships.
Abstract: Background: Understanding the relationship between diseases based on the underlying biological mechanisms is one of the greatest challenges in modern biology and medicine. Exploring disease-disease associations by using system-level biological data is expected to improve our current knowledge of disease relationships, which may lead to further improvements in disease diagnosis, prognosis and treatment.
TL;DR: The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering, and Bi-Force outperformed existing tools, at least when following the evaluation protocols from Eren et al.
Abstract: The explosion of biological data has dramatically reformed today’s biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as ‘simultaneous clustering’ or ‘co-clustering’, has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: ‘Bi-Force’. It is based on the weighted bicluster editing model, to perform biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279–292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. We thus outperformed existing tools with Bi-Force at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.
TL;DR: The originality of DINIES lies in prediction with state-of-the-art machine learning methods, in the integration of heterogeneous biological data and in compatibility with the KEGG database.
Abstract: DINIES (drug-target interaction network inference engine based on supervised analysis) is a web server for predicting unknown drug-target interaction networks from various types of biological data (e.g. chemical structures, drug side effects, amino acid sequences and protein domains) in the framework of supervised network inference. The originality of DINIES lies in prediction with state-of-the-art machine learning methods, in the integration of heterogeneous biological data and in compatibility with the KEGG database. The DINIES server accepts any 'profiles' or precalculated similarity matrices (or 'kernels') of drugs and target proteins in tab-delimited file format. When a training data set is submitted to learn a predictive model, users can select either known interaction information in the KEGG DRUG database or their own interaction data. The user can also select an algorithm for supervised network inference, select various parameters in the method and specify weights for heterogeneous data integration. The server can provide integrative analyses with useful components in KEGG, such as biological pathways, functional hierarchy and human diseases. DINIES (http://www.genome.jp/tools/dinies/) is publicly available as one of the genome analysis tools in GenomeNet.
TL;DR: This work presents an integrative modeling methodology that unifies under a common framework the various biological processes and their interactions across multiple layers and paves the way toward integrative techniques that extract knowledge from a variety of biological data to achieve more than the sum of their parts in the context of prediction, analysis, and redesign of biological systems.
Abstract: Given the vast behavioral repertoire and biological complexity of even the simplest organisms, accurately predicting phenotypes in novel environments and unveiling their biological organization is a challenging endeavor. Here, we present an integrative modeling methodology that unifies under a common framework the various biological processes and their interactions across multiple layers. We trained this methodology on an extensive normalized compendium for the gram-negative bacterium Escherichia coli, which incorporates gene expression data for genetic and environmental perturbations, transcriptional regulation, signal transduction, and metabolic pathways, as well as growth measurements. Comparison with measured growth and high-throughput data demonstrates the enhanced ability of the integrative model to predict phenotypic outcomes in various environmental and genetic conditions, even in cases where their underlying functions are under-represented in the training set. This work paves the way toward integrative techniques that extract knowledge from a variety of biological data to achieve more than the sum of their parts in the context of prediction, analysis, and redesign of biological systems.
TL;DR: This review focuses on those areas of biological research that can be greatly assisted by bioinformatics tools: analysing DNA and protein sequences to identify various features, predicting the 3D structure of protein molecules, studying molecular interactions, and performing simulations that mimic a biological phenomenon to extract useful information from the biological data.
Abstract: Scientific knowledge has never been produced and shared as quickly as it is today. Different areas of science are converging to give rise to new disciplines. Bioinformatics is one such newly emerging field, which uses computing, mathematics and statistics in molecular biology to archive, retrieve, and analyse biological data. Although still in its infancy, it has become one of the fastest growing fields and has quickly established itself as an integral component of any biological research activity. It is gaining popularity due to its ability to analyse huge amounts of biological data quickly and cost-effectively. Bioinformatics can help a biologist extract valuable information from biological data by providing various web- and/or computer-based tools, the majority of which are freely available. The present review gives a comprehensive summary of some of these tools available to a life scientist to analyse biological data. Specifically, this review focuses on those areas of biological research that can be greatly assisted by such tools: analysing DNA and protein sequences to identify various features, predicting the 3D structure of protein molecules, studying molecular interactions, and performing simulations that mimic a biological phenomenon to extract useful information from the biological data.
TL;DR: This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources.
Abstract: Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: Collect statistics from biological data. Build a computational model. Solve a computational modeling problem. Test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data is usually represented as matrices and analysis of microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as construction and analysis of large-scale networks.
TL;DR: The goal of the workflow is to automate exploratory data analysis while preserving flexibility.
Abstract: In bioinformatics, the term exploratory data analysis refers to different methods to get an overview of large biological data sets. Hence, it helps to create a framework for further analysis and hypothesis testing. The workflow facilitates this first important step in the analysis of data created by high-throughput technologies. The results are different plots showing the structure of the measurements. The goal of the workflow is the automation of the exploratory data analysis, while also guaranteeing flexibility. The basic tool is the free software R.
TL;DR: A new key-value pair data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance is introduced and used as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
Abstract: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. 
We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
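The general idea of a key-value layout for expression data can be sketched in Python as below. This is not the schema from the paper, whose key design is not given in the abstract; the composite key format `study|patient|probe`, the `SortedKeyValueStore` class and the sample values are all hypothetical. The sketch only assumes that the store keeps rows sorted by key, so that all records sharing a key prefix are contiguous and can be read in one scan.

```python
from bisect import bisect_left


class SortedKeyValueStore:
    """In-memory stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of row keys
        self._values = {}  # row key -> stored value

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) pairs whose key starts with prefix, in key order."""
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


def row_key(study, patient, probe):
    # Hypothetical composite key: one patient's probes sort next to each other.
    return f"{study}|{patient}|{probe}"


store = SortedKeyValueStore()
store.put(row_key("studyA", "patient01", "probe_0002"), 5.1)
store.put(row_key("studyA", "patient02", "probe_0001"), 6.8)
store.put(row_key("studyA", "patient01", "probe_0001"), 7.2)

# Fetch every expression value for one patient with a single prefix scan.
patient01 = list(store.scan_prefix("studyA|patient01|"))
print(patient01)
```

The design choice illustrated is that a query for one patient's records becomes a contiguous range read rather than a join across tables, which is one plausible reason a key-value model could answer such queries faster.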
TL;DR: It is observed that one class of feature selection and classifier are consistently top performers across data types and number of markers, suggesting that well performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.
Abstract: Background: There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore the model’s performance across multiple biological data types. Results: This paper presents the results of a large scale empirical study whereby a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. Conclusions: As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases, best performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. It is observed that one class of feature selection and classifier are consistently top performers across data types and number of markers, suggesting that well performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.
TL;DR: The results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments and suggest that current large-scale evaluations are meaningful and almost surprisingly reliable.
Abstract: Motivation: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. Results: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: This work proposes a framework where GI networks are learned from experimental data using Bayesian networks (BNs) and the incorporation of external knowledge is also done via a BN that is called Bayesian Network Prior (BNP).
Abstract: Motivation: Reverse engineering GI networks from experimental data is a challenging task due to the complex nature of the networks and the noise inherent in the data. One way to overcome these hurdles would be incorporating the vast amounts of external biological knowledge when building interaction networks. We propose a framework where GI networks are learned from experimental data using Bayesian networks (BNs) and the incorporation of external knowledge is also done via a BN that we call Bayesian Network Prior (BNP). BNP depicts the relation between various evidence types that contribute to the event ‘gene interaction’ and is used to calculate the probability of a candidate graph (G) in the structure learning process. Results: Our simulation results on synthetic, simulated and real biological data show that the proposed approach can identify the underlying interaction network with high accuracy even when the prior information is distorted and outperforms existing methods. Availability: Accompanying BNP software package is freely available for academic use at http://bioe.bilgi.edu.tr/BNP. Contact: hasan.otu@bilgi.edu.tr Supplementary Information: Supplementary data are available at Bioinformatics online.
TL;DR: A method of data preprocessing that can perform comprehensively integrated analysis based on a variety of multimeasurement of organic and inorganic chemical data from Sargassum fusiforme is described to explore the concealed biological information by statistical analyses with integrated data.
Abstract: Biological information is intricately intertwined with several factors. Therefore, comprehensive analytical methods such as integrated data analysis, combining several data measurements, are required. In this study, we describe a method of data preprocessing that can perform comprehensively integrated analysis based on a variety of multimeasurement of organic and inorganic chemical data from Sargassum fusiforme and explore the concealed biological information by statistical analyses with integrated data. Chemical components including polar and semipolar metabolites, minerals, major elemental and isotopic ratio, and thermal decompositional data were measured as environmentally responsive biological data in the seasonal variation. The obtained spectral data of complex chemical components were preprocessed to isolate pure peaks by removing noise and separating overlapping signals using the multivariate curve resolution alternating least-squares method before integrated analyses. By the input of these preprocessed multimeasurement chemical data, principal component analysis and self-organizing maps of integrated data showed changes in the chemical compositions during the mature stage and identified trends in seasonal variation. Correlation network analysis revealed multiple relationships between organic and inorganic components. Moreover, in terms of the relationship between metal group and metabolites, the results of structural equation modeling suggest that the structure of alginic acid changes during the growth of S. fusiforme, which affects its metal binding ability. This integrated analytical approach using a variety of chemical data can be developed for practical applications to obtain new biochemical knowledge including genetic and environmental information.
TL;DR: A recurrent neural network (RNN) based model of GRN is proposed, hybridized with a generalized extended Kalman filter for weight update in the backpropagation through time training algorithm, and a comparison with other state-of-the-art techniques shows the superiority of the proposed model.
Abstract: Systems biology is an emerging interdisciplinary area of research that focuses on the study of complex interactions in a biological system, such as gene regulatory networks. The discovery of gene regulatory networks leads to a wide range of applications, such as identifying pathways related to a disease, which can unveil in what way the disease acts and provide novel tentative drug targets. In addition, the development of biological models from discovered networks or pathways can help to predict the responses to disease and can be very useful for novel drug development and treatments. The inference of regulatory networks from biological data is still in its infancy. This paper proposes a recurrent neural network (RNN) based gene regulatory network (GRN) model hybridized with a generalized extended Kalman filter for weight update in the backpropagation through time training algorithm. The RNN is a complex neural network that offers a better trade-off between biological closeness and mathematical flexibility for modeling GRNs. The RNN is able to capture complex, non-linear and dynamic relationships among variables. Gene expression data are inherently noisy, and the Kalman filter performs well for estimation even with noisy data. Hence, a non-linear version of the Kalman filter, i.e., the generalized extended Kalman filter, has been applied for weight update during network training. The developed model has been applied to the DNA SOS repair network, the IRMA network, and two synthetic networks from the DREAM Challenge. We compared our results with other state-of-the-art techniques, and the comparison shows the superiority of our model. Further, 5% Gaussian noise was added to the dataset, and the results of the proposed model show a negligible effect of the noise.
TL;DR: A new type of decision tree is introduced that is relatively easy to analyze, much more powerful in modeling high-dimensional microarray data than its popular counterparts, and more suitable for solving biological problems.
TL;DR: The proposed kernel-based MRF method is evaluated by the leave-one-out cross validation paradigm, achieving an AUC score of 0.771 when integrating all those biological data in the authors' experiments, which indicates that the proposed method is very promising compared with many existing methods.
Abstract: Genes associated with similar diseases are often functionally related. This principle is largely supported by many biological data sources, such as disease phenotype similarities, protein complexes, protein-protein interactions, pathways and gene expression profiles. Integrating multiple types of biological data is an effective method to identify disease genes for many genetic diseases. To capture the gene-disease associations based on biological networks, a kernel-based MRF method is proposed by combining graph kernels and the Markov random field (MRF) method. In the proposed method, three kinds of kernels are employed to describe the overall relationships of vertices in five biological networks, respectively, and a novel weighted MRF method is developed to integrate those data. In addition, an improved Gibbs sampling procedure and a novel parameter estimation method are proposed to generate predictions from the kernel-based MRF method. Numerical experiments are carried out by integrating known gene-disease associations, protein complexes, protein-protein interactions, pathways and gene expression profiles. The proposed kernel-based MRF method is evaluated by the leave-one-out cross validation paradigm, achieving an AUC score of 0.771 when integrating all those biological data in our experiments, which indicates that our proposed method is very promising compared with many existing methods.
TL;DR: The supraHex map can reveal inherent relations between replication timing, CpG and expression, and can be overlaid with additional data for multilayer omics data comparisons.
TL;DR: A statistical metric-based feature selection technique is proposed to reduce the size of the extracted feature vector for protein classification, and it shows significant improvement in terms of performance metrics.
Abstract: Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics are to store and manage biological data, and to develop and analyze computational tools that enhance their understanding. The size of the data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms have been proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of the large number of newly discovered proteins. The existing classification results are unsatisfactory due to the huge number of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique is proposed to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.
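The performance metrics named in this abstract can all be derived from a binary confusion matrix. The following is a minimal sketch, not the authors' evaluation code; the function name and the example counts are hypothetical and chosen only for illustration. Note that sensitivity and recall are the same quantity under the standard definitions.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives,
    true negatives, false negatives (hypothetical inputs).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)      # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    # F-measure (F1): harmonic mean of precision and recall
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Illustrative counts only; they are not taken from the paper.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and recall side by side therefore adds no information; specificity and the F-measure are the complementary quantities.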
TL;DR: It is suggested that the Dirichlet process mixture (DPM) model provides a useful and practical tool for conservation biologists and epidemiologists that can be used to inform management decisions and public health policy.
Abstract: 1. Geographic profiling (GP) was originally developed as an analytical tool in criminology, where it uses the spatial locations of linked crimes (for example murder, rape or arson) to identify areas that are most likely to include the offender's residence. The technique has been extremely successful in this field and is now widely used by police forces and investigative agencies around the world. More recently, the same method has been applied to biological data, notably in spatial epidemiology, where it uses the locations of disease cases to identify infection sources: the identification of these sources is critical to control efforts of diseases such as malaria, since targeted intervention is more efficient and cost-effective than untargeted intervention. 2. Here, we solve the problem of identifying multiple sources, even when the number of sources is unknown - a requirement for many biological studies. We present a new, rigorous mathematical and computational method and show why previous Bayesian methods were often outperformed by the empirically developed criminal geographic targeting (CGT) algorithm used in criminology. 3. We use simulations and real-world examples to compare our model to both the CGT algorithm and to an existing Bayesian model. We demonstrate that our method combines the advantages of both previous methods, particularly in cases featuring large data sets and multiple sources. 4. Our approach provides an increase in search efficiency over other methods and is likely to lead to improved targeting of interventions and more efficient use of resources. We suggest that the Dirichlet process mixture (DPM) model provides a useful and practical tool for conservation biologists and epidemiologists that can be used to inform management decisions and public health policy.
TL;DR: Applications of similarity measures over networks are reviewed with a special focus on predicting protein functions, prioritizing genes related to a phenotype given a set of seed genes that have been shown to be related to the phenotype, and identification of false positives and false negatives from RNAi experiments.
Abstract: With the rapid development of biotechnologies, many types of biological data including molecular networks are now available. However, to obtain a more complete understanding of a biological system, the integration of molecular networks with other data, such as molecular sequences, protein domains and gene expression profiles, is needed. A key to the use of networks in biological studies is the definition of similarity among proteins over the networks. Here, we review applications of similarity measures over networks with a special focus on the following four problems: (i) predicting protein functions, (ii) prioritizing genes related to a phenotype given a set of seed genes that have been shown to be related to the phenotype, (iii) prioritizing genes related to a phenotype by integrating gene expression profiles and networks and (iv) identification of false positives and false negatives from RNAi experiments. Diffusion kernels are demonstrated to give superior performance in all these tasks, leading to the suggestion that diffusion kernels should be the primary choice for a network similarity metric over other similarity measures such as direct neighbors and shortest path distance.
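To make the diffusion kernel mentioned above concrete: one common definition is K = exp(-beta * L), where L = D - A is the graph Laplacian and beta controls how far similarity spreads. The sketch below uses that definition on a toy four-node path graph. It is an illustration under stated assumptions, not the reviewed authors' implementation; the helper names, the value of beta, and the truncated Taylor series (adequate only for small graphs) are all choices made for this example.

```python
def laplacian(n, edges):
    """Graph Laplacian L = D - A of an undirected, unweighted graph."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][v] -= 1.0
        L[v][u] -= 1.0
        L[u][u] += 1.0
        L[v][v] += 1.0
    return L


def matmul(A, B):
    """Plain dense matrix product for small square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(L, beta=0.5, terms=40):
    """Approximate K = exp(-beta * L) with a truncated Taylor series."""
    n = len(L)
    M = [[-beta * x for x in row] for row in L]
    K = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]          # current series term, starts at I
    for k in range(1, terms + 1):
        term = matmul(term, M)
        term = [[x / k for x in row] for row in term]   # now M^k / k!
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K


# Toy path graph 0 - 1 - 2 - 3 (hypothetical example network).
K = diffusion_kernel(laplacian(4, [(0, 1), (1, 2), (2, 3)]))
print([round(x, 4) for x in K[0]])   # similarity of node 0 to every node
```

Unlike a direct-neighbor measure, the kernel assigns nonzero similarity to nodes 2 and 3 even though they are not adjacent to node 0, and the similarity decays with distance along the path.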
TL;DR: The major issues that researchers commonly face when embarking on microarray or RNA-seq experiments are discussed and important aspects of experimental design are summarized to help researchers deliberate how to generate gene expression profiles with low background noise but with more interaction to facilitate novel biological discoveries in modern plant genomics.
TL;DR: A general parameter estimation process to quantitatively optimize models with qualitative data is developed and recommendations for experiments to refine model parameters and discriminate increasingly complex hypotheses are provided.
Abstract: Discovery in developmental biology is often driven by intuition that relies on the integration of multiple types of data such as fluorescent images, phenotypes, and the outcomes of biochemical assays. Mathematical modeling helps elucidate the biological mechanisms at play as the networks become increasingly large and complex. However, the available data is frequently under-utilized due to incompatibility with quantitative model tuning techniques. This is the case for stem cell regulation mechanisms explored in the Drosophila germarium through fluorescent immunohistochemistry. To enable better integration of biological data with modeling in this and similar situations, we have developed a general parameter estimation process to quantitatively optimize models with qualitative data. The process employs a modified version of the Optimal Scaling method from social and behavioral sciences, and multi-objective optimization to evaluate the trade-off between fitting different datasets (e.g. wild type vs. mutant). Using only published imaging data in the germarium, we first evaluated support for a published intracellular regulatory network by considering alternative connections of the same regulatory players. Simply screening networks against wild type data identified hundreds of feasible alternatives. Of these, five parsimonious variants were found and compared by multi-objective analysis including mutant data and dynamic constraints. With these data, the current model is supported over the alternatives, but support for a biochemically observed feedback element is weak (i.e. these data do not measure the feedback effect well). When also comparing new hypothetical models, the available data do not discriminate. To begin addressing the limitations in data, we performed a model-based experiment design and provide recommendations for experiments to refine model parameters and discriminate increasingly complex hypotheses.
TL;DR: The main goal of this paper is to provide a method for dimension reduction and classification of genetic data sets by combining Wrapper method with the proposed hybrid ranking method to embed the interaction between genes.
TL;DR: Two types of parallel architectures are proposed to accelerate the similarity matrix construction procedure and the affinity propagation algorithm; the runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively.
Abstract: Background
The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a major bottleneck when handling large-scale data sets. Moreover, because the algorithm clusters data sets based on the similarities between data pairs, a similarity matrix, which itself takes a long time to construct, is required before the affinity propagation algorithm can run.
Methods
Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix construction procedure and the affinity propagation algorithm. The shared-memory architecture is used to construct the similarity matrix, and the distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. An appropriate scheme for data partitioning and reduction is designed in our method in order to minimize the global communication cost among processes.
Results
A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.
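The reported speedup of 100 on 128 cores can be put in context with two standard derived quantities: parallel efficiency (speedup divided by core count) and the Karp-Flatt experimentally determined serial fraction. The sketch below is simple arithmetic on the two numbers stated in the abstract; the function names are hypothetical, and the paper itself may not report these derived figures.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def karp_flatt(speedup, cores):
    """Karp-Flatt metric: serial fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Figures taken from the abstract: speedup of 100 with 128 cores.
speedup, cores = 100.0, 128
print(f"efficiency: {parallel_efficiency(speedup, cores):.3f}")
print(f"implied serial fraction: {karp_flatt(speedup, cores):.5f}")
```

An efficiency of roughly 78% and an implied serial fraction of about 0.2% are consistent with the abstract's claim that communication cost among processes was kept small.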
TL;DR: Methods and systems are presented for processing personal biological data for real-time or near-real-time application; an exemplary system includes a received reference genome and a received personal genome.
Abstract: The principles of the present invention provide methods and systems for processing personal biological data for real time or near real time application. An exemplary system includes a received reference genome and a received personal genome. The genomes are accessed over a network by one or more servers. Input from one or more sensors associated with an individual or remote from the individual is used in conjunction with the individual's genomic data or the results of the comparison of the individual's genetic data and the reference genome(s) to provide real-time or near real-time suggestions, recommendations, warnings and the like in view of the sensor data and genomic data.