TL;DR: Recent applications of network thinking to the evolution of networks at the gene and protein level and to the dynamics and stability of communities are reviewed.
Abstract: Although pairwise interactions have always had a key role in ecology and evolutionary biology, the recent increase in the amount and availability of biological data has placed a new focus on the complex networks embedded in biological systems. The increased availability of computational tools to store and retrieve biological data has facilitated wide access to these data, not just by biologists but also by specialists from the social sciences, computer science, physics and mathematics. This fusion of interests has led to a burst of research on the properties and consequences of network structure in biological systems. Although traditional measures of network structure and function have started us off on the right foot, an important next step is to create biologically realistic models of network formation, evolution, and function. Here, we review recent applications of network thinking to the evolution of networks at the gene and protein level and to the dynamics and stability of communities. These studies have provided new insights into the organization and function of biological systems by applying existing techniques of network analysis. The current challenge is to recognize the commonalities in evolutionary and ecological applications of network thinking to create a predictive science of biological networks.
TL;DR: The algorithm works by assigning genes to a predefined set of model profiles that capture the potential distinct patterns that can be expected from the experiment and outperforms both general clustering algorithms and algorithms designed specifically for clustering time series gene expression data.
Abstract: Motivation: Time series expression experiments are used to study a wide range of biological systems. More than 80% of all time series expression datasets are short (8 time points or fewer). These datasets present unique challenges. On account of the large number of genes profiled (often tens of thousands) and the small number of time points many patterns are expected to arise at random. Most clustering algorithms are unable to distinguish between real and random patterns.
Results: We present an algorithm specifically designed for clustering short time series expression data. Our algorithm works by assigning genes to a predefined set of model profiles that capture the potential distinct patterns that can be expected from the experiment. We discuss how to obtain such a set of profiles and how to determine the significance of each of these profiles. Significant profiles are retained for further analysis and can be combined to form clusters. We tested our method on both simulated and real biological data. Using immune response data we show that our algorithm can correctly detect the temporal profile of relevant functional categories. Using Gene Ontology analysis we show that our algorithm outperforms both general clustering algorithms and algorithms designed specifically for clustering time series gene expression data.
Availability: Information on obtaining a Java implementation with a graphical user interface (GUI) is available from http://www.cs.cmu.edu/~jernst/st/
Contact: jernst@cs.cmu.edu
Supplementary information: Available at http://www.cs.cmu.edu/~jernst/st/
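The profile-assignment step described in the Results above can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and uses three hand-written model profiles; the profile set, the gene names, and the function names are hypothetical, and the significance test for profiles mentioned in the abstract is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def assign_to_profiles(genes, profiles):
    """Map each gene name to the index of its best-correlated model profile."""
    assignment = {}
    for name, series in genes.items():
        scores = [pearson(series, p) for p in profiles]
        assignment[name] = max(range(len(profiles)), key=lambda i: scores[i])
    return assignment


# Hypothetical model profiles over 4 time points (change relative to time 0).
profiles = [
    [0, 1, 2, 3],     # steady increase
    [0, -1, -2, -3],  # steady decrease
    [0, 1, 1, 0],     # transient peak
]

# Hypothetical short time series for three genes.
genes = {
    "geneA": [0.0, 0.8, 2.1, 2.9],
    "geneB": [0.0, -1.2, -1.9, -3.3],
    "geneC": [0.0, 1.1, 0.9, 0.1],
}

assignment = assign_to_profiles(genes, profiles)
print(assignment)
```

Because the profile set is fixed in advance, the number of genes landing on each profile can afterwards be compared with what random data would produce, which is the role of the significance step in the abstract.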
TL;DR: It is observed that the essentiality of reactions in a node is not correlated with node connectivity as structural analyses of other biological networks have suggested and fundamental differences among different biological networks arising out of their representation and functional constraints are suggested.
TL;DR: This chapter introduces five major paradigms for machine learning and data analysis including: probabilistic and statistical models, symbolic learning and rule induction, neural networks, evolution-based algorithms, and analytic learning and fuzzy logic, and discusses their relevance and potential for biomedical research.
Abstract: In this chapter we provide a broad overview of selected knowledge management, data mining, and text mining techniques and their use in various emerging biomedical applications. It aims to set the context for subsequent chapters. We first introduce five major paradigms for machine learning and data analysis including: probabilistic and statistical models, symbolic learning and rule induction, neural networks, evolution-based algorithms, and analytic learning and fuzzy logic. We also discuss their relevance and potential for biomedical research. Example applications of relevant knowledge management, data mining, and text mining research are then reviewed in order including: ontologies; knowledge management for health care, biomedical literature, heterogeneous databases, information visualization, and multimedia databases; and data and text mining for health care, literature, and biological data. We conclude the paper with discussions of privacy and confidentiality issues of relevance to biomedical data mining.
TL;DR: This paper describes BioFerret, a meta-search tool for searching and filtering relevant information from the web, and ALFA Text Viewer, an interactive tool for user-guided extraction, disambiguation, and representation of information from scientific text.
Abstract: Motivations: Technological advances in biomedical research are generating a plethora of heterogeneous data at a high rate. There is a critical need for extraction, integration and management tools for information discovery and synthesis from these heterogeneous data.
Results: In this paper, we present a general architecture, called ALFA, for information extraction and representation from diverse biological data. The ALFA architecture consists of: (i) a networked, hierarchical, hyper-graph object model for representing information from heterogeneous data sources in a standardized, structured format; and (ii) a suite of integrated, interactive software tools for information extraction and representation from diverse biological data sources. As part of our research efforts to explore this space, we have currently prototyped the ALFA object model and a set of interactive software tools for searching, filtering, and extracting information from scientific text. In particular, we describe BioFerret, a meta-search tool for searching and filtering relevant information from the web, and ALFA Text Viewer, an interactive tool for user-guided extraction, disambiguation, and representation of information from scientific text. We further demonstrate the potential of our tools in integrating the extracted information with experimental data and diagrammatic biological models via the common underlying ALFA representation.
Contact: aditya_vailaya@agilent.com
TL;DR: The aim of this book is to introduce the reader to some of the best techniques for data mining in bioinformatics in the hope that the reader will build on them to make new discoveries on their own.
Abstract: The aim of this book is to introduce the reader to some of the best techniques for data mining in bioinformatics in the hope that the reader will build on them to make new discoveries on his or her own. The book contains twelve chapters in four parts, namely, overview, sequence and structure alignment, biological data mining, and biological data management. This chapter provides an introduction to the field and describes how the chapters in the book relate to one another.
TL;DR: This paper proposes a new method based on spectral clustering for the partitioning of genes according to their biological function based on Gene Ontology annotation, which promises to automate, speed up and therefore improve biological data analysis.
Abstract: With the invention of high throughput methods, researchers are capable of producing large amounts of biological data. During the analysis of such data the need for a functional grouping of genes arises. In this paper, we propose a new method based on spectral clustering for the partitioning of genes according to their biological function. The functional information is based on Gene Ontology annotation, a mechanism to capture functional knowledge in a shareable and computer processable form. Our functional cluster method promises to automate, speed up and therefore improve biological data analysis.
TL;DR: The method has been applied fruitfully to both phylogenetic and microarray expression data, and has been used to associate logical combinations of protein activity with disease state phenotypes, revealing previously unknown ternary relationships among proteins, and illustrating the inherent complexities that arise in biological data.
Abstract: The wealth of available genomic data has spawned a corresponding interest in computational methods that can impart biological meaning and context to these experiments. Traditional computational methods have drawn relationships between pairs of proteins or genes based on notions of equality or similarity between their patterns of occurrence or behavior. For example, two genes displaying similar variation in expression, over a number of experiments, may be predicted to be functionally related. We have introduced a natural extension of these approaches, instead identifying logical relationships involving triplets of proteins. Triplets provide for various discrete kinds of logic relationships, leading to detailed inferences about biological associations. For instance, a protein C might be encoded within an organism if, and only if, two other proteins A and B are also both encoded within the organism, thus suggesting that gene C is functionally related to genes A and B. The method has been applied fruitfully to both phylogenetic and microarray expression data, and has been used to associate logical combinations of protein activity with disease state phenotypes, revealing previously unknown ternary relationships among proteins, and illustrating the inherent complexities that arise in biological data.
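The "C if, and only if, A and B" relationship from the abstract can be illustrated with a small sketch over presence/absence profiles. This is a simplification under stated assumptions: it tests for an exact logical match across organisms, whereas a method for real data would need a statistical score to tolerate noise. The profile data and the function names are hypothetical.

```python
def and_relation_holds(a, b, c):
    """True if c is present exactly when both a and b are present, in every organism."""
    return all(bool(ci) == bool(ai and bi) for ai, bi, ci in zip(a, b, c))


def find_and_triplets(profiles):
    """Return (A, B, C) name triples where C <=> (A AND B).

    Triples in which C simply equals A or B are skipped, because a pairwise
    comparison would already capture that relationship.
    """
    hits = []
    names = sorted(profiles)
    for c in names:
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if c in (a, b):
                    continue
                pa, pb, pc = profiles[a], profiles[b], profiles[c]
                if pc == pa or pc == pb:
                    continue
                if and_relation_holds(pa, pb, pc):
                    hits.append((a, b, c))
    return hits


# Hypothetical phylogenetic profiles: 1 = protein encoded in that organism.
profiles = {
    "A": [1, 1, 0, 1, 0, 1],
    "B": [1, 0, 1, 1, 0, 0],
    "C": [1, 0, 0, 1, 0, 0],
    "D": [0, 1, 1, 0, 1, 0],
}

triplets = find_and_triplets(profiles)
print(triplets)
```

Note that C does not match A or B individually here, so a purely pairwise similarity search would miss the relationship that the triplet test recovers.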
TL;DR: The chapter shows how biological knowledge pertaining to the organisms under study can be used to impose certain constraints on the models of variance–covariance structure and suggests ways to integrate such constraints into the proposal of several less generalized models.
Abstract: Publisher Summary This chapter discusses the standard approaches that have been used to estimate variation in landmark data and explains why these methods do not properly estimate variation in biological forms. Though some of these approaches have become rather mathematically involved, their suitability to the realities of biological data has not improved in parallel. The chapter presents a generalized model for variation in landmark data. It has been shown that only certain features of this model can be consistently estimated. This model and the estimators are used in morphometric analysis as the basis for parametric bootstrapping procedures to test for differences in form using landmark data. Recognizing the limitation of the ability to estimate only certain features of this general model of variation, it discusses the need for the development of less general models that may reasonably characterize variation in landmark data. The chapter shows how biological knowledge pertaining to the organisms under study can be used to impose certain constraints on the models of variance–covariance structure. It suggests ways to integrate such constraints into the proposal of several less generalized models, some statistically convenient but biologically improbable, others less streamlined statistically but more biologically reasonable.
TL;DR: A method that covers the main steps and components involved in similarity-based grouping procedures in biological data sources and ontologies is proposed, and an environment that supports comparison and evaluation of different grouping strategies is developed.
Abstract: Data integration is an important procedure underlying many research tasks in the life sciences, as often multiple data sources have to be accessed to collect the relevant data. The data sources vary in content, data format, and access methods, which often vastly complicates the data retrieval process. As a result, the task of retrieving data requires a great deal of effort and expertise on the part of the user. To alleviate these difficulties, various information integration systems have been proposed in the area. However, a number of issues remain unsolved and new integration solutions are needed. The work presented in this thesis considers data integration at three different levels. 1) Integration of biological data sources deals with integrating multiple data sources from an information integration system point of view. We study properties of biological data sources and existing integration systems. Based on the study, we formulate requirements for systems integrating biological data sources. Then, we define a query language that supports queries commonly used by biologists. Also, we propose a high-level architecture for an information integration system that meets a selected set of requirements and that supports the specified query language. 2) Integration of ontologies deals with finding overlapping information between ontologies. We develop and evaluate algorithms that use life science literature and take the structure of the ontologies into account. 3) Grouping of biological data entries deals with organizing data entries into groups based on the computation of similarity values between the data entries. We propose a method that covers the main steps and components involved in similarity-based grouping procedures. The applicability of the method is illustrated by a number of test cases. Further, we develop an environment that supports comparison and evaluation of different grouping strategies. 
The work is supported by the implementation of: 1) a prototype for a system integrating biological data sources, called BioTRIFU, 2) algorithms for ontology alignment, and 3) an environment for evaluating strategies for similarity-based grouping of biological data, called KitEGA.
TL;DR: This methodology allows the identification of relationships between low-magnitude similarity (LMS) sequence patterns and other well-contrasted protein characteristics, such as those described by database annotations, to produce previously unknown and useful knowledge in the area of biological sequence analysis.
Abstract: We present a novel application of knowledge discovery technology to a developing and challenging application area such as bioinformatics. This methodology allows the identification of relationships between low-magnitude similarity (LMS) sequence patterns and other well-contrasted protein characteristics, such as those described by database annotations. We start with the identification of these signals inside protein sequences by exhaustive database searching and automatic pattern recognition strategies. In a second step we address the discovering of association rules that will allow tagging sequences that hold LMS signals with consequent functional keywords. We have designed our own algorithm for discovering association rules, meeting the special necessities of bioinformatics problems, where the patterns we search lie in sparse datasets and are uncommon and thus difficult to locate. Computational efficiency has been verified both with synthetic and real biological data showing that the algorithm is well suited to this application area compared to state of the art algorithms. The usefulness of the method is confirmed by its ability to produce previously unknown and useful knowledge in the area of biological sequence analysis. In addition, we introduce a new and promising application of the rule extraction algorithm on gene expression databases.
TL;DR: The INDUS framework was used to design algorithms for learning probabilistic models for predicting GO functional classification of a protein based on training sequences that are distributed among SWISSPROT and MIPS data sources.
Abstract: We present INDUS (Intelligent Data Understanding System), a federated, query-centric system for knowledge acquisition from autonomous, distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to enable a user or an application to view a collection of such data sources (regardless of location, internal structure and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology. We used the INDUS framework to design algorithms for learning probabilistic models (e.g., Naive Bayes models) for predicting GO functional classification of a protein based on training sequences that are distributed among SWISSPROT and MIPS data sources. Mappings such as EC2GO and MIPS2GO were used to resolve the semantic differences between these data sources when answering queries posed by the learning algorithms. Our results show that INDUS can be successfully used for integrative analysis of data from multiple sources needed for collaborative discovery in computational biology.
TL;DR: A general architecture and some building blocks for the implementation of a system that is able to remotely execute workflows of biomedical interest are presented, and it is shown how this approach can effectively produce useful outputs.
Abstract: The completion of the Human Genome Project has resulted in large quantities of biological data which are proving difficult to manage and integrate effectively. There is a need for a system that is able to automate accesses to remote sites and to "understand" the information that it is managing in order to link data properly. Workflow management systems combined with Web Services are promising Information and Communication Technologies (ICT) tools. Some have already been proposed and are being increasingly applied to the biomedical domain, especially as many biology-related Web Services are now becoming available. Information on biological resources and on genomic sequence mutations are two examples of very specialized datasets that are useful for specific research domains. The architecture of a system that is able to access and execute predefined workflows is presented in this paper. Web Services allowing access to the IARC TP53 Mutation Database and CABRI catalogues of biological resources have been implemented and are available on-line. Example workflows which retrieve data from these Web Services have also been created and are available on-line. We present a general architecture and some building blocks for the implementation of a system that is able to remotely execute workflows of biomedical interest and show how this approach can effectively produce useful outputs. The further development and implementation of Web Services allowing access to an exhaustive set of biomedical databases and the creation of effective and useful workflows will improve the automation of in-silico analysis.
TL;DR: The gene network technique holds the promise of providing a conceptual framework for analysis of the profusion of biological data being generated on potential drug targets and providing insights to understand the biological regulatory mechanisms in diseases.
Abstract: The complete genome sequences have provided a plethora of potential drug targets. Gene network technique holds the promise of providing a conceptual framework for analysis of the profusion of biological data being generated on potential drug targets and providing insights to understand the biological regulatory mechanisms in diseases, which are playing an increasingly important role in searching for novel drug targets from the information contained in genomics. In this paper, we discuss some of the network-based approaches for identifying drug targets, with the emphasis on the gene network strategy. In addition, some of the relevant data resources and computational tools are given.
TL;DR: GeneKeyDB is a gene-centered relational database developed to enhance data mining in biological data sets and can enable analysis tools and users to manipulate the intersections, unions, and differences among different data sets.
Abstract: Background
The analysis of biological data is greatly enhanced by existing or emerging databases. Most existing databases, with few exceptions, are not designed to easily support large-scale computational analysis, but rather offer exclusively a web interface to the resource. We have recognized the growing need for a database which can be used successfully as a backend to computational analysis tools and pipelines. Such a database should be sufficiently versatile to allow easy system integration.
TL;DR: This paper presents a distributed algorithm for evaluating generalized path queries, which are considered the basic querying mechanism for semistructured data, and follows a realistic model with distributed (non-shared) memory and message-passing between processors.
Abstract: Nowadays, we are required to deal with more complex data, prime examples of which are data on the Web, XML data, biological data, etc. There are already proposed abstractions to handle these kinds of data, in particular in terms of semistructured data models. A semistructured model conceives a database essentially as a finite directed labeled graph whose nodes represent objects, and whose edges represent relationships between objects. In this paper, we focus on path queries, which are considered the basic querying mechanism for semistructured data. In essence, such queries are used to navigate, or discover paths that conform to specifications captured by regular expressions. In order to make the navigation more useful, we consider generalized path queries, in which the symbols could optionally be weighted by numbers. Such numbers can express a variety of information about the data that the query could possibly match or navigate. Motivated by the plethora of today's applications utilizing Web services and peer-to-peer architectures, we present a distributed algorithm for evaluating generalized path queries. We follow a realistic model with distributed (non-shared) memory and message-passing between processors. An optimal solution to the problem lies in the intersection of ideas related to distributed query evaluation, distributed shortest path computation, and queueing systems.
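To make the idea of a weighted path query over a labeled graph concrete, here is a minimal single-process sketch; it is not the distributed algorithm of the paper. It assumes the query is already given as a finite automaton whose transitions carry non-negative weights, and that the weight of a path is the sum of the weights of the symbols it matches. The graph, the automaton, and all names are hypothetical.

```python
import heapq


def evaluate_path_query(edges, nfa, start_state, final_states, source):
    """Cheapest cost of reaching each node from `source` along an accepted path.

    edges: list of (u, label, v) graph edges.
    nfa:   dict mapping (state, label) -> list of (next_state, weight),
           with non-negative weights (required by Dijkstra's algorithm).
    """
    adjacency = {}
    for u, label, v in edges:
        adjacency.setdefault(u, []).append((label, v))

    # Dijkstra over the product of graph nodes and automaton states.
    best = {(source, start_state): 0}
    heap = [(0, source, start_state)]
    answers = {}
    while heap:
        cost, node, state = heapq.heappop(heap)
        if cost > best.get((node, state), float("inf")):
            continue  # stale heap entry
        if state in final_states and node not in answers:
            answers[node] = cost  # first accepting pop is the cheapest
        for label, nxt in adjacency.get(node, []):
            for nstate, weight in nfa.get((state, label), []):
                ncost = cost + weight
                if ncost < best.get((nxt, nstate), float("inf")):
                    best[(nxt, nstate)] = ncost
                    heapq.heappush(heap, (ncost, nxt, nstate))
    return answers


# Hypothetical database graph: nodes are proteins, edges are labeled relations.
edges = [
    ("p1", "binds", "p2"),
    ("p2", "regulates", "p3"),
    ("p3", "regulates", "p4"),
    ("p1", "regulates", "p5"),
    ("p1", "binds", "p3"),
]

# Automaton for the regular expression  binds . regulates*
# with weight 1 for "binds" and weight 2 for each "regulates".
nfa = {
    (0, "binds"): [(1, 1)],
    (1, "regulates"): [(1, 2)],
}

answers = evaluate_path_query(edges, nfa, start_state=0, final_states={1}, source="p1")
print(answers)
```

In a distributed setting the graph would be partitioned across processors and the relaxation steps exchanged as messages; the product construction shown here stays the same.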
TL;DR: This work proposes a new layout strategy, called Stellar, that provides significantly improved search performance on a representative set of real genomic sequences, and supports both the standard root-to-leaf lookup queries as well as sophisticated sequence search algorithms that exploit the suffix-links of suffix-trees.
Abstract: Suffix-trees are popular indexing structures for various sequence processing problems in biological data management. We investigate here the possibility of enhancing the search efficiency of disk-resident suffix-trees through customized layouts of tree-nodes to disk-pages. Specifically, we propose a new layout strategy, called Stellar, that provides significantly improved search performance on a representative set of real genomic sequences. Further, Stellar supports both the standard root-to-leaf lookup queries as well as sophisticated sequence search algorithms that exploit the suffix-links of suffix-trees. Our results are encouraging with regard to the ultimate objective of seamlessly integrating sequence processing in database engines.
TL;DR: This chapter describes the computational problems and their solutions in automated extraction of relationships among different biological entities, and presents some recent advances made in this area.
Abstract: Identification of relationships among different biological entities, e.g., genes, proteins, diseases, drugs and chemicals, etc, is an important problem for biological researchers. While such information can be extracted from different types of biological data (e.g., gene and protein sequences, protein structures), a significant source of such knowledge is the biological textual research literature which is increasingly being made available as large-scale public-domain electronic databases (e.g., the Medline database). Automated extraction of such relationships (e.g., gene A inhibits protein B) from textual data can significantly enhance biological research productivity by keeping researchers up-to-date with the state-of-the-art in their research domain, by helping them visualize biological pathways, and by generating likely new hypotheses concerning novel interactions some of which can be good candidates for further biological research and validation. In this chapter, we describe the computational problems and their solutions in such automated extraction of relationships, and present some recent advances made in this area.
TL;DR: A data structure called the inexact-suffix tree is introduced and an algorithm based on this data structure is presented, which has been tested on primarily biological data such as DNA and protein sequences, but is equally applicable in other data mining, clustering, and knowledge extraction applications.
TL;DR: An algorithm for automated classification based on evolutionary distance data, written in S, is presented, and its use to address problems in prokaryotic taxonomy is discussed.
Abstract: Motivation: Rapid, automated means of organizing biological data are required if we hope to keep abreast of the flood of data emanating from sequencing, microarray and similar high-throughput analyses. Faced with the need to validate the annotation of thousands of sequences and to generate biologically meaningful classifications based on the sequence data, we turned to statistical methods in order to automate these processes.
Results: An algorithm for automated classification based on evolutionary distance data was written in S. The algorithm was tested on a dataset of 1436 small subunit ribosomal RNA sequences and was able to classify the sequences according to an extant scheme, use statistical measurements of group membership to detect sequences that were misclassified within this scheme and produce a new classification. In this study, the use of the algorithm to address problems in prokaryotic taxonomy is discussed.
Availability: S-Plus is available from Insightful, Inc. An S-Plus implementation of the algorithm and the associated data are available at http://taxoweb.mmg.msu.edu/datasets
Contact: garrity@msu.edu
TL;DR: This paper presents two cluster validity indices intended to evaluate clusterings of gene expression data in a biological manner.
Abstract: With the invention of biotechnological high throughput methods like DNA microarrays and the analysis of the resulting huge amounts of biological data, clustering algorithms gain new popularity. In practice the question arises which clustering algorithm and which parameter set generate the most promising results. Little work has addressed the question of evaluating and comparing the clustering results, especially according to their biological relevance, or of distinguishing biologically interesting clusters from less interesting ones. This paper presents two cluster validity indices intended to evaluate clusterings of gene expression data in a biological manner.
TL;DR: This paper presents the Bio2X system that transforms flat file data into highly hierarchical XML data using a rule-based machine learning technique and demonstrates the effectiveness of the Bio2X approach.
Abstract: Data integration of geographically dispersed, heterogeneous, complex biological databases is a key research area. One of the key features of a successful data integration system is to have a simple self-describing data exchange format. However, many of the biological databases provide data in flat files which are poor data exchange formats. Fortunately, XML can be viewed as a powerful data model and better data exchange format. In this paper, we present the Bio2X system that transforms flat file data into highly hierarchical XML data using a rule-based machine learning technique. Bio2X has been fully implemented using Java. Our experiments to transform real world biological data demonstrate the effectiveness of the Bio2X approach.
TL;DR: The first prototype of INDUS (intelligent data understanding system), a federated, query-centric system for information integration and knowledge acquisition from distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables, is presented.
Abstract: We present the first prototype of INDUS (intelligent data understanding system), a federated, query-centric system for information integration and knowledge acquisition from distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to enable a user to view a collection of such data sources (regardless of location, internal structure and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology.
TL;DR: The NDA mappings generated by these networks, together with unsupervised representations of the data, lead to a deeper understanding of complex high dimensional data like leukemia and Alzheimer gene expression microarray experiments.
Abstract: A hybrid stochastic-deterministic approach for solving NDA problems on very high dimensional biological data is investigated. It is based on networks trained with a combination of simulated annealing and conjugate gradient within a broad scale, high throughput computing data mining environment. High quality networks from the point of view of both discrimination and generalization capabilities are discovered. The NDA mappings generated by these networks, together with unsupervised representations of the data, lead to a deeper understanding of complex high dimensional data like leukemia and Alzheimer gene expression microarray experiments.
TL;DR: A new tool is presented, called BlastQuest, which relies on database technology and provides sophisticated interactive and Web-enabled query, analysis, and visualization facilities for genomics data, and a new concept, called Genomics Algebra, for solving genomic data management problems from a broader perspective is briefly sketched.
Abstract: Besides domain-specific biological problems, biologists are confronted with many computational problems. The large amount of varying, heterogeneous, and semi-structured biological data, the increasing complexity of biological applications, methods, and tools afflicted with uncertainty and missing knowledge, as well as the lacking interoperability of available tools necessitate integrative measures to enable biology workflow. In this paper we address these problems in the context of the processing and evaluation of BLAST query results. We present a new tool, called BlastQuest, which relies on database technology and provides sophisticated interactive and Web-enabled query, analysis, and visualization facilities for genomics data. The interface with the Gene Ontology and the KEGG pathway databases decisively fosters the biological workflow. Finally, based on our experience with BlastQuest, we briefly sketch a new concept, called Genomics Algebra, for solving genomic data management problems from a broader perspective.
TL;DR: In this paper, a method for analyzing biological data includes classifying a first set of biological data in a first classifier, classifying a second set in a second classifier, combining the results of the two classifiers, and analyzing the results as a function of the similarity measure of the first classifier and the similarity measure of the second classifier.
Abstract: A method for analyzing biological data includes classifying a first set of biological data in a first classifier, classifying a second set of biological data in a second classifier, combining the results of the first classifier with the results of the second classifier, and analyzing the results as a function of the similarity measure of the first classifier and the similarity measure of the second classifier.
TL;DR: This work discusses how this challenge was addressed in the course of developing the Integrated Microbial Genomes (IMG) system for comparative analysis of microbial genome data.
Abstract: Biological data management includes the traditional areas of data generation, acquisition, modelling, integration, and analysis. Although numerous academic biological data management systems are currently available, employing them effectively remains a significant challenge. We discuss how this challenge was addressed in the course of developing the Integrated Microbial Genomes (IMG) system for comparative analysis of microbial genome data.
TL;DR: The Rat Genome Database has developed a comprehensive ontology-based data structure and annotation system to integrate physiological data along with environmental and experimental factors, as well as genetic and genomic information.
Abstract: The strength of the rat as a model organism lies in its utility in pharmacology, biochemistry and physiology research. Data resulting from such studies is difficult to represent in databases and the creation of user-friendly data mining tools has proved difficult. The Rat Genome Database has developed a comprehensive ontology-based data structure and annotation system to integrate physiological data along with environmental and experimental factors, as well as genetic and genomic information. RGD uses multiple ontologies to integrate complex biological information from the molecular level to the whole organism, and to develop data mining and presentation tools. This approach allows RGD to indicate not only the phenotypes seen in a strain but also the specific values under each diet and atmospheric condition, as well as gender differences. Harnessing the power of ontologies in this way allows the user to gather and filter data in a customized fashion, so that a researcher can retrieve all phenotype readings for which a high hypoxia is a factor. Utilizing the same data structure for expression data, pathways and biological processes, RGD will provide a comprehensive research platform which allows users to investigate the conditions under which biological processes are altered and to elucidate the mechanisms of disease.
TL;DR: In this article, the authors propose a method that easily prevents leakage of biological data by performing authentication processing while keeping the biological data, which are personal information of a user, confidential.
Abstract: PROBLEM TO BE SOLVED: To easily prevent leakage of biological data by performing authentication processing, while making confidential the biological data which are personal information of a user, as they are. SOLUTION: Similarity (normalized Hamming distance HD_irs) between registered data f(A_k, r_k), resulting from converting biological data by a conversion means, and data f(B_k, r_k) to be authenticated, resulting from converting the biological data by the conversion means, is calculated and decided, to authenticate the data according to the result of the decision. COPYRIGHT: (C)2005,JPO&NCIPI
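The comparison of converted data by normalized Hamming distance can be sketched as follows. The abstract does not specify the conversion f, so this example assumes, purely for illustration, a bitwise XOR with a per-user random key r. XOR with a shared key leaves the Hamming distance between two bit strings unchanged, so matching can be done without exposing the raw data. The bit strings, the key, the 0.25 threshold, and the function names are all hypothetical.

```python
def transform(bits, key):
    """Hypothetical conversion f(bits, key): XOR each data bit with the key bit."""
    if len(bits) != len(key):
        raise ValueError("data and key must have the same length")
    return [b ^ k for b, k in zip(bits, key)]


def normalized_hamming(x, y):
    """Fraction of positions at which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("bit strings must have the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def authenticate(registered, candidate, threshold=0.25):
    """Accept if the converted templates are close enough (threshold is an assumption)."""
    return normalized_hamming(registered, candidate) <= threshold


# Hypothetical 8-bit templates: B is a noisy re-capture of A, the impostor is unrelated.
A = [1, 0, 1, 1, 0, 0, 1, 0]
B = [1, 0, 1, 0, 0, 0, 1, 0]
impostor = [0, 1, 0, 0, 1, 1, 0, 1]
r = [0, 1, 1, 0, 1, 0, 0, 1]  # per-user random key

registered = transform(A, r)  # only the converted data is stored
genuine_ok = authenticate(registered, transform(B, r))
impostor_ok = authenticate(registered, transform(impostor, r))
print(genuine_ok, impostor_ok)
```

The design point is that the decision uses only converted data on both sides, so the stored template never needs to be converted back to the raw data.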