TL;DR: WikiPathways provides a collaborative platform for creating, updating, and sharing pathway diagrams and serves as an example of content curation by the biology community.
Abstract: The exponential growth of diverse types of biological data presents the research community with an unprecedented challenge and opportunity. The challenge is to stay afloat in the flood of biological data, keeping it as accessible, up-to-date, and integrated as possible. The opportunity is to cultivate new models of data curation and exchange that take advantage of direct participation by a greater portion of the community.
This combination of challenge and opportunity is especially relevant to the task of collecting biological pathway information. Pathways are critical to understanding the functions of individual genes and proteins in terms of systems and processes that contribute to normal physiology and to disease. Each biological pathway must be hewn from a mass of biological information distributed across multiple publications and databases.
The particular challenge of pathway curation is amplified, because pathways are often presented as static images that are not amenable to computation, integration, or data exchange. Furthermore, pathway experts are distributed throughout the world, and most have limited time to learn about complex databases that need their expertise. This challenge can be met by taking the opportunity to develop a new community-based model for pathway curation.
One way to engage the community is with a wiki model, as exemplified by Wikipedia [1]. We see the potential for a wiki-based pathway curation resource, coupled with an embedded graphical pathway editing tool, to meet the growing challenge presented by the influx of biological data and to provide an innovative example of content curation by the biology community (Figure 1).
Figure 1. Two Models for Managing Biological Data
TL;DR: The book The Analysis of Biological Data is available in both printed and electronic form; many readers cannot carry the printed book with them, which is why they cannot read it wherever they want.
Abstract: Imagine gaining valuable experience and knowledge just by reading a book. Books are now available in both printed and electronic form, and The Analysis of Biological Data is one of them. The printed edition is the usual choice. However, many readers cannot carry the printed book with them, which is why they cannot read it wherever they want.
TL;DR: This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications.
Abstract: Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available at http://zope.bioinfo.cnio.es/bionlp_tools/.
TL;DR: This work demonstrates that MI3 outperforms the frequently used control methods, and provides a powerful method for inferring mechanistic relationships underlying biological and other complex systems.
Abstract: Probability-based statistical learning methods such as mutual information and Bayesian networks have emerged as a major category of tools for reverse engineering mechanistic relationships from quantitative biological data. In this work we introduce a new statistical learning strategy, MI3, that addresses three common issues in previous methods simultaneously: (1) handling of continuous variables, (2) detection of more complex three-way relationships and (3) better differentiation of causal versus confounding relationships. With these improvements, we provide a more realistic representation of the underlying biological system.
TL;DR: In this article, the authors present a methodology that jointly analyzes two sets of microarray data, one in the presence and one in the absence of a disease, identifying gene pairs whose correlation with disease is due to cooperative, rather than independent, contributions of genes, using the recently developed information theoretic measure of synergy.
Abstract: Analysis of microarray data has been used for the inference of gene-gene interactions. If, however, the aim is the discovery of disease-related biological mechanisms, then the criterion for defining such interactions must be specifically linked to disease. Here we present a computational methodology that jointly analyzes two sets of microarray data, one in the presence and one in the absence of a disease, identifying gene pairs whose correlation with disease is due to cooperative, rather than independent, contributions of genes, using the recently developed information theoretic measure of synergy. High levels of synergy in gene pairs indicate possible membership of the two genes in a shared pathway and lead to a graphical representation of inferred gene-gene interactions associated with disease, in the form of a "synergy network." We apply this technique to a set of publicly available prostate cancer expression data and successfully validate our results, confirming that they cannot be due to pure chance and providing a biological explanation for gene pairs with exceptionally high synergy. Thus, synergy networks provide a computational methodology helpful for deriving "disease interactomes" from biological data. When coupled with additional biological knowledge, they can also be helpful for deciphering biological mechanisms responsible for disease.
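The synergy score described above can be illustrated with a minimal sketch. It assumes the common bivariate definition, Syn(G1, G2; C) = I(G1, G2; C) - [I(G1; C) + I(G2; C)], and expression values that are already discretized; the abstract does not spell out the formula or the discretization, so both are assumptions, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy in bits of a sequence of hashable outcomes."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def synergy(g1, g2, c):
    """Assumed definition: I(G1,G2;C) - [I(G1;C) + I(G2;C)]."""
    joint = list(zip(g1, g2))
    independent = mutual_information(g1, c) + mutual_information(g2, c)
    return mutual_information(joint, c) - independent


# Hypothetical toy data: disease status is the XOR of two binarized genes,
# so neither gene is informative alone but the pair determines the phenotype.
g1 = [0, 0, 1, 1]
g2 = [0, 1, 0, 1]
disease = [0, 1, 1, 0]
score = synergy(g1, g2, disease)
print(score)
```

In this toy case each gene alone carries no information about the phenotype, so the whole bit of information comes from the pair acting jointly. Real expression data would need a discretization step and a significance test, which this sketch omits.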
TL;DR: Experiments show that the execution time and memory usage of GenMiner, as well as the number of association rules it extracts, are significantly smaller than those of the standard Apriori-based approach.
Abstract: GenMiner is an implementation of association rule discovery dedicated to the analysis of genomic data. It allows the analysis of datasets integrating multiple sources of biological data represented as both discrete values, such as gene annotations, and continuous values, such as gene expression measures. GenMiner implements the new NorDi (normal discretization) algorithm for normalizing and discretizing continuous values and takes advantage of the Close algorithm to efficiently generate minimal non-redundant association rules. Experiments show that the execution time and memory usage of GenMiner, as well as the number of association rules it extracts, are significantly smaller than those of the standard Apriori-based approach.
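As a rough illustration of what an association rule over mixed annotation and discretized expression items looks like, the sketch below computes the standard support and confidence measures. It is not GenMiner, NorDi, or Close; the item labels and the toy dataset are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical data: one set of items per gene, mixing an annotation
# with an already discretized expression level.
genes = [
    {"annot:cell_cycle", "expr:high"},
    {"annot:cell_cycle", "expr:high"},
    {"annot:cell_cycle", "expr:low"},
    {"annot:transport", "expr:low"},
]

rule_support = support(genes, {"annot:cell_cycle", "expr:high"})
rule_confidence = confidence(genes, {"annot:cell_cycle"}, {"expr:high"})
print(rule_support, rule_confidence)
```

A rule such as "annot:cell_cycle implies expr:high" is kept only if both measures exceed user-chosen thresholds.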
TL;DR: This paper proposes a genetic weighted K-means algorithm (denoted by GWKMA), which has general application to clustering large-scale biological data such as gene expression data and peptide mass spectral data.
Abstract: The traditional (unweighted) k-means is one of the most popular clustering methods for analyzing gene expression data. However, it suffers from three major shortcomings. It is sensitive to initial partitions, its result is prone to local minima, and it is only applicable to data with spherical-shape clusters. The last shortcoming means that we must assume that gene expression data at the different conditions follow independent distributions with the same variance. This assumption does not hold in practice. In this paper, we propose a genetic weighted K-means algorithm (denoted by GWKMA), which solves the first two problems and partially remedies the third one. GWKMA is a hybridization of a genetic algorithm (GA) and a weighted K-means algorithm (WKMA). In GWKMA, each individual is encoded by a partitioning table which uniquely determines a clustering, and three genetic operators (selection, crossover, mutation) and a WKM operator derived from WKMA are employed. The superiority of the GWKMA over the k-means is illustrated on a synthetic and two real-life gene expression datasets. The proposed algorithm has general application to clustering large-scale biological data such as gene expression data and peptide mass spectral data.
TL;DR: A combined model comprising three modules is tested; one module represents the early lineage of the somatic gonad of Caenorhabditis elegans in LSCs, and another represents an interaction between two cells within this lineage that determines their developmental outcome.
Abstract: The last several decades have witnessed a vast accumulation of biological data and data analysis. Many of these data sets represent only a small fraction of the system's behavior, making the visualization of full system behavior difficult. A more complete understanding of a biological system is gained when different types of data (and/or conclusions drawn from the data) are integrated into a larger scale representation or model of the system. Ideally, this type of model is consistent with all available data about the system, and it is then used to generate additional hypotheses to be tested. Computer-based methods intended to formulate models that integrate various events and to test the consistency of these models with respect to the laboratory-based observations on which they are based are potentially very useful. In addition, in contrast to informal models, the consistency of such formal computer-based models with laboratory data can be tested rigorously by methods of formal verification. We combined two formal modeling approaches in computer science that were originally developed for nonbiological system design. One is the interobject approach using the language of live sequence charts (LSCs) with the Play-Engine tool, and the other is the intraobject approach using the language of statecharts and Rhapsody as the tool. Integration is carried out using InterPlay, a simulation engine coordinator. Using these tools, we constructed a combined model comprising three modules. One module represents the early lineage of the somatic gonad of Caenorhabditis elegans in LSCs, whereas a second more detailed module in statecharts represents an interaction between two cells within this lineage that determine their developmental outcome. Using the advantages of the tools, we created a third module representing a set of key experimental data using LSCs. 
We tested the combined statechart-LSC model by showing that the simulations were consistent with the set of experimental LSCs. This small-scale modular example demonstrates the potential for using similar approaches for verification by exhaustive testing of models by LSCs. It also shows the advantages of these approaches for modeling biology.
TL;DR: The REBECCA databases contain the biological elements phytoplankton, macrophytes, macroinvertebrates and fish, together with relevant chemistry data and station information.
Abstract: Chemical and biological data from more than 5,000 lakes in 20 European countries have been compiled into databases within the EU project REBECCA. The project’s purpose was to provide scientific support for implementation of the EU Water Framework Directive (WFD). The databases contain the biological elements phytoplankton, macrophytes, macroinvertebrates and fish, together with relevant chemistry data and station information. The common database strategy has enabled project partners to perform analyses of chemical–biological relationships and to describe reference conditions for large geographic regions in Europe. This strategy has obvious benefits compared with single-country analyses: results will be more representative for larger European regions, and the statistical power and precision will be larger. The high number of samples within some regions has also enabled analysis of type-specific relationships for several lake types. These results are essential for the intercalibration of ecological assessment systems for lakes, as required by the WFD. However, the common database approach has also involved costs and limitations. The data process has been resource-demanding, and the requirements for a flexible database structure have made it less user-friendly for project partners. Moreover, there are considerable heterogeneities among datasets from different countries regarding sampling methods and taxonomic precision; this may reduce comparability of the data and increase the uncertainty of the results. This article gives an overview of the contents and functions of the REBECCA Lakes databases, and of our experiences from constructing and using the databases. We conclude with recommendations for compilation of environmental data for future international projects.
TL;DR: The rapid accumulation of biological data, its various organizational structures, and the role that ontologies play in interoperability are described.
Abstract: The past twenty years have witnessed an explosion of biological data in diverse database formats governed by heterogeneous infrastructures. Not only are semantics (attribute terms) different in meaning across databases, but their organization varies widely. Ontologies are a concept imported from computing science to describe different conceptual frameworks that guide the collection, organization and publication of biological data. An ontology is similar to a paradigm but has very strict implications for formatting and meaning in a computational context. The use of ontologies is a means of communicating and resolving semantic and organizational differences between biological databases in order to enhance their integration. The purpose of interoperability (or sharing between divergent storage and semantic protocols) is to allow scientists from around the world to share and communicate with each other. This paper describes the rapid accumulation of biological data, its various organizational structures, and the role that ontologies play in interoperability.
TL;DR: A computational framework for utilizing data from multiple sources to infer TF-target gene relationships on the basis of NMs is introduced, useful for inferring small NM-based modules of TF- target gene relationships that can serve as a basis for generating new testable hypotheses.
Abstract: Integrating data from multiple global assays and curated databases is essential to understand the spatio-temporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published data. Integrating these complementary datasets helps infer a mutually consistent transcriptional regulatory network (TRN) with strong similarity to the structure of the underlying genetic regulatory modules. Decomposing the TRN into a small set of recurring regulatory patterns, called network motifs (NM), facilitates the inference. Identifying NMs defined by specific transcription factors (TF) establishes the framework structure of a TRN and allows the inference of TF-target gene relationship. This paper introduces a computational framework for utilizing data from multiple sources to infer TF-target gene relationships on the basis of NMs. The data include time course gene expression profiles, genome-wide location analysis data, binding sequence data, and gene ontology (GO) information. The proposed computational framework was tested using gene expression data associated with cell cycle progression in yeast. Among 800 cell cycle related genes, 85 were identified as candidate TFs and classified into four previously defined NMs. The NMs for a subset of TFs are obtained from literature. Support vector machine (SVM) classifiers were used to estimate NMs for the remaining TFs. The potential downstream target genes for the TFs were clustered into 34 biologically significant groups. The relationships between TFs and potential target gene clusters were examined by training recurrent neural networks whose topologies mimic the NMs to which the TFs are classified. 
The identified relationships between TFs and gene clusters were evaluated using the following biological validation and statistical analyses: (1) Gene set enrichment analysis (GSEA) to evaluate the clustering results; (2) Leave-one-out cross-validation (LOOCV) to ensure that the SVM classifiers assign TFs to NM categories with high confidence; (3) Binding site enrichment analysis (BSEA) to determine enrichment of the gene clusters for the cognate binding sites of their predicted TFs; (4) Comparison with previously reported results in the literature to confirm the inferred regulations. The major contribution of this study is the development of a computational framework to assist the inference of TRN by integrating heterogeneous data from multiple sources and by decomposing a TRN into NM-based modules. The inference capability of the proposed framework is verified statistically (e.g., LOOCV) and biologically (e.g., GSEA, BSEA, and literature validation). The proposed framework is useful for inferring small NM-based modules of TF-target gene relationships that can serve as a basis for generating new testable hypotheses.
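The leave-one-out cross-validation step mentioned above can be sketched as follows. The abstract uses SVM classifiers; to keep the example dependency-free, a 1-nearest-neighbor classifier is swapped in as a stand-in, and the feature vectors and class labels are hypothetical.

```python
def nearest_neighbor_predict(train, query):
    """Label of the training sample closest to query (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda item: sq_dist(item[0], query))
    return label


def loocv_accuracy(samples, predict):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        training_set = samples[:i] + samples[i + 1:]
        if predict(training_set, features) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical (feature vector, motif class) pairs for six transcription factors.
samples = [
    ((0.0, 0.1), "A"), ((0.2, 0.0), "A"), ((0.1, 0.2), "A"),
    ((1.0, 0.9), "B"), ((0.9, 1.1), "B"), ((1.1, 1.0), "B"),
]
accuracy = loocv_accuracy(samples, nearest_neighbor_predict)
print(accuracy)
```

Because `loocv_accuracy` takes the prediction function as an argument, any classifier with the same `(training_set, query)` interface could be evaluated in the same loop.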
TL;DR: The techniques of data mining are introduced to the study of 3D biological shapes to bring the analyses of phenomes closer to the efficiency of studying genomes and represent a first step in the automatic analysis of 3D phenotypes.
Abstract: The ability to analyze and classify three-dimensional (3D) biological morphology has lagged behind the analysis of other biological data types such as gene sequences. Here, we introduce the techniques of data mining to the study of 3D biological shapes to bring the analyses of phenomes closer to the efficiency of studying genomes. We compiled five training sets of highly variable morphologies of mammalian teeth from the MorphoBrowser database. Samples were labeled either by dietary class or by conventional dental types (e.g. carnassial, selenodont). We automatically extracted a multitude of topological attributes using Geographic Information Systems (GIS)-like procedures that were then used in several combinations of feature selection schemes and probabilistic classification models to build and optimize classifiers for predicting the labels of the training sets. In terms of classification accuracy, computational time and size of the feature sets used, non-repeated best-first search combined with 1-nearest neighbor classifier was the best approach. However, several other classification models combined with the same searching scheme proved practical. The current study represents a first step in the automatic analysis of 3D phenotypes, which will be increasingly valuable with the future increase in 3D morphology and phenomics databases.
TL;DR: The results show that user-controlled intuitive approaches, based on combination of simple rules, can reliably automate various curation tasks, reducing the need for manual corrections to approximately 5% of the records.
Abstract: Background
The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.
TL;DR: The Therapy Imaging and Model Management System (TIMMS) is a comprehensive surgical assist system that integrates the necessary information and communication technology, workflow analysis, data processing and data synthesis, interactive interfaces between surgeon and mechatronic devices, and cognitive agents to provide comprehensive assistance and guidance throughout complex medical and surgical therapies, such as image guided surgery.
Abstract: This invention provides a process and system for a comprehensive surgical assist system, called a Therapy Imaging and Model Management System (TIMMS), which combines and integrates all of the necessary information and communication technology; workflow analysis, data processing and data synthesis; interactive interfaces between surgeon and mechatronic devices; and, cognitive agents; to provide comprehensive assistance and guidance throughout complex medical and surgical therapies, such as image guided surgery. The components of this invention, which are modular, scalable and may be distributed in location, act synergistically to provide functionality and utility that exceeds the sum of its individual parts. A method of performing surgery on a patient comprising the step of comparing a chosen patient's data to statistical data in a repository of patient data to develop a patient specific model, wherein the data comprises information from two or more sub databases selected from the group consisting of workflow data, electronic medical records, diagnostic data, biological data, measurement data, anatomical data, physiological data, genetic data, molecular data, imaging data, chemical data, clinical laboratory data, simulated data, coordinate data and surgical result and wherein the patient specific model aids in the preoperative, operative or post operative phase of surgery performed in real time on the patient.
TL;DR: A novel fuzzy methodology based on a fuzzy association rule mining method for biological knowledge extraction is proposed over a yeast genome dataset containing heterogeneous information regarding structural and functional genome features.
Abstract: The mapping of diverse genomes in recent years has generated huge amounts of biological data, which are currently dispersed across many databases. Integration of the information available in the various databases is required to unveil possible associations relating already known data. Biological data are often imprecise and noisy. Fuzzy set theory is especially suitable for modeling imprecise data, while association rules are very appropriate for integrating heterogeneous data. In this work we propose a novel fuzzy methodology based on a fuzzy association rule mining method for biological knowledge extraction. We apply this methodology to a yeast genome dataset containing heterogeneous information regarding structural and functional genome features. A number of association rules have been found, many of them agreeing with previous research in the area. In addition, a comparison between crisp and fuzzy results shows the fuzzy associations to be more reliable than crisp ones. An integrative approach such as the one carried out in this work can unveil significant knowledge which is currently hidden and dispersed through the existing biological databases. It is shown that fuzzy association rules can model this knowledge in an intuitive way by using linguistic labels and a few easily understandable parameters.
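A minimal sketch of how support and confidence extend to fuzzy items is given below. It assumes each record stores a membership degree in [0, 1] for every linguistic label and combines labels with the minimum t-norm; the abstract does not state which operators the authors use, so this choice, the labels, and the data are all assumptions.

```python
def fuzzy_support(records, labels):
    """Mean over records of the minimum membership degree across labels."""
    return sum(min(r[label] for label in labels) for r in records) / len(records)


def fuzzy_confidence(records, antecedent, consequent):
    """Fuzzy support of antecedent and consequent together, divided by that of the antecedent."""
    joint = fuzzy_support(records, list(antecedent) + list(consequent))
    return joint / fuzzy_support(records, antecedent)


# Hypothetical membership degrees for two linguistic labels per gene.
records = [
    {"gc_content_high": 0.9, "expression_high": 0.8},
    {"gc_content_high": 0.6, "expression_high": 0.7},
    {"gc_content_high": 0.1, "expression_high": 0.2},
    {"gc_content_high": 0.0, "expression_high": 0.5},
]

rule_support = fuzzy_support(records, ["gc_content_high", "expression_high"])
rule_confidence = fuzzy_confidence(records, ["gc_content_high"], ["expression_high"])
print(rule_support, rule_confidence)
```

With crisp data (memberships of exactly 0 or 1) these functions reduce to ordinary support and confidence, which is what makes a crisp versus fuzzy comparison straightforward.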
TL;DR: An automated approach involving an in silico experimental workflow tool, Taverna, is developed using web services, BioMart and MOLGENIS technologies for data retrieval, and focuses on the current impediments to adopting such an approach in a wider context, and strategies to overcome them.
Abstract: The torrent of data emerging from the application of new technologies to functional genomics and systems biology can no longer be contained within the traditional modes of data sharing and publication, with the consequence that data is being deposited in, distributed across and disseminated through an increasing number of databases. The resulting fragmentation poses serious problems for the model organism community, which increasingly relies on data mining and computational approaches that require gathering of data from a range of sources. In the light of these problems, the European Commission has funded a coordination action, CASIMIR (coordination and sustainability of international mouse informatics resources), with a remit to assess the technical and social aspects of database interoperability that currently prevent the full realization of the potential of data integration in mouse functional genomics. In this article, we assess the current problems with interoperability, with particular reference to mouse functional genomics, and critically review the technologies that can be deployed to overcome them. We describe a typical use-case where an investigator wishes to gather data on variation, genomic context and metabolic pathway involvement for genes discovered in a genome-wide screen. We go on to develop an automated approach involving an in silico experimental workflow tool, Taverna, using web services, BioMart and MOLGENIS technologies for data retrieval. Finally, we focus on the current impediments to adopting such an approach in a wider context, and strategies to overcome them.
TL;DR: Improved predictive ability of the different classes is seen, suggesting that this approach is a suitable method for data integration and visualization of biological data and for correlation of biological parameters between the different data types, which could lead to an improvement in biological interpretation.
TL;DR: In this paper, a dynamic network module is defined as a set of proteins satisfying two conditions: (1) they form a connected component in the protein-protein interaction (PPI) network; and (2) their expression profiles form certain structures in the temporal domain.
Abstract: Despite the rapid accumulation of systems-level biological data, understanding the dynamic nature of cellular activity remains a difficult task. The reason is that most biological data are static, or only correspond to snapshots of cellular activity. In this study, we explicitly attempt to detangle the temporal complexity of biological networks by using compilations of time-series gene expression profiling data. We define a dynamic network module to be a set of proteins satisfying two conditions: (1) they form a connected component in the protein-protein interaction (PPI) network; and (2) their expression profiles form certain structures in the temporal domain. We develop an efficient mining algorithm to discover dynamic modules in a temporal network. Using yeast as a model system, we demonstrate that the majority of the identified dynamic modules are functionally homogeneous. Additionally, many of them provide insight into the sequential ordering of molecular events in cellular systems. Finally, we note that the applicability of our algorithm is not limited to the study of PPI networks; instead, it is generally applicable to the combination of any type of network and time-series data.
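The two conditions in the module definition above can be sketched with a simple seed-growing search. This is not the authors' mining algorithm: the abstract leaves the temporal structure unspecified, so the sketch substitutes Pearson correlation with the seed's expression profile as a stand-in, and the network, profiles, and threshold are hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def grow_module(ppi, profiles, seed, min_corr=0.9):
    """Breadth-first search over PPI edges, keeping proteins co-expressed with the seed.

    Condition (1), connectivity, holds because proteins are only reached via edges
    from module members; condition (2) is approximated by the correlation threshold.
    """
    module = {seed}
    queue = deque([seed])
    while queue:
        protein = queue.popleft()
        for neighbor in ppi.get(protein, ()):
            if neighbor in module:
                continue
            if pearson(profiles[seed], profiles[neighbor]) >= min_corr:
                module.add(neighbor)
                queue.append(neighbor)
    return module


# Hypothetical undirected PPI network (adjacency lists) and time-series profiles.
ppi = {"P1": ["P2", "P4"], "P2": ["P1", "P3"], "P3": ["P2"], "P4": ["P1"]}
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.1, 6.0, 8.2],
    "P3": [0.5, 1.0, 1.4, 2.1],
    "P4": [4.0, 3.0, 2.0, 1.0],
}
module = grow_module(ppi, profiles, "P1")
print(sorted(module))
```

Here P4 interacts with the seed but is anti-correlated with it, so it is excluded, while P3 is reached through P2.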
TL;DR: Given the complex nature of experimental and biological data, it is necessary and feasible to complement PCA with methods that take into consideration the nonlinear properties of biological systems for a more robust estimation of their DOFs.
Abstract: Studies of the degrees of freedom and "synergies" in musculoskeletal systems rely critically on algorithms to estimate the "dimension" of kinematic or neural data. Linear algorithms such as principal component analysis (PCA) are the most popular. However, many biological data (or realistic experimental data) may be better represented by nonlinear sets than linear subspaces. We evaluate the performance of PCA and compare it to two nonlinear algorithms [Isomap and our novel pointwise dimension estimation (PD-E)] using synthetic and motion capture data from a robotic arm with known kinematic dimensions, as well as motion capture data from human hands. We find that PCA can lead to more accurate dimension estimates when considering additional properties of the PCA residuals, instead of the dominant method of using a threshold of variance captured. In contrast to the single integer dimension estimates of PCA and Isomap, PD-E provides a distribution and range of estimates of fractal dimension that identify the heterogeneous geometric structure in the experimental data. A strength of the PD-E method is that it associates a distribution of dimensions to the data. Since there is no a priori reason to assume that the sets of interest have a single dimension, these distributions incorporate more information than a single summary statistic. Our preliminary findings suggest that fewer than ten DOFs are involved in some hand motion tasks. Contrary to common opinion regarding fractal dimension methods, PD-E yielded reasonable results with reasonable amounts of data. Given the complex nature of experimental and biological data, we conclude that it is necessary and feasible to complement PCA with methods that take into consideration the nonlinear properties of biological systems for a more robust estimation of their DOFs.
TL;DR: In this article, a case study from northern Australia is used to demonstrate a methodology for defining uncertainties which result from the extrapolation of bio-physical associations across areas where detailed biological data is absent.
TL;DR: The correspondence of these results with other analyses of these data and the additional insights generated from variogram analysis and kriging illustrate the potential utility of geostatistical analysis in biological anthropology.
TL;DR: A unified methodology is presented which allows many common (statistical) response models to be fitted to such data sets and is capable of handling problems with millions of variables and a large variety of response types within the one framework.
Abstract: Background
With the advent of high throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking.
TL;DR: The generation of some interpretable fuzzy rules for assigning an amino acid sequence into the appropriate protein superfamily is considered, using the distribution of amino acids in the sequences of proteins as features.
Abstract: This paper considers the generation of some interpretable fuzzy rules for assigning an amino acid sequence to the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of rules, we have used the distribution of amino acids in the sequences of proteins as features. These features are the occurrence probabilities of six exchange groups in the sequences. To generate the fuzzy rules, we have used some modified versions of a common approach. The generated rules are simple and understandable, especially for biologists. To evaluate our fuzzy classifiers, we have used four protein superfamilies from the UniProt database. Experimental results show the comprehensibility of generated fuzzy rules with comparable classification accuracy. Bioinformatics (4) is basically conceptualizing biology in terms of macromolecules and applying informatics techniques to understand and organize the information associated with these molecules. It deals primarily with the application of computer and statistical techniques to the management of biological information. Because of the Human Genome Project and other similar efforts, large amounts of biological data are regularly collected. It is important to organize and annotate this massive amount of sequential data to maximize its utility. In this regard, DNA sequences are translated into protein sequences using standard bioinformatics tools. Among these is protein sequence classification, which determines the type or group of proteins to which an unknown protein sequence belongs. One of the benefits from this type of category grouping is that molecular analysis can be carried out within a particular superfamily instead of an individual protein sequence. A protein superfamily consists of protein sequence members that are evolutionarily related and therefore functionally and structurally relevant to each other.
Several approaches dealing with the protein classification problem have been proposed in the past. These include alignment of protein sequences (2), hidden Markov modeling (14), application of artificial neural networks (23, 24, 25), using
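The exchange-group features mentioned in the abstract above (occurrence probabilities of six exchange groups) could be computed roughly as in this sketch. The group memberships below are an assumption: they follow a grouping commonly used in the protein classification literature, and the source does not list the groups it uses. The function name is hypothetical.

```python
# Assumed exchange groups (a commonly used grouping; not confirmed by the source).
EXCHANGE_GROUPS = {
    "e1": "HRK",
    "e2": "DENQ",
    "e3": "C",
    "e4": "STPAG",
    "e5": "MILV",
    "e6": "FYW",
}

# Reverse lookup: residue letter -> exchange group name.
RESIDUE_TO_GROUP = {
    residue: group
    for group, residues in EXCHANGE_GROUPS.items()
    for residue in residues
}


def exchange_group_frequencies(sequence):
    """Return the occurrence probability of each exchange group (e1..e6).

    Residues outside the six groups (e.g. 'X') are ignored, which is a
    choice made for this sketch.
    """
    counts = {group: 0 for group in EXCHANGE_GROUPS}
    for residue in sequence.upper():
        group = RESIDUE_TO_GROUP.get(residue)
        if group is not None:
            counts[group] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(EXCHANGE_GROUPS)
    return [counts[group] / total for group in EXCHANGE_GROUPS]


print(exchange_group_frequencies("MKVLHRDEC"))
```

Each sequence, whatever its length, is reduced to a six-element vector, which is what keeps rules built on these features short enough to read.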
TL;DR: An efficient parallel algorithm, based on the dominant points method, is introduced for finding an MLCS of an arbitrary number of sequences; its performance increases with the number of input sequences and reaches a near-linear speedup for eight sequences.
Abstract: Finding the multiple longest common subsequence (MLCS) is an important problem in the areas of bioinformatics and computational genomics. Approaches that are more efficient than the standard dynamic programming method have been introduced and successfully parallelized for the special case of two sequences. However, the increasing complexity and size of biological data require an efficient method applicable to an arbitrary number of sequences as well as its efficient parallelization. A recently developed dominant points method for the general MLCS problem has shown a significant performance improvement over the dynamic programming method when the number of sequences is larger than two. At the same time, the approach has revealed a strong demand for its parallelization, so that it can be applied to larger families of sequences or to sequences of greater length. In this paper, we introduce an efficient parallel algorithm to find an MLCS for an arbitrary number of sequences, which is based on the dominant points method. When the number of processors is not greater than the size of the alphabet multiplied by the number of sequences, the parallel algorithm is estimated to have an asymptotically linear speedup. We experimentally tested the algorithm using sets of randomly generated sequences over different alphabets as well as the protein sequences from a family of homologous proteins. We found that the performance of the algorithm increases with the number of input sequences and reaches a near-linear speedup for eight sequences.
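For reference, the standard dynamic programming method that the abstract above uses as its baseline looks like this for the two-sequence case. This is a textbook sketch only; it is neither the dominant points method nor the parallel algorithm, and the function name `lcs` is chosen for the example.

```python
def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two strings by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Trace back from the bottom-right corner to recover one LCS.
    result = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


print(lcs("AGGTAB", "GXTXAYB"))
```

The table has one axis per sequence, so extending this approach to many sequences makes time and memory grow with the product of all sequence lengths, which is the cost that motivates alternatives for the general MLCS problem.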
TL;DR: The conceptual basis and example applications of the Modeling ENvironment for TOtal Risk (MENTOR), and the DOse–Response Information ANalysis system (DORIAN) are presented.
Abstract: Advances in computer processing power and in the availability of environmental and biological data have allowed the development and application of comprehensive modeling systems that utilize a holistic, integrated approach for assessing the interactions of environmental and biological systems across multiple scales of spatiotemporal extent and biological organization. This approach allows mechanism-based environmental health risk assessments in a person-oriented framework, which accounts for simultaneous exposures to contaminants from multiple media, routes, and pathways. The conceptual basis and example applications of the Modeling ENvironment for TOtal Risk (MENTOR), and the DOse–Response Information ANalysis system (DORIAN) are presented.
TL;DR: This paper focuses on the mining of loosely structured motifs, i.e., of more general kinds of motif where several "exceptions" may be tolerated in pattern repetitions, and an algorithm exploiting data structures conceived to efficiently handle pattern variabilities is presented and analyzed.
Abstract: The discovery of information encoded in biological sequences is assuming a prominent role in identifying genetic diseases and in deciphering biological mechanisms. This information is usually encoded in patterns frequently occurring in the sequences, also called motifs. In fact, motif discovery has received much attention in the literature, and several algorithms have already been proposed, which are specifically tailored to deal with motifs exhibiting some kinds of "regular structure". Motivated by biological observations, this paper focuses on the mining of loosely structured motifs, i.e., of more general kinds of motif where several "exceptions" may be tolerated in pattern repetitions. To this end, an algorithm exploiting data structures conceived to efficiently handle pattern variabilities is presented and analyzed. Furthermore, a randomized variant with linear time and space complexity is introduced, and a theoretical guarantee on its performance is proven. Both algorithms have been implemented and tested on real data sets. Despite their ability to mine very complex kinds of pattern, performance results indicate that the proposed techniques are applicable genome-wide.
TL;DR: This paper highlights the importance of keeping provenance information about the links between data items from different sources, and proposes the use of named graphs to make a provenance statement about each pair of linked data items and each release of a data web.
Abstract: To create a linked data web of heterogeneous biological data resources, we need not only to define and create the alignment between related data resources but also to express the knowledge about why data items from different sources are linked with each other and how each data link has evolved, so that scientists can trust the data links provided by the data web. This paper highlights the importance of keeping provenance information about the links between data items from different sources, and proposes the use of named graphs to make a provenance statement about each pair of linked data items and each release of a data web.
TL;DR: The number of times a machine passes through each state is used as a numerical feature set for classification in this study, and side effect machines are found to be effective at recognizing classes of DNA sequence data.
Abstract: Finite state machines are routinely used to efficiently recognize patterns in strings. The internal state structure of the machine is typically only of peripheral interest, appearing in algorithms only when the number of states is minimized in the interests of efficiency of execution or comparison. A side effect machine saves information about the internal transitions of the state machine. This record of internal state transitions forms an induced feature set for any string run through the side effect machine. In this study the number of times a machine passes through each state is used as a numerical feature set for classification. Finite state machines are trained with an evolutionary algorithm to produce feature sets that are very easy for an unsupervised learning algorithm, k-means clustering, to learn. The system is demonstrated on synthetic and biological data. The biological data are PCR primers classified by their success at amplification. The parameters, number of states, population size, and mutation rates are explored to characterize their effect on performance. Side effect machines are found to be effective at recognizing classes of DNA sequence data.
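The feature extraction described in the abstract above, counting how often a machine passes through each state, might look like the sketch below. The transition table is a made-up three-state machine over the DNA alphabet, not one evolved by the authors, and the choice to count only states entered after reading a symbol (not the start state) is an assumption of this example.

```python
def state_visit_counts(sequence, transitions, start_state=0):
    """Run `sequence` through a finite state machine and return the
    number of times each state is entered.

    `transitions[state][symbol]` gives the next state. The start state
    is not counted until a transition enters it (assumption for this
    sketch), so the counts sum to len(sequence).
    """
    counts = [0] * len(transitions)
    state = start_state
    for symbol in sequence:
        state = transitions[state][symbol]
        counts[state] += 1
    return counts


# Hypothetical 3-state machine over the DNA alphabet.
TRANSITIONS = [
    {"A": 0, "C": 1, "G": 2, "T": 0},
    {"A": 1, "C": 2, "G": 0, "T": 1},
    {"A": 2, "C": 0, "G": 1, "T": 2},
]

print(state_visit_counts("ACGT", TRANSITIONS))  # -> [3, 1, 0]
```

The resulting count vectors have a fixed length equal to the number of states, so sequences of different lengths can be passed directly to a clustering method such as k-means.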
TL;DR: A selection of methods involving the generation and organization of data, including sequence data, RNA and protein structures, microarray expression data and functional annotations, methods for discovering the functional components of genomes, whether they be genes, alternative splice sites, non-coding RNAs or regulatory motifs, and several of the most interesting methods in phylogenetics and evolution are presented in this article.
Abstract: Not only is the quantity of life science data expanding, but new types of biological data continue to be introduced as a result of technological development and a growing understanding of biological systems. Methods for analyzing these data are an increasingly important component of modern biological research. In Bioinformatics, leading researchers in the field provide a selection of the most useful and widely applicable methods, able to be applied as is, or with minor variations, to many specific problems. Volume I: Data, Sequence Analysis and Evolution examines a selection of methods involving the generation and organization of data, including sequence data, RNA and protein structures, microarray expression data and functional annotations, methods for discovering the functional components of genomes, whether they be genes, alternative splice sites, non-coding RNAs or regulatory motifs, and several of the most interesting methods in phylogenetics and evolution.
TL;DR: This work presents a system to integrate data across multiple bioinformatics databases and enable mining across various conceptual levels of biological information, represented as complex networks.
Abstract: The emergence of systems biology necessitates the development of platforms to organise and interpret the plenitude of biological data. We present a system to integrate data across multiple bioinformatics databases and enable mining across various conceptual levels of biological information. The results are represented as complex networks. Context-dependent mining of these networks is achieved by the use of distances. Our approach is demonstrated with three applications: full metabolic network retrieval with network topology study, exploration of properties and relationships of a set of selected proteins, and combined visualisation and exploration of gene expression data with related pathways and ontologies.