TL;DR: This paper surveys the technical challenges to integrating biological data sources, classifies the approaches, and critiques the available tools and methodologies that have emerged to counter the increasing dispersion and heterogeneity of data.
Abstract: Scientific data of importance to biologists reside in a number of different data sources, such as GenBank, GSDB, SWISS-PROT, EMBL, and OMIM, among many others. Some of these data sources are conventional databases implemented using database management systems (DBMSs) and others are structured files maintained in a number of different formats (e.g., ASN.1 and ACE). In addition, software packages such as sequence analysis packages (e.g., BLAST and FASTA) produce data and can therefore be viewed as data sources. To counter the increasing dispersion and heterogeneity of data, different approaches to integrating these data sources are appearing throughout the bioinformatics community. This paper surveys the technical challenges to integration, classifies the approaches, and critiques the available tools and methodologies.
TL;DR: Significantly improved results in terms of performance, quality of results, and the scaling of applicability have been realized through non-standard and even counter-intuitive parameter settings.
Abstract: Applying genetic algorithms to DNA sequence assembly is not a straightforward process. Significantly improved results in terms of performance, quality of results, and the scaling of applicability have been realized through non-standard and even counter-intuitive parameter settings. Specifically, the solution time for a 10kb data set was reduced by an order of magnitude, and a 20kb data set that was previously unsolved by the genetic algorithm was solved in a time that represents only a linear increase from the 10kb data set. Additionally, significant progress has been made on a 35kb data set representing real biological data. A single contig solution was found for a 752 fragment subset of the data set, and a 15 contig solution was found for the full data set. This paper discusses the new results, the modifications to the previous genetic algorithm used in this study, the experimental design process by which the new results were obtained, the questions raised by these results, and some preliminary attempts to explain these results.
TL;DR: The analysis of a B. subtilis genome fragment allowed us to combine the results of several methods used for predicting coding sequences and to characterize the fragment as comprising a cryptic phage, the skin element. The analysis indicated that local features of the nucleotide sequence could discriminate between phage and non-phage DNA sequence.
TL;DR: A Bayesian approach for determining the coordinates of atoms in three-dimensional space is developed, and ways to extend the algorithm to non-Gaussian constraints are described.
Abstract: Modeling the structure of biological molecules is critical for understanding how these structures perform their function, and for designing compounds to modify or enhance this function (for medicinal or industrial purposes). The determination of molecular structure involves defining three-dimensional positions for each of the constituent atoms using a variety of experimental, theoretical and empirical data sources. Unfortunately, each of these data sources can be noisy or not available in sufficient abundance to determine the precise position of each atom. Instead, some atomic positions are precisely defined by the data, and others are poorly defined. An understanding of structural uncertainty is critical for properly interpreting structural models. We have developed a Bayesian approach for determining the coordinates of atoms in a three-dimensional space. Our algorithm takes as input a set of probabilistic constraints on the coordinates of the atoms, and an a priori distribution for each atom location. The output is a maximum a posteriori (MAP) estimate of the location of each atom. We introduce constraints as updates to the prior distributions. In this paper, we describe the algorithm and show its performance on three data sets. The first data set is synthetic and illustrates the convergence properties of the method. The other data sets comprise real biological data for a protein (the trp repressor molecule) and a nucleic acid (the transfer RNA fold). Finally, we describe how we have begun to extend the algorithm to make it suitable for non-Gaussian constraints.
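As an illustration of "constraints as updates to the prior distributions", the following is a minimal sketch of one Gaussian conditioning step. It is not the authors' algorithm: it assumes a Gaussian prior over the coordinates and a single scalar, linear constraint with Gaussian noise, in which case the posterior mean coincides with the MAP estimate. The function name `gaussian_update` and the numbers in the example are hypothetical.

```python
def gaussian_update(mean, cov, h, z, r):
    """Condition a Gaussian N(mean, cov) on one scalar linear constraint.

    The constraint is h . x = z + noise, with noise ~ N(0, r).
    Assumption for illustration only: constraints are linear and Gaussian.
    Returns the posterior mean (also the MAP estimate) and covariance.
    """
    n = len(mean)
    # cov @ h: how each coordinate co-varies with the constrained quantity
    ch = [sum(cov[i][j] * h[j] for j in range(n)) for i in range(n)]
    # Variance of the predicted constraint value plus measurement noise
    s = sum(h[i] * ch[i] for i in range(n)) + r
    gain = [c / s for c in ch]
    # Difference between the observed and the predicted constraint value
    residual = z - sum(h[i] * mean[i] for i in range(n))
    new_mean = [mean[i] + gain[i] * residual for i in range(n)]
    new_cov = [[cov[i][j] - gain[i] * ch[j] for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov


# Hypothetical example: x-coordinates of two atoms, independent unit-variance
# priors, and a noisy constraint that their separation x2 - x1 is about 3.
prior_mean = [0.0, 1.0]
prior_cov = [[1.0, 0.0], [0.0, 1.0]]
post_mean, post_cov = gaussian_update(prior_mean, prior_cov,
                                      h=[-1.0, 1.0], z=3.0, r=2.0)
print(post_mean)  # the atoms are pushed apart toward the observed separation
print(post_cov)   # variances shrink and the two coordinates become correlated
```

Applying such updates one constraint at a time narrows the distribution for well-constrained atoms while leaving poorly constrained atoms with large variances, which is the kind of structural uncertainty the abstract refers to.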
TL;DR: The prototype of a software system that provides an environment for the analysis of large-scale sequence data and an overview of the knowledge-based models used to build this integrated system is presented.
Abstract: Analysis of the huge volumes of data generated by large scale sequencing projects clearly requires the construction of new sophisticated computer systems. These systems should be able to handle the biological data as well as the results of the analysis of this data. They should also help the user to choose the most appropriate method for a simple task and to string together the methods needed to solve a global analysis task. In this paper we present the prototype of a software system that provides an environment for the analysis of large-scale sequence data. In a first approach this environment has been put to the test within the B. subtilis sequencing project. This system integrates both a descriptive knowledge of the entities involved (genes, regulatory signals etc.) and the methodological knowledge concerning an extendable set of analytical methods (i.e. how to solve a sequence analysis problem through task decomposition and method selection). A knowledge representation based on two existing object-oriented models, named Shirka and SCARP, is used to implement this integrated system. In addition, the present prototype provides a suitable user interface for both displaying the results generated by several methods and interacting with the objects. We present in this paper an overview of the knowledge-based models used to build this integrated system, and a description of the way in which biological entities and sequence analysis tasks are represented. We give illustrations of the co-operation between user and system during the problem solving process. Such a system constitutes a computer workbench for molecular biologists studying the genetic programs of living organisms.
TL;DR: It is concluded that object-oriented technology will support semantic checking even in a complex domain like biology, and 10 guidelines for future work are proposed, including ways of treating exceptional cases and 'positioning' of constraints in a schema.
Abstract: Issues critical to ensuring semantic integrity in molecular biological data collections have been identified and include complexity, exceptions, missing data, changing models, holism and integration, delocalized data, interoperability and nomenclature. This combination is peculiar to biology and presents some interesting problems as a result. Little is known about semantic checking in object-oriented databases in general, but because such technology appears highly suitable for modeling biological data, it is appropriate to examine the ways in which object-oriented technology can support this functionality. It is concluded that object-oriented technology will support semantic checking even in a complex domain like biology. We propose 10 guidelines for future work including ways of treating exceptional cases and 'positioning' of constraints in a schema.
TL;DR: The reader is introduced to the general properties of DNA and the currently used methods for manipulating and studying DNA molecules, with emphasis on how DNA properties allow parallel strategies to be used for direct biological experiments and with speculation on possible future applications that go beyond purely biological ones.
Abstract: Potential analogies exist between the way computers process and analyze data and the way data is handled in biological systems. Biological data is predominantly in the form of DNA. DNA data is at least two-fold redundant, and it is often multi-fold redundant. Thus it is relatively error resistant. A key aspect of DNA data is that the sequence of DNA bases which is the information stored in DNA also provides a way for the specific purification of DNA subsets. Thus, DNA, in principle, can be handled as very complex mixtures of species with the ability to sort things out afterwards. As a result, DNA-based manipulations can sometimes be formulated into highly parallel strategies. In this paper we will introduce the reader to the general properties of DNA and the currently used methods for manipulating and studying DNA molecules. We will emphasize how DNA properties allow parallel strategies to be used for direct biological experiments, and we will also speculate on possible future applications of such strategies which go beyond pure biological applications.
TL;DR: The biology of adult body size, its behavior as a variable in statistical analyses, and strategies for the incorporation of this variable into demographic models of population aging in the United States are discussed.
Abstract: The accuracy of demographic models designed to project future trends of population-level health and disease can be improved by incorporating biological data. One barrier to this process is the quantitative characteristics of the data themselves. Biological data are characteristically time-dependent phenomena that behave in a nonlinear fashion. To develop accurate projections of the morbidity, disability, and mortality experience among future cohorts in late life, research needs to focus on development of models that create the opportunity to distinguish all-or-none, boundaries, and latency aspects of biological factors driving demographic phenomena, development of methods to identify time-dependent effects, and development of genetically informative samples. This presentation focuses on the biology of adult body size, its behavior as a variable in statistical analyses, and strategies for the incorporation of this variable into demographic models of population aging in the United States. First, several examples of generally observed quantitative characteristics of biological variables are reviewed. To illustrate the nonlinear character of biological data, three general patterns of change with aging are presented. Next, issues concerning the measurement of body size are discussed. Scenarios describing body size over the adult life span are described. By the end of this process, recommendations for starting a dialogue between researchers interested in biological endpoints (individual weight change, disease risk) and those interested in demographic outcomes (population-level disease and disability issues) using body size will be presented.
TL;DR: Many systems in nature, including biological systems, have very complex dynamics which generate random-looking time series, and it is often of interest to determine whether the system is caused by deterministic subsystems (e.g. chaotic systems), stochastic subsystems, or both.
Abstract: Many systems in nature, including biological systems, have very complex dynamics which generate random-looking time series. To better understand a particular dynamical system, it is often of interest to determine whether the system is caused by deterministic subsystems (e.g. chaotic systems), stochastic subsystems, or both. Although there are now several different approaches to determine this from time series data (e.g. correlation dimension and Lyapunov exponent calculations), these methods often require large amounts of stationary data (biological data is frequently nonstationary for long time scales), can often mis-identify certain systems, and can be subject to other technical problems. Alternatively, one can use methods that measure the complexity in a particular system and that seldom make assumptions about the system, such as assuming the presence of stationarity. Additionally, mathematical and computational modeling techniques can be used to test different hypotheses about the dynamics of biological systems.
TL;DR: The spatial indexing features of the Illustra(tm) object-relational database management system are linked with the visualization capabilities of AVS to create an interactive environment for analysis of SEA data.
Abstract: As part of a large effort evaluating the effect of the Exxon Valdez oil spill, we are using the spatial selection features of an object-relational database management system to support the visualization of the ecological data. The effort, called the Sound Ecosystem Assessment project (SEA), is collecting and analyzing oceanographic and biological data from Prince William Sound in Alaska. To support visualization of the SEA data we are building a data management system which includes a spatial index over a bounding polygon for all of the datasets which are collected. In addition to other selection criteria the prototype provides several methods for selecting data within an arbitrary region. This case study presents the requirements and the implementation for the application prototype which combines visualization and database technology. The spatial indexing features of the Illustra(tm) object-relational database management system are linked with the visualization capabilities of AVS to create an interactive environment for analysis of SEA data.
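The idea of selecting data within an arbitrary region can be sketched with a standard ray-casting point-in-polygon test. This is only an illustration of the general technique, not the Illustra spatial index or the SEA prototype's implementation; the function names, the sample records, and the coordinates are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon (ray-casting test).

    `polygon` is a list of (x, y) vertices in order. A horizontal ray is
    cast from the point toward +x; an odd number of edge crossings means
    the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def select_in_region(samples, polygon):
    """Keep the samples whose position falls inside the query polygon."""
    return [s for s in samples if point_in_polygon(s["x"], s["y"], polygon)]


# Hypothetical query region (a triangle) and made-up sample positions.
region = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
samples = [
    {"id": "a", "x": 1.0, "y": 1.0},
    {"id": "b", "x": 3.0, "y": 3.0},
    {"id": "c", "x": 5.0, "y": 0.5},
]
print([s["id"] for s in select_in_region(samples, region)])  # → ['a']
```

A spatial index over bounding polygons, as described in the abstract, would typically serve as a coarse filter that avoids running an exact test like this against every record.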
TL;DR: Techniques for querying and transforming scientific data held in structured files of various formats are presented, and their use is illustrated in a prototype system developed in conjunction with the Human Genome Center for Chromosome 22.
Abstract: Scientific data of importance to biologists in the Human Genome Project resides not only in conventional databases, but in structured files maintained in a number of different formats (e.g. ASN.1 and ACE) as well as sequence analysis packages (e.g. BLAST and FASTA). These formats and packages contain a number of data types not found in conventional databases, such as lists and variants, and may be deeply nested. We present in this paper techniques for querying and transforming such data, and illustrate their use in a prototype system developed in conjunction with the Human Genome Center for Chromosome 22. We also describe optimizations performed by the system, a crucial issue for bulk data.