TL;DR: This chapter focuses on software tools that assist in visual exploration and analysis of biological networks, exemplified using the widely used Cytoscape tool.
Abstract: Substantial progress has been made in the field of "omics" research (e.g., Genomics, Transcriptomics, Proteomics, and Metabolomics), leading to a vast amount of biological data. In order to represent large biological data sets in an easily interpretable manner, this information is frequently visualized as graphs, i.e., a set of nodes and edges. Nodes are representations of biological molecules, and edges connect the nodes, depicting some kind of relationship. Obviously, there is a high demand for computer-based assistance for both visualization and analysis of biological data, which are often heterogeneous and retrieved from different sources. This chapter focuses on software tools that assist in visual exploration and analysis of biological networks. Global requirements for such programs are discussed. Utilization of visualization software is exemplified using the widely used Cytoscape tool. Additional information about the use of Cytoscape is provided in the Notes section. Furthermore, special features of alternative software tools are highlighted in order to assist researchers in the choice of an adequate program for their specific requirements.
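The node-and-edge representation described in this abstract can be sketched in a few lines. This is a minimal illustration only, not how Cytoscape stores networks; the molecule names, the relation labels, and the helper names `build_network` and `degree` are hypothetical.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (node_a, node_b, relation) triples."""
    adjacency = defaultdict(dict)
    for a, b, relation in interactions:
        # Store the relationship on both endpoints so the edge is undirected.
        adjacency[a][b] = relation
        adjacency[b][a] = relation
    return dict(adjacency)


def degree(adjacency):
    """Number of neighbors per node, a basic measure used in network analysis."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


# Hypothetical toy data: nodes are molecules, edges carry a relationship label.
interactions = [
    ("protein_A", "protein_B", "binds"),
    ("protein_A", "gene_C", "activates"),
    ("protein_B", "protein_D", "binds"),
]
network = build_network(interactions)
degrees = degree(network)
```

Here `degrees` maps each molecule to its number of interaction partners, the kind of per-node attribute that visualization tools typically map to node size or color.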
TL;DR: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework and has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets.
Abstract: Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits. A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework. sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
TL;DR: A method is introduced that extracts information from high-throughput microarray data and, by using topology, provides greater depth of information than current analytic techniques and identifies a unique subgroup of Estrogen Receptor-positive breast cancers that express high levels of c-MYB and low levels of innate inflammatory genes.
Abstract: High-throughput biological data, whether generated as sequencing, transcriptional microarrays, proteomic, or other means, continues to require analytic methods that address its high dimensional aspects. Because the computational part of data analysis ultimately identifies shape characteristics in the organization of data sets, the mathematics of shape recognition in high dimensions continues to be a crucial part of data analysis. This article introduces a method that extracts information from high-throughput microarray data and, by using topology, provides greater depth of information than current analytic techniques. The method, termed Progression Analysis of Disease (PAD), first identifies robust aspects of cluster analysis, then goes deeper to find a multitude of biologically meaningful shape characteristics in these data. Additionally, because PAD incorporates a visualization tool, it provides a simple picture or graph that can be used to further explore these data. Although PAD can be applied to a wide range of high-throughput data types, it is used here as an example to analyze breast cancer transcriptional data. This identified a unique subgroup of Estrogen Receptor-positive (ER+) breast cancers that express high levels of c-MYB and low levels of innate inflammatory genes. These patients exhibit 100% survival and no metastasis. No supervised step beyond distinction between tumor and healthy patients was used to identify this subtype. The group has a clear and distinct, statistically significant molecular signature, it highlights coherent biology but is invisible to cluster methods, and does not fit into the accepted classification of Luminal A/B, Normal-like subtypes of ER+ breast cancers. We denote the group as c-MYB+ breast cancer.
TL;DR: The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function.
Abstract: In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. 
The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.
TL;DR: The BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases and proved that large-scale projects involving next generation sequencing data can be managed efficiently in a distributed environment.
Abstract: Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems.
The BioMart project (www.biomart.org) was initiated to address these challenges. The BioMart software is based on two fundamental concepts: data agnostic modelling and data federation. Data agnostic modelling simplifies the difficult and time-consuming task of data modelling. In BioMart, this is achieved by using a predefined, query-optimized relational schema that can be used to represent any kind of data (1). Data federation makes it possible to organize multiple, disparate and distributed database systems into what appears to be a single integrated virtual database. It therefore allows users to access and cross reference data from these data sources using a single user interface, without the need for database administrators to physically collate the data in one location.
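The idea of data federation described above can be illustrated with a small sketch. It is not BioMart's API or implementation: the `DataSource` class, the `federated_query` function, and all records below are hypothetical stand-ins. The sketch only shows the principle that each source keeps its own data and a single query interface merges the answers.

```python
class DataSource:
    """Hypothetical stand-in for one independently maintained database."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # maps an identifier to a dict of attributes

    def query(self, identifiers):
        """Return only the requested records that this source actually holds."""
        return {i: self._records[i] for i in identifiers if i in self._records}


def federated_query(sources, identifiers):
    """Ask every source separately and merge the answers per identifier.

    No data is copied into a central store; attribute names are prefixed
    with the source name so their origin stays visible.
    """
    merged = {identifier: {} for identifier in identifiers}
    for source in sources:
        for identifier, attributes in source.query(identifiers).items():
            for key, value in attributes.items():
                merged[identifier][f"{source.name}.{key}"] = value
    return merged


# Hypothetical toy sources and records.
expression = DataSource("expression", {"gene1": {"level": 8.2}, "gene2": {"level": 1.5}})
pathways = DataSource("pathways", {"gene1": {"pathway": "hypothetical pathway X"}})
result = federated_query([expression, pathways], ["gene1", "gene2"])
```

The caller sees one combined record per identifier, while each source remains responsible only for its own data.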
Using these fundamental concepts, the BioMart project has driven a change in the biological data management paradigm, where individual biological databases are managed by different custom built systems. To give more control to both the users and the data providers, a new, innovative solution was required. BioMart started by adapting data warehousing ideas to create one universal software system for biological data management and empower biologists with the ability to create complex, customized datasets through a web interface without the need for bioinformatics support (1). It subsequently introduced a new innovative way of creating large multi-database repositories that avoid the need to store all the data in a single location (2), and finally it proved that large-scale projects involving next generation sequencing data can be managed efficiently in a distributed environment (3).
BioMart has successfully adapted data warehousing ideas such as data marts, dimensional modelling (4), and query optimization into the world of biological databases (5–13). BioMart's ability to quickly deploy a website hosting any type of data, user-friendly graphical user interface, several programmatic interfaces and support for third party tools contributed to its success and adoption by many different types of projects around the world as their data management platform (14). During the 10 years of its existence, BioMart has grown from humble beginnings as a ‘data mining extension’ for the Ensembl website (1), to become an international collaboration involving a large number of different organizations located on five continents: Asia, Australia, Europe, North America and South America (3,15). It has a large community of users and developers and it has been successfully used in both academia and industry. The latest version of the BioMart software has been significantly enhanced with numerous graphical user interfaces that are tailored to different user groups. In addition, it has been further improved by parallel query processing; it is now extensible with different analysis tools, and the installation process can be effortlessly completed with just a few mouse clicks (16).
Building on the wealth of information that has become accessible through the BioMart interface, the BioMart Central Portal (15) has introduced an innovative alternative to the large data stores maintained by specialized organizations such as The European Bioinformatics Institute (EBI) or The National Center for Biotechnology Information (NCBI). BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases. All development and maintenance of individual databases is left to the individual data providers, making it a very cost-effective approach. The groups that maintain individual sources can do so at their own location without the necessity of any data exchange procedures. In addition, they can draw on the wealth of information available through the portal to expose their data in the context of third party annotations. The BioMart Central Portal approach is very democratic: everyone can join or remove their data source at any time. BioMart Central Portal is effectively a ‘Virtual Bioinformatics Institute’ with no onsite personnel, minimal administration, and a very ‘green’ footprint.
More recently, the International Cancer Genome Consortium (ICGC) Data Portal has demonstrated how BioMart can scale to manage large collaborative projects involving next generation sequencing data (3). The ICGC is generating data on an unprecedented scale by sequencing 500 cancer genomes and matched normal control genomes for 50 different cancer types (17). The effort is distributed between multiple participating countries and sequencing centres. Given the scale of the effort, moving all of the data to a single location is impractical. Instead, the ICGC Data Portal relies on BioMart data federation. By replicating and distributing the data model across different centres that produce the same type of data according to the same recipe, the scalability of the effort is greatly improved. Each centre is only responsible for managing their own data while data access to all of the consortium data is managed by the BioMart software. This presents a scalable approach, not only in the traditional sense of parallelizing data processing and storage, but also in a more general sense of outsourcing the external annotation expertise by federating annotations from additional, independently-maintained databases that are available in the BioMart Central Portal.
The future developments for BioMart involve specialized ‘pre-packaged’ and reusable data portals. One example already in development is the OncoPortal, aimed at researchers managing cancer data. It will include preconfigured access to sources of annotations that are useful for cancer research such as Ensembl (5), Reactome (12), COSMIC (9), Pancreatic Expression Database (10) and others. It will also include a set of tools that are specifically designed for cancer data analysis. There are plans to build other preconfigured portals for different research areas, such as a mouse portal and a model organism portal. It is an ambition of the BioMart community that the BioMart project remains at the forefront of innovative solutions for biological data management in the years to come. By creating these specialized solutions and further reducing the barriers to entry, the aim is to encourage more groups to share their data through BioMart, thereby further enhancing the entire BioMart community.
TL;DR: Gitools is presented, an open-source tool to perform analyses and visualize data and results as interactive heat-maps, which facilitate the integration of novel data with previous knowledge in genomics.
Abstract: Intuitive visualization of data and results is very important in genomics, especially when many conditions are to be analyzed and compared. Heat-maps have proven very useful for the representation of biological data. Here we present Gitools (http://www.gitools.org), an open-source tool to perform analyses and visualize data and results as interactive heat-maps. Gitools contains data import systems from several sources (i.e. IntOGen, Biomart, KEGG, Gene Ontology), which facilitate the integration of novel data with previous knowledge.
TL;DR: Thallinger et al. present a novel approach for the quantitation of lipids in LC-MS data, which obtains its analytical power from two major innovations: (i) a 3D algorithm that confines the peak borders in m/z and time direction and (ii) the use of the theoretical isotopic distribution of an analyte as selection/exclusion criterion.
Abstract: Motivation: The accurate measurement of the lipidome permits insights into physiological and pathological processes. Of the present high-throughput technologies, LC-MS especially bears potential of monitoring quantitative changes in hundreds of lipids simultaneously. In order to extract valuable information from huge amount of mass spectrometry data, the aid of automated, reliable, highly sensitive and specific analysis algorithms is indispensable.
Results: We present here a novel approach for the quantitation of lipids in LC-MS data. The new algorithm obtains its analytical power by two major innovations: (i) a 3D algorithm that confines the peak borders in m/z and time direction and (ii) the use of the theoretical isotopic distribution of an analyte as selection/exclusion criterion. The algorithm is integrated in the Lipid Data Analyzer (LDA) application which additionally provides standardization, a statistics module for results analysis, a batch mode for unattended analysis of several runs and a 3D viewer for the manual verification. The statistics module offers sample grouping, tests between sample groups and export functionalities, where the results are visualized by heat maps and bar charts. The presented algorithm has been applied to data from a controlled experiment and to biological data, containing analytes distributed over an intensity range of 10^6. Our approach shows improved sensitivity and an extremely high positive predictive value compared with existing methods. Consequently, the novel algorithm, integrated in a user-friendly application, is a valuable improvement in the high-throughput analysis of the lipidome.
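To make the second innovation concrete, the sketch below shows one way a theoretical isotopic distribution could serve as a selection/exclusion criterion. It is not the LDA algorithm: it assumes only carbon-13 contributes to the isotope pattern (a simplification), uses an approximate natural 13C abundance of 1.07%, and the function names and the 20% tolerance are hypothetical choices.

```python
from math import comb

# Approximate natural abundance of carbon-13; other elements are ignored here.
C13_ABUNDANCE = 0.0107


def carbon_isotope_pattern(n_carbons, n_peaks=3):
    """Relative intensities of the M+0, M+1, ... peaks, scaled so M+0 equals 1.

    The number of 13C atoms in a molecule with n_carbons carbons follows a
    binomial distribution.
    """
    p = C13_ABUNDANCE
    probabilities = [
        comb(n_carbons, k) * p**k * (1 - p) ** (n_carbons - k)
        for k in range(n_peaks)
    ]
    return [value / probabilities[0] for value in probabilities]


def matches_pattern(observed, n_carbons, tolerance=0.2):
    """Accept a candidate only if each observed peak ratio is within
    `tolerance` (relative) of the theoretical ratio."""
    expected = carbon_isotope_pattern(n_carbons, len(observed))
    base = observed[0]
    return all(
        abs(intensity / base - ratio) <= tolerance * ratio
        for intensity, ratio in zip(observed, expected)
    )


pattern = carbon_isotope_pattern(50)
accepted = matches_pattern([100.0, 54.0, 14.0], n_carbons=50)
rejected = matches_pattern([100.0, 5.0, 1.0], n_carbons=50)
```

A candidate whose isotope peaks follow the expected ratios is kept, while one with a far too small M+1 peak is excluded as a likely false positive.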
Implementation and availability: The Java application is freely available for non-commercial users at http://genome.tugraz.at/lda. Raw data associated with this manuscript may be downloaded from ProteomeCommons.org Tranche using the following hash: ZBh3nS5bXk6I/Vn32tB5Vh0qnMpVIW71HByFFQqM0RmdF4/4Hcn H3Wggh9kU2teYVOtM1JWwHIeMHqSS/bc2yYNFmyUAAAAAAACl DQ ==
Contact: Gerhard.Thallinger@tugraz.at
Supplementary information: Supplementary data are available from Bioinformatics online.
TL;DR: This review presents an overview of selected software tools for preprocessing LC–MS based metabolomics data and tries to provide future directions.
TL;DR: openBIS is an open source software framework for constructing user-friendly, scalable and powerful information systems for data and metadata acquired in biological experiments that can be extended and customized for different data types acquired by a range of technologies.
Abstract: Modern data generation techniques used in distributed systems biology research projects often create datasets of enormous size and diversity. We argue that in order to overcome the challenge of managing those large quantitative datasets and maximise the biological information extracted from them, a sound information system is required. Ease of integration with data analysis pipelines and other computational tools is a key requirement for it. We have developed openBIS, an open source software framework for constructing user-friendly, scalable and powerful information systems for data and metadata acquired in biological experiments. openBIS enables users to collect, integrate, share, publish data and to connect to data processing pipelines. This framework can be extended and has been customized for different data types acquired by a range of technologies. openBIS is currently being used by several SystemsX.ch and EU projects applying mass spectrometric measurements of metabolites and proteins, High Content Screening, or Next Generation Sequencing technologies. The attributes that make it interesting to a large research community involved in systems biology projects include versatility, simplicity in deployment, scalability to very large data, flexibility to handle any biological data type and extensibility to the needs of any research domain.
TL;DR: In this paper, a nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred.
Abstract: A nonparametric Bayesian extension of Factor Analysis (FA) is proposed where observed data Y is modeled as a linear superposition, G, of a potentially infinite number of hidden factors, X. The Indian Buffet Process (IBP) is used as a prior on G to incorporate sparsity and to allow the number of latent features to be inferred. The model’s utility for modeling gene expression data is investigated using randomly generated data sets based on a known sparse connectivity matrix for E. coli, and on three biological data sets of increasing complexity.
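The model described in this abstract can be written compactly. The following is a sketch in one common notation for sparse factor models with an IBP prior, not necessarily the authors' exact formulation: the noise term E, the binary mask Z, the weight matrix W, and the concentration parameter α are assumed symbols that the abstract does not name.

```latex
\begin{aligned}
Y &= G X + E, \\
G &= Z \odot W, \qquad Z \sim \mathrm{IBP}(\alpha)
\end{aligned}
```

Under this reading, a zero entry of Z switches the corresponding loading in G off, which gives the sparsity, and the number of columns of Z with any nonzero entry gives the inferred number of latent factors.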
TL;DR: An adaptive approach to managing experimental data based on semantically typed data hypercubes (SDCubes) that combine hierarchical data format 5 (HDF5) and extensible markup language (XML) file types is described.
Abstract: Whereas genomic data are universally machine-readable, data from imaging, multiplex biochemistry, flow cytometry and other cell- and tissue-based assays usually reside in loosely organized files of poorly documented provenance. This arises because the relational databases used in genomic research are difficult to adapt to rapidly evolving experimental designs, data formats and analytic algorithms. Here we describe an adaptive approach to managing experimental data based on semantically typed data hypercubes (SDCubes) that combine hierarchical data format 5 (HDF5) and extensible markup language (XML) file types. We demonstrate the application of SDCube-based storage using ImageRail, a software package for high-throughput microscopy. Experimental design and its day-to-day evolution, not rigid standards, determine how ImageRail data are organized in SDCubes. We applied ImageRail to collect and analyze drug dose-response landscapes in human cell lines at single-cell resolution.
TL;DR: Angular histogram and attribute curves offer an intuitive way for the user to explore the clustering, linear correlations and outliers in large data sets without the over-plotting and clutter problems associated with traditional parallel coordinates.
Abstract: Parallel coordinates is a popular and well-known multivariate data visualization technique. However, one of their inherent limitations has to do with the rendering of very large data sets. This often causes an overplotting problem and the goal of the visual information seeking mantra is hampered because of a cluttered overview and non-interactive update rates. In this paper, we propose two novel solutions, namely, angular histograms and attribute curves. These techniques are frequency-based approaches to large, high-dimensional data visualization. They are able to convey both the density of underlying polylines and their slopes. Angular histogram and attribute curves offer an intuitive way for the user to explore the clustering, linear correlations and outliers in large data sets without the over-plotting and clutter problems associated with traditional parallel coordinates. We demonstrate the results on a wide variety of data sets including real-world, high-dimensional biological data. Finally, we compare our methods with the other popular frequency-based algorithms.
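The frequency-based idea can be illustrated with a simplified sketch: instead of drawing every polyline between two adjacent axes, count the line segments by slope angle. This is not the authors' angular histogram, which also bins by position along the axis. The function names, the unit spacing between axes, and the choice of five bins are assumptions made for illustration.

```python
import math


def normalize(column):
    """Scale values to [0, 1], as is usual for a parallel-coordinates axis."""
    low, high = min(column), max(column)
    span = high - low
    return [0.5 if span == 0 else (value - low) / span for value in column]


def angle_histogram(left, right, bins=5):
    """Count the line segments between two adjacent axes by slope angle.

    With axes one unit apart and values in [0, 1], each angle lies in
    [-pi/4, pi/4]. Mass in the middle bin means near-horizontal lines
    (positive correlation); mass spread to the outer bins means crossing lines.
    """
    counts = [0] * bins
    for a, b in zip(normalize(left), normalize(right)):
        angle = math.atan2(b - a, 1.0)
        fraction = (angle + math.pi / 4) / (math.pi / 2)
        index = max(0, min(int(fraction * bins), bins - 1))
        counts[index] += 1
    return counts


parallel_lines = angle_histogram([0, 1, 2, 3], [0, 1, 2, 3])
crossing_lines = angle_histogram([0, 1, 2, 3], [3, 2, 1, 0])
```

Because only bin counts are drawn, the rendering cost depends on the number of bins rather than on the number of records, which is what avoids over-plotting.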
TL;DR: Five algorithms are developed for efficient detection of biological network motifs, and several evaluation measures including motifs included in complex, motifs included in functional module and GO term clustering score are introduced in this paper.
Abstract: Background: Molecular level of biological data can be constructed into system level of data as biological networks. Network motifs are defined as over-represented small connected subgraphs in networks and they have been used for many biological applications. Since network motif discovery involves computationally challenging processes, previous algorithms have focused on computational efficiency. However, we believe that the biological quality of network motifs is also very important. Results: We define biological network motifs as biologically significant subgraphs and traditional network motifs are differentiated as structural network motifs in this paper. We develop five algorithms, namely, EDGEGO-BNM, EDGEBETWEENNESS-BNM, NMF-BNM, NMFGO-BNM and VOLTAGE-BNM, for efficient detection of biological network motifs, and introduce several evaluation measures including motifs included in complex, motifs included in functional module and GO term clustering score in this paper. Experimental results show that EDGEGO-BNM and EDGEBETWEENNESS-BNM perform better than existing algorithms and all of our algorithms are applicable to find structural network motifs as well. Conclusion: We provide new approaches to finding network motifs in biological networks. Our algorithms efficiently detect biological network motifs and further improve existing algorithms to find high quality structural network motifs, which would be impossible using existing algorithms. The performances of the algorithms are compared based on our new evaluation measures in biological contexts. We believe that our work gives some guidelines of network motifs research for the biological networks.
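The definition of a structural network motif as an over-represented small connected subgraph can be sketched for the simplest case, the triangle. This is not one of the five algorithms named above. It uses a plain random-graph null model with the same number of nodes and edges, which is an assumption; the function names and the toy network are hypothetical.

```python
import itertools
import random
import statistics


def unique_edges(edges):
    """Drop self-loops and duplicates; treat edges as undirected."""
    return {tuple(sorted(edge)) for edge in edges if edge[0] != edge[1]}


def count_triangles(edges):
    """Count triangles: each one is seen once from each of its three edges."""
    edge_set = unique_edges(edges)
    neighbors = {}
    for a, b in edge_set:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    shared = sum(len(neighbors[a] & neighbors[b]) for a, b in edge_set)
    return shared // 3


def triangle_enrichment(edges, n_random=200, seed=0):
    """Compare the observed triangle count with random graphs that have the
    same nodes and the same number of edges."""
    edge_set = unique_edges(edges)
    nodes = sorted({node for edge in edge_set for node in edge})
    possible = list(itertools.combinations(nodes, 2))
    rng = random.Random(seed)
    observed = count_triangles(edge_set)
    null_counts = [
        count_triangles(rng.sample(possible, len(edge_set)))
        for _ in range(n_random)
    ]
    mean = statistics.mean(null_counts)
    spread = statistics.pstdev(null_counts)
    z_score = (observed - mean) / spread if spread > 0 else float("nan")
    return observed, mean, z_score


# Hypothetical toy network: two triangles sharing node "c", plus one extra edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"), ("d", "e"), ("c", "e"), ("e", "f")]
observed, null_mean, z_score = triangle_enrichment(toy_edges)
```

A clearly positive `z_score` would mark the triangle as over-represented. Judging biological significance, as the abstract proposes, would additionally require annotation such as GO terms, which this sketch does not cover.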
TL;DR: The ePlant framework is a suite of open-source world wide web-based tools for the visualization of large-scale data sets from the model organism Arabidopsis thaliana, and can be applied to any model organism.
Abstract: Visualization tools for biological data are often limited in their ability to interactively integrate data at multiple scales. These computational tools are also typically limited by two-dimensional displays and programmatic implementations that require separate configurations for each of the user's computing devices and recompilation for functional expansion. Towards overcoming these limitations we have developed “ePlant” (http://bar.utoronto.ca/eplant) – a suite of open-source world wide web-based tools for the visualization of large-scale data sets from the model organism Arabidopsis thaliana. These tools display data spanning multiple biological scales on interactive three-dimensional models. Currently, ePlant consists of the following modules: a sequence conservation explorer that includes homology relationships and single nucleotide polymorphism data, a protein structure model explorer, a molecular interaction network explorer, a gene product subcellular localization explorer, and a gene expression pattern explorer. The ePlant's protein structure explorer module represents experimentally determined and theoretical structures covering >70% of the Arabidopsis proteome. The ePlant framework is accessed entirely through a web browser, and is therefore platform-independent. It can be applied to any model organism. To facilitate the development of three-dimensional displays of biological data on the world wide web we have established the “3D Data Display Initiative” (http://3ddi.org).
TL;DR: The SEEK is promoted as a data and model management tool that can be adapted to the specific needs of a particular systems biology project and the methods employed for lowering the barriers to adoption of standard formats are presented.
Abstract: Systems biology research is typically performed by multidisciplinary groups of scientists, often in large consortia and in distributed locations. The data generated in these projects tend to be heterogeneous and often involve high-throughput “omics” analyses. Models are developed iteratively from data generated in the projects and from the literature. Consequently, there is a growing requirement for exchanging experimental data, mathematical models, and scientific protocols between consortium members and a necessity to record and share the outcomes of experiments and the links between data and models. The overall output of a research consortium is also a valuable commodity in its own right. The research and associated data and models should eventually be available to the whole community for reuse and future analysis. The SEEK is an open-source, Web-based platform designed for the management and exchange of systems biology data and models. The SEEK was originally developed for the SysMO (systems biology of microorganisms) consortia, but the principles and objectives are applicable to any systems biology project. The SEEK provides an index of consortium resources and acts as gateway to other tools and services commonly used in the community. For example, the model simulation tool, JWS Online, has been integrated into the SEEK, and a plug-in to PubMed allows publications to be linked to supporting data and author profiles in the SEEK. The SEEK is a pragmatic solution to data management which encourages, but does not force, researchers to share and disseminate their data in community standard formats. It provides tools to assist with management and annotation as well as incentives and added value for following these recommendations. Data exchange and reuse rely on sufficient annotation, consistent metadata descriptions, and the use of standard exchange formats for models, data, and the experiments they are derived from.
In this chapter, we present the SEEK platform, its functionalities, and the methods employed for lowering the barriers to adoption of standard formats. As the production of biological data continues to grow, in systems biology and in the life sciences in general, the need to record, manage, and exploit this wealth of information in the future is increasing. We promote the SEEK as a data and model management tool that can be adapted to the specific needs of a particular systems biology project.
TL;DR: In this paper, a combination of biological, sedimentary and acoustic data was used to analyze and map near-shore benthic biotopes in the 2–30 m depth range.
Abstract: Near-shore benthic biotopes, in the 2–30 m depth range, were analyzed and mapped using a combination of biological, sedimentary and acoustic data to interrogate the utility of these methods. The surveyed area covered approximately 270 km², along 80 km of the Southeast coast of Portugal. The acoustic data were acquired with a single beam ground discrimination system (QTC VIEW, Series V), connected to a dual frequency echo sounder (50 and 200 kHz). Sediment grain size and macrofaunal community data were obtained by grab sampling at 88 sites. The sedimentary and the biological data were subjected to classification and ordination analysis and the acoustic data obtained with the two frequencies were analyzed individually with the QTC IMPACT (v3.4) software and classified into acoustic classes. The affinity groups obtained with the three datasets were mapped using a Geographic Information System. The study area showed an inshore-offshore pattern with sands ranging from very fine to very coarse, revealing a sediment distribution in the near-shore shelf comprising two main areas: a very fine/fine sand area up to 7–10 m depth and a very coarse/coarse sand area offshore. In turn, the biological data also showed an inshore-offshore pattern, identifying two main biological assemblages: fine/very fine sand and coarse/very coarse sand communities. Furthermore, the biological pattern clearly reflected the hydrodynamic conditions of the area, from highly hydrodynamic shallow depth to the calmer deeper areas. Accordingly, species richness and densities were lower at shallow areas than at deeper depth. The acoustic data identified the prevailing biosedimentary gradients along the inshore-offshore direction. Overall, the results obtained here showed that the acoustic system could identify two main areas based on sediment grain size, which closely related to the two main biological communities (groups A and B).
This study indicates the importance of combining several layers of information in order to increase the spatial resolution of the main biotope distribution (validating the acoustic data) and detail their sub-divisions (ground-truth sampling). Furthermore, we show that designing ground-truth samples on the basis of the acoustic diversity data allows the biological data to “speak” for itself. In general, acoustic techniques are used as a means to optimize ground-truth sampling. This will only be valid if acoustics can capture all the essence of biotope heterogeneity. If this is not the case, special attention should be given when using broad scale methods (acoustic remote sensing) devoted to biotope mapping.
TL;DR: The confounder correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity and gender, and phenotype prediction in the presence of population structure and outperforms state-of-the-art methods in terms of prediction accuracy.
Abstract: Motivation: Classifying biological data into different groups is a central task of bioinformatics: for instance, to predict the function of a gene or protein, the disease state of a patient or the phenotype of an individual based on its genotype. Support Vector Machines are a widespread approach for classifying biological data, due to their high accuracy, their ability to deal with structured data such as strings, and the ease of integrating various types of data. However, it is unclear how to correct for confounding factors such as population structure, age, gender or experimental conditions in Support Vector Machine classification.
Results: In this article, we present a Support Vector Machine classifier that can correct the prediction for observed confounding factors. This is achieved by minimizing the statistical dependence between the classifier and the confounding factors. We prove that this formulation can be transformed into a standard Support Vector Machine with rescaled input data. In our experiments, our confounder correcting SVM (ccSVM) improves tumor diagnosis based on samples from different labs, tuberculosis diagnosis in patients of varying age, ethnicity and gender, and phenotype prediction in the presence of population structure and outperforms state-of-the-art methods in terms of prediction accuracy.
Availability: A ccSVM-implementation in MATLAB is available from http://webdav.tuebingen.mpg.de/u/karsten/Forschung/ISMB11_ccSVM/.
Contact: ed.gpm.negnibeut@il.nimil; ed.gpm.negnibeut@tdrawgrob.netsrak
TL;DR: This paper proposes a heuristic algorithm that can find planted (l, d)-signals in a given set of DNA sequences and demonstrates that the proposed algorithm outperforms current widely used motif finding algorithms.
Abstract: The planted (l, d)-motif search problem is a mathematical abstraction of the DNA functional site discovery task. In this paper, we propose a heuristic algorithm that can find planted (l, d)-signals in a given set of DNA sequences. Evaluations on simulated data sets demonstrate that the proposed algorithm outperforms current widely used motif finding algorithms. We also report the results of experiments on real biological data sets.
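To make the problem statement concrete, the following is a minimal sketch of the planted (l, d)-motif search problem itself, solved by exhaustive enumeration. It is not the heuristic algorithm proposed in the paper, whose details are not given in the abstract; the function names and the example sequences are assumptions chosen for illustration. A candidate string of length l counts as a planted motif if every input sequence contains some substring of length l within Hamming distance d of it.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, seq, d):
    """True if some window of seq is within Hamming distance d of motif."""
    l = len(motif)
    return any(
        hamming(motif, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(seqs, l, d, alphabet="ACGT"):
    """Brute-force search: test every l-mer over the alphabet.

    Runs in time exponential in l, so it is only usable for tiny
    examples; it illustrates the definition, not an efficient method.
    """
    found = []
    for chars in product(alphabet, repeat=l):
        candidate = "".join(chars)
        if all(occurs_within(candidate, s, d) for s in seqs):
            found.append(candidate)
    return found


# Hypothetical toy input: "ACGT" appears exactly or with one mismatch.
seqs = ["TTACGTAA", "GGACCTCC", "ACGAGGGG"]
motifs = planted_motifs(seqs, l=4, d=1)
```

The exponential cost of this enumeration is the reason heuristic approaches such as the one described above are of interest for realistic values of l and d.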
TL;DR: The experimental results demonstrate that the ensemble of SVMs created by the sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than those in widely used ensemble approaches such as bagging and boosting.
Abstract: Data in many biological problems are often compounded by imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms such as support vector machine (SVM) are sensitive to data with imbalanced class distribution, and result in a suboptimal classification. It is desirable to compensate for the imbalance effect in model training for more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderate and extremely high imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique can achieve a higher area under the ROC curve (AUC) value than popular sampling approaches such as random over-/under-sampling and SMOTE sampling, and than those in widely used ensemble approaches such as bagging and boosting.
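The evaluation metric named above, the area under the ROC curve, can be computed directly from classifier scores. The sketch below uses the standard pairwise (Mann-Whitney) formulation: AUC is the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. It illustrates the metric only, not the sample subset optimization technique; the function name and the 0/1 label encoding are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0 (negative) or 1 (positive)
    scores: iterable of classifier scores, higher means more positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Each correctly ordered pair contributes 1, each tie contributes 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because it depends only on how positives rank against negatives, this quantity is not inflated by predicting the majority class for every example, which is why it is a common choice for imbalanced data.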
TL;DR: Information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches are introduced and applied to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage.
Abstract: The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic based approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with a high confidence and coverage. We use the network for predicting functions of uncharacterised proteins.
Availability
Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and Python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.
TL;DR: This work has created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering.
Abstract: Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology (‘MCAM’) employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression.
Overall, we offer MCAM as a broadly-applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.
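The combinatorial step described in this abstract, crossing data transformations, distance metrics, set sizes and clustering algorithms, can be sketched as a Cartesian product of option lists. The option values below are hypothetical placeholders, not the choices used by MCAM, and the sketch only enumerates configurations; running the clusterings and the enrichment-based evaluation is left out.

```python
from itertools import product

# Hypothetical option lists; the actual MCAM choices are not given
# in the abstract.
TRANSFORMS = ["none", "log2", "zscore"]
METRICS = ["euclidean", "correlation"]
SET_SIZES = [5, 10, 20]
ALGORITHMS = ["kmeans", "hierarchical"]


def clustering_configurations(transforms, metrics, set_sizes, algorithms):
    """Enumerate every combination of the four parameter families."""
    return [
        {"transform": t, "metric": m, "k": k, "algorithm": a}
        for t, m, k, a in product(transforms, metrics, set_sizes, algorithms)
    ]


configs = clustering_configurations(
    TRANSFORMS, METRICS, SET_SIZES, ALGORITHMS
)
# 3 transforms x 2 metrics x 3 set sizes x 2 algorithms = 36 configurations
```

Each configuration would then yield one clustering set to be scored, so the number of candidate clusterings grows multiplicatively with every option added to any family.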
TL;DR: These model-based approaches to leveraging the increasing volume of data generated in the course of patient care and medical research suggest a huge opportunity to improve the understanding of biological systems involved in pharmacogenomics and apply them to questions of medical relevance.
Abstract: Pharmacogenomic investigation from both genome-wide association studies and experiments focused on candidate loci involved in drug mechanism and metabolism has yielded a substantial and increasing list of robust genetic effects on drug therapy in humans. At the same time, reasonably comprehensive molecular data such as gene expression, proteomic and metabolomic data are now available for collections of hundreds to thousands of individuals. If these data are structured in a statistically robust and computationally tractable way, such as a network model, they can aid in the analysis of new pharmacogenomics studies by suggesting novel hypotheses for the regulation of genes involved in drug metabolism and response. Similarly, hypotheses taken from these same models can direct genome-wide association studies by focusing the genome-wide association studies analysis on a number of specific hypotheses informed by the relationships customarily seen between a gene's expression or protein activity and genetic variation at a particular locus. Network models based on other sorts of systematic biological data such as cell-based surveys of drug effect on gene expression and mining of literature and electronic medical records for associations between clinical and molecular phenotypes also promise similar utility. Although surely primitive in comparison with what will be developed, these model-based approaches to leveraging the increasing volume of data generated in the course of patient care and medical research nevertheless suggest a huge opportunity to improve our understanding of biological systems involved in pharmacogenomics and apply them to questions of medical relevance.
TL;DR: As the influx of biological data is swelling faster than the available computer capacity, scientists need to find new ways of sustainable data management.
TL;DR: The robustness of RegnANN, its ability to discover second order correlations and the agreement between results obtained with this new method on both synthetic and biological data are promising and they stimulate its application to a wider range of problems.
Abstract: RegnANN is a novel method for reverse engineering gene networks based on an ensemble of multilayer perceptrons. The algorithm builds a regressor for each gene in the network, estimating its neighborhood independently. The overall network is obtained by joining all the neighborhoods. RegnANN makes no assumptions about the nature of the relationships between the variables, potentially capturing high-order and nonlinear dependencies between expression patterns. The evaluation focuses on synthetic data mimicking plausible submodules of larger networks and on biological data consisting of submodules of Escherichia coli. We consider Barabasi and Erdos-Renyi topologies together with two methods for data generation. We verify the effect of factors such as network size and amount of data on the accuracy of the inference algorithm. The accuracy scores obtained with RegnANN are methodically compared with the performance of three reference algorithms: ARACNE, CLR and KELLER. Our evaluation indicates that RegnANN compares favorably with the inference methods tested. The robustness of RegnANN, its ability to discover second order correlations and the agreement between results obtained with this new method on both synthetic and biological data are promising and they stimulate its application to a wider range of problems.
TL;DR: A novel approach to create a data-derived state-and-transition model that shows significant promise for simplifying management definitions of ecological condition and, via scenario analyses, can be used to assist in manager decision-making of large, complex aquatic ecosystems in the future.
TL;DR: In this article, a system and method configured to monitor biological data is presented. The system includes a biosensor assembly for processing biological data of a living body, which consists of a sensor configured to be coupled to a wearable item, for instance a cap worn on the head of a user.
Abstract: A system and method configured to monitor biological data. The system includes a biosensor assembly for processing biological data of a living body. The biosensor assembly comprises a sensor configured to be coupled to a wearable item, for instance a cap worn on the head of a user. The sensor is positioned on the wearable item to gather raw biological data from the body of the user, when the wearable item is worn on the body of the user. The assembly further comprises a controlling unit configured to interpret raw biological data received from the sensor and interpret raw biological data by computing a value representative of a physiological condition of the body of the user. The controlling unit compares the representative value to a threshold value stored in memory to characterize the physiological condition of the user and alerts the user.
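The controlling-unit logic described above (derive a representative value from raw sensor data, compare it to a stored threshold, alert the user) can be sketched in a few lines. The abstract does not say how the representative value is computed or in which direction the comparison triggers an alert, so the use of the arithmetic mean, the "exceeds threshold" rule, and all names here are assumptions.

```python
from statistics import mean


def representative_value(raw_samples):
    """Assumed summary: arithmetic mean of the raw sensor readings."""
    return mean(raw_samples)


def check_condition(raw_samples, threshold):
    """Return (value, alert), where alert is True if value exceeds threshold.

    Both the mean and the 'greater than' rule are illustrative choices.
    """
    value = representative_value(raw_samples)
    return value, value > threshold


# Hypothetical readings and threshold, e.g. temperatures in degrees Celsius.
value, alert = check_condition([36.5, 37.0, 39.5], threshold=37.5)
if alert:
    print(f"Alert: representative value {value:.1f} exceeds threshold")
```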
TL;DR: PhyleasProg is a user-friendly web server as a turnkey tool dedicated to evolutionary analyses, which can help biologists with little experience in evolutionary methodologies by analysing their data in a simple and robust way, using methods corresponding to robust standards.
Abstract: Evolutionary analyses of biological data are becoming a prerequisite in many fields of biology. At a time of high-throughput data analysis, phylogenetics is often a necessary complementary tool for biologists to understand, compare and identify the functions of sequences. But available bioinformatics tools are frequently not easy for non-specialists to use. We developed PhyleasProg (http://phyleasprog.inra.fr), a user-friendly web server as a turnkey tool dedicated to evolutionary analyses. PhyleasProg can help biologists with little experience in evolutionary methodologies by analysing their data in a simple and robust way, using methods corresponding to robust standards. Via a very intuitive web interface, users only need to enter a list of Ensembl protein IDs and a list of species as inputs. After dynamic computations, users have access to phylogenetic trees, positive/purifying selection data (on site and branch-site models), with a display of these results on the protein sequence and on a 3D structure model, and the synteny environment of related genes. This connection between different domains of phylogenetics opens the way to new biological analyses for the discovery of the function and structure of proteins.
TL;DR: This article proposes a 2-step procedure for identifying pathways that are related to and influence the clinical phenotype, and proposes a nonlinear dimension reduction method, which permits flexible within-pathway gene interactions as well as nonlinear pathway effects on the response.
Abstract: Summary: In the analysis of high-throughput biological data, it is often believed that the biological units such as genes behave interactively by groups, that is, pathways in our context. It is conceivable that utilization of priorly available pathway knowledge would greatly facilitate both interpretation and estimation in statistical analysis of such high-dimensional biological data. In this article, we propose a 2-step procedure for the purpose of identifying pathways that are related to and influence the clinical phenotype. In the first step, a nonlinear dimension reduction method is proposed, which permits flexible within-pathway gene interactions as well as nonlinear pathway effects on the response. In the second step, a regularized model-based pathway ranking and selection procedure is developed that is built upon the summary features extracted from the first step. Simulations suggest that the new method performs favorably compared to the existing solutions. An analysis of a glioblastoma microarray data set finds 4 pathways that have evidence of support from the biological literature.
TL;DR: The R package QuACN is under ongoing development and it can be used to answer intriguing research questions in network biology, e.g., classifying biological data or identifying meaningful biological features, by analyzing the topology of biological networks.
Abstract: Structural measures for networks have been extensively developed, but many of them have not yet demonstrated their sustainability. That is, it often remains unclear whether a particular measure is useful and feasible for solving a particular problem in network biology. One example is the classification of complex biological networks, for which structural measures are used that lead to a minimal classification error. Hence, there is a strong need to provide freely available software packages to calculate and demonstrate the appropriate usage of structural graph measures in network biology. Here, we discuss topological network descriptors that are implemented in the R-package QuACN and demonstrate their behavior and characteristics by applying them to a set of example graphs. Moreover, we show a representative application to illustrate their capabilities for classifying biological networks. In particular, we infer gene regulatory networks from microarray data and classify them by methods provided by QuACN. Note that QuACN is the first freely available software written in R containing a large number of structural graph measures. The R package QuACN is under ongoing development and we add promising groups of topological network descriptors continuously. The package can be used to answer intriguing research questions in network biology, e.g., classifying biological data or identifying meaningful biological features, by analyzing the topology of biological networks.
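As an illustration of what a topological network descriptor is, the sketch below computes the Wiener index, a classical distance-based measure defined as the sum of shortest-path lengths over all unordered node pairs of a connected, unweighted graph. It is written in Python rather than R, does not use or reproduce the QuACN API, and is offered only as one example of the kind of structural measure discussed; the adjacency-dictionary representation and function names are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj):
    """Sum of shortest-path distances over all unordered node pairs.

    adj maps each node to an iterable of its neighbours (undirected graph).
    """
    total = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    # Every pair was counted twice, once from each endpoint.
    return total // 2


# Two small example graphs with the same number of nodes and edges.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
```

For these two graphs the index is 10 for the path and 9 for the star, which shows how a single number can separate networks that agree in size but differ in shape, the property that makes such descriptors usable as classification features.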
TL;DR: A data structure, semantically typed data hypercubes (SDCubes), that combines hierarchical data format version 5 (HDF5) and extensible markup language (XML) file formats, now permits the flexible storage, annotation and retrieval of large and heterogeneous datasets.
Abstract: With vast increases in biological data generation, mechanisms for data storage and analysis have become limiting. A data structure, semantically typed data hypercubes (SDCubes), that combines hierarchical data format version 5 (HDF5) and extensible markup language (XML) file formats, now permits the flexible storage, annotation and retrieval of large and heterogeneous datasets.