TL;DR: The toolkit incorporates over 130 functions designed to meet the increasing demand for big-data analyses, ranging from bulk sequence processing to interactive data visualization, along with a new plotting engine developed to maximize their interactivity.
TL;DR: Here I present an investigation of published data for technical controls in droplet-based scRNA-seq experiments that demonstrates that the number of zero values in the data is consistent with common distributional models of molecule sampling counts.
Abstract: Potential users of single-cell RNA-sequencing often encounter a choice between high-throughput droplet-based methods and high-sensitivity plate-based methods. In particular, there is a widespread belief that single-cell RNA-sequencing will often fail to generate measurements for particular (gene, cell) pairs due to molecular inefficiencies, causing data to have an overabundance of zero values. Investigation of published data from technical controls in droplet-based single-cell RNA-seq experiments demonstrates that the number of zeros in the data is consistent with count statistics, indicating that overabundances of zero values in biological data are likely due to biological variation rather than technical shortcomings.
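The comparison described above, observed zeros against the zeros expected from a count model, can be sketched in a few lines. This is a minimal illustration and not the author's analysis: the function names, the example counts, and the choice of the Poisson and negative binomial parameterisation (variance = mean + dispersion × mean²) are assumptions made here.

```python
import math

def poisson_zero_prob(mean):
    """P(X = 0) for a Poisson count with the given mean."""
    return math.exp(-mean)

def negbin_zero_prob(mean, dispersion):
    """P(X = 0) for a negative binomial count.

    Assumed parameterisation: variance = mean + dispersion * mean**2.
    A dispersion of 0 reduces to the Poisson case.
    """
    if dispersion == 0:
        return poisson_zero_prob(mean)
    return (1.0 + dispersion * mean) ** (-1.0 / dispersion)

def observed_zero_fraction(counts):
    """Fraction of entries that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)

# Hypothetical molecule counts for one gene across ten droplets (illustrative only).
counts = [0, 1, 0, 0, 2, 0, 1, 0, 0, 1]
mean = sum(counts) / len(counts)

print("observed zero fraction:", observed_zero_fraction(counts))
print("expected under Poisson:", poisson_zero_prob(mean))
print("expected under NB (dispersion 0.5):", negbin_zero_prob(mean, 0.5))
```

If the observed fraction sits close to the model expectation across genes, no extra "dropout" component is needed to explain the zeros.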
TL;DR: CNGB Sequence Archive (CNSA) is presented for archiving omics data, including raw sequencing data and its analytical data and related metadata which are organized into six objects, namely Project, Sample, Experiment, Run, Assembly, and Variation at present.
Abstract: With the application and development of high-throughput sequencing technology in life and health sciences, massive multi-dimensional biological data brings the problem of efficient management and utilization. Database development and biocuration are the prerequisites for the reuse of these big data. Here, relying on China National GeneBank (CNGB), we present CNGB Sequence Archive (CNSA) for archiving omics data, including raw sequencing data and its analytical data and related metadata which are organized into six objects, namely Project, Sample, Experiment, Run, Assembly, and Variation at present. Moreover, CNSA has created the correlation model of living samples, sample information, and analytical data on some projects, so that all data can be traced throughout the life cycle from the living sample to the sample information to the analytical data. Complying with the data standards commonly used in the life sciences, CNSA is committed to improving the data standards and building a comprehensive and curated data repository for the storage, management and sharing of omics data, and providing free access to open data resources for worldwide scientific communities to support academic research and the bio-industry. Database URL: https://db.cngb.org/cnsa/
TL;DR: The overall organization and procedures of SASBDB, a comprehensive, free and searchable repository of SAS experimental data and models deposited together with the relevant experimental conditions, sample details and instrument characteristics, are described, with most attention paid to user-relevant information during submission.
Abstract: Small-angle scattering (SAS) of X-rays and neutrons is a fundamental tool for studying nanostructural properties, in particular those of biological macromolecules in solution. In structural biology, SAS recently transformed from a specialization into a general technique, leading to a dramatic increase in the number of publications reporting structural models. The growing amount of data recorded and published has led to an urgent need for a global SAS repository that includes both primary data and models. In response to this, a small-angle scattering biological data bank (SASBDB) was designed in 2014 and is available for public access at www.sasbdb.org. SASBDB is a comprehensive, free and searchable repository of SAS experimental data and models deposited together with the relevant experimental conditions, sample details and instrument characteristics. SASBDB is rapidly growing, and presently has over 1,000 entries containing more than 1,600 models. We describe here the overall organization and procedures of SASBDB, paying most attention to user-relevant information during submission. Perspectives of further developments, in particular with the OneDep system of the Protein Data Bank, as well as the widening of SASBDB to include new types of data/models, are discussed.
TL;DR: Advances in chemical synthesis and biological screening technologies over the past decade have shown great potential to provide insights into relevant chemical structures to predict in vitro, in vivo, and clinical outcomes, thereby advancing drug discovery and development in the big data era.
TL;DR: Comparative experiments demonstrated that the XGBoost method performs remarkably well in predicting the stage of cancer patients with multi-omics data; identification of novel candidate genes associated with cancer stages would help to further elucidate disease pathogenesis and develop novel therapeutics.
TL;DR: This review focuses on the recent advances in developing novel computational methods for the inference of novel biological relations from the HMLN, and surveys four categories of state-of-the-art methods (matrix factorization, random walk, knowledge graph, and deep learning).
Abstract: Advances in next-generation sequencing and high-throughput techniques have enabled the generation of vast amounts of diverse omics data. These big data provide an unprecedented opportunity in biology, but impose great challenges in data integration, data mining, and knowledge discovery due to the complexity, heterogeneity, dynamics, uncertainty, and high dimensionality inherent in omics data. Networks have been widely used to represent relations between entities in biological systems, such as protein-protein interaction, gene regulation, and brain connectivity (i.e. network construction), as well as to infer novel relations given a reconstructed network (i.e. link prediction). In particular, the heterogeneous multi-layered network (HMLN) has proven successful in integrating diverse biological data to represent the hierarchy of biological systems. The HMLN provides unparalleled opportunities but imposes new computational challenges on establishing causal genotype-phenotype associations and understanding environmental impact on organisms. In this review, we focus on the recent advances in developing novel computational methods for the inference of novel biological relations from the HMLN. We first discuss the properties of biological HMLNs. We then survey four categories of state-of-the-art methods (matrix factorization, random walk, knowledge graph, and deep learning). Thirdly, we demonstrate their applications to omics data integration and analysis. Finally, we outline strategies for future directions in the development of new HMLN models.
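Of the four method families listed above, the random walk is the simplest to illustrate. The sketch below runs a random walk with restart on a tiny multi-layer toy network and ranks nodes by proximity to a seed, which is one common way to propose candidate links. It is an assumption-laden illustration, not a method from the review: the node names, the restart probability, and the handling of dangling nodes are all chosen here for the example.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, iterations=100):
    """Return steady-state visiting probabilities for a walker that, at every
    step, jumps back to `seed` with probability `restart` and otherwise moves
    to a uniformly chosen neighbour of its current node."""
    nodes = sorted(adjacency)
    scores = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iterations):
        updated = {n: (restart if n == seed else 0.0) for n in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                # Dangling node: send its remaining mass back to the seed.
                updated[seed] += (1.0 - restart) * scores[node]
                continue
            share = (1.0 - restart) * scores[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        scores = updated
    return scores

# Hypothetical three-layer toy network (gene, protein, disease/drug layers).
network = {
    "geneA": ["proteinA"],
    "proteinA": ["geneA", "diseaseX", "drugY"],
    "diseaseX": ["proteinA"],
    "drugY": ["proteinA"],
}

scores = random_walk_with_restart(network, seed="geneA")
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

Nodes that are not directly connected to the seed but score highly are the candidates a link-prediction method would examine first.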
TL;DR: A large-scale, integrated modeling approach to simultaneously cross-evaluate millions of heterogeneous data against themselves, based on an extensive computer model of Escherichia coli that accounts for the function of 1214 genes, providing the most natural, intuitive interpretation of an integrated dataset.
Abstract: The extensive heterogeneity of biological data poses challenges to analysis and interpretation. Construction of a large-scale mechanistic model of Escherichia coli enabled us to integrate and cross-evaluate a massive, heterogeneous dataset based on measurements reported by various groups over decades. We identified inconsistencies with functional consequences across the data, including that the total output of the ribosomes and RNA polymerases described by the data is not sufficient for a cell to reproduce measured doubling times, that measured metabolic parameters are neither fully compatible with each other nor with overall growth, and that essential proteins are absent during the cell cycle, yet the cell is robust to this absence. Finally, considering these data as a whole leads to successful predictions of new experimental outcomes, in this case protein half-lives.
TL;DR: A new biological knowledge graph is proposed which provides a compilation of curated relational data from open biological databases in a unified format with common, interlinked identifiers.
Abstract: Knowledge graphs became a popular means for modeling complex biological systems where they model the interactions between biological entities and their effects on the biological system. They also provide support for relational learning models which are known to provide highly scalable and accurate predictions of associations between biological entities. Despite the success of the combination of biological knowledge graph and relation learning models in biological predictive tasks, there is a lack of unified biological knowledge graph resources. This forced all current efforts and studies for applying a relational learning model on biological data to compile and build biological knowledge graphs from open biological databases. This process is often performed inconsistently across such efforts, especially in terms of choosing the original resources, aligning identifiers of the different databases, and assessing the quality of included data. To make relational learning on biomedical data more standardised and reproducible, we propose a new biological knowledge graph which provides a compilation of curated relational data from open biological databases in a unified format with common, interlinked identifiers. We also provide a new module for mapping identifiers and labels from different databases which can be used to align our knowledge graph with biological data from other heterogeneous sources. Finally, to illustrate the practical relevance of our work, we provide a set of benchmarks based on the presented data that can be used to train and assess the relational learning models in various tasks related to pathway and drug discovery.
TL;DR: This work proposes a sparse neural encoder-decoder network to predict metabolite abundances from microbe abundances using paired data from a cohort of inflammatory bowel disease (IBD) patients and shows that the model outperforms linear univariate and multivariate methods in terms of accuracy, sparsity, and stability.
Abstract: Technological advances in next-generation sequencing (NGS) and chromatographic assays [e.g., liquid chromatography mass spectrometry (LC-MS)] have made it possible to identify thousands of microbe and metabolite species, and to measure their relative abundance. In this paper, we propose a sparse neural encoder-decoder network to predict metabolite abundances from microbe abundances. Using paired data from a cohort of inflammatory bowel disease (IBD) patients, we show that our neural encoder-decoder model outperforms linear univariate and multivariate methods in terms of accuracy, sparsity, and stability. Importantly, we show that our neural encoder-decoder model is not simply a black box designed to maximize predictive accuracy. Rather, the network’s hidden layer (i.e., the latent space, comprised only of sparsely weighted microbe counts) actually captures key microbe-metabolite relationships that are themselves clinically meaningful. Although this hidden layer is learned without any knowledge of the patient’s diagnosis, we show that the learned latent features are structured in a way that predicts IBD and treatment status with high accuracy. By imposing a non-negative weights constraint, the network becomes a directed graph where each downstream node is interpretable as the additive combination of the upstream nodes. Here, the middle layer comprises distinct microbe-metabolite axes that relate key microbial biomarkers with metabolite biomarkers. By pre-processing the microbiome and metabolome data using compositional data analysis methods, we ensure that our proposed multi-omics workflow will generalize to any pair of -omics data. To the best of our knowledge, this work is the first application of neural encoder-decoders for the interpretable integration of multi-omics biological data.
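The abstract above mentions pre-processing with compositional data analysis methods but does not name a specific transform. A centred log-ratio (CLR) transform is one widely used option, and the sketch below assumes it purely for illustration; the function name and the pseudocount used to avoid log(0) are choices made here, not details from the paper.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform of one sample's abundance vector.

    Each value becomes the log of its abundance relative to the sample's
    geometric mean, so the output sums to zero and no longer depends on the
    sample's total count (sequencing depth) when pseudocount is 0.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical microbe counts for one sample (illustrative only).
sample = [120, 30, 0, 850]
print([round(v, 3) for v in clr(sample)])
```

Because the transform removes the arbitrary total, the same step can be applied to any pair of relative-abundance tables before they are fed to a downstream model.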
TL;DR: The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets.
Abstract: In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to enlighten us on disease state dynamics, mechanisms of their onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to boost the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.
TL;DR: A novel feature selection algorithm, termed “sigFeature”, based on SVM and t statistic, which is able to predict the signature of four out of six microarray data sets accurately, whereas the other algorithms predict fewer data set signatures.
Abstract: Biological data are accumulating at an increasing rate, but interpreting them still remains a problem. Classifying biological data into distinct groups is the first step in understanding them. Data classification in response to a certain treatment is an extremely important aspect for differentially expressed genes in making present/absent calls. Many feature selection algorithms have been developed, including the support vector machine recursive feature elimination procedure (SVM-RFE) and its variants. Support vector machine RFEs are greedy methods that attempt to find the best possible combinations leading to binary classification, which may not be biologically significant. To overcome this limitation of SVM-RFE, we propose a novel feature selection algorithm, termed "sigFeature" (https://bioconductor.org/packages/sigFeature/), based on SVM and t statistic to discover the differentially significant features along with good performance in classification. The "sigFeature" R package is centered around a function called "sigFeature," which provides automatic selection of features for binary classification. Using six publicly available microarray data sets (downloaded from Gene Expression Omnibus) with different biological attributes, we further compared the performance of "sigFeature" to three other feature selection algorithms. A small number of features selected by "sigFeature" also shows higher classification accuracy. For further downstream evaluation of its biological signature, we conducted gene set enrichment analysis with the selected features (genes) from "sigFeature" and compared it with the outputs of other algorithms. We observed that "sigFeature" is able to predict the signature of four out of six microarray data sets accurately, whereas the other algorithms predict fewer data set signatures. Thus, "sigFeature" is considerably better than related algorithms in discovering differentially significant features from microarray data sets.
TL;DR: It is demonstrated that human essential genes can be accurately predicted by designing an effective machine learning algorithm and integrating representative features derived from sequence data and the protein-protein interaction network.
Abstract: Accurately predicting essential genes using computational methods can greatly reduce the time and resources needed to find them via wet-lab experiments, and further accelerate the process of drug discovery. Several computational methods have been proposed for predicting essential genes in model organisms by integrating multiple biological data sources, either via centrality measures or machine learning based methods. However, methods aiming to predict human essential genes are still limited and their performance still needs to improve. In addition, most machine learning based essential gene prediction methods lack the ability to handle the imbalanced learning issue inherent in the essential gene prediction problem, which might be one factor affecting their performance. We propose a deep learning based method, DeepHE, to predict human essential genes by integrating features derived from sequence data and the protein-protein interaction (PPI) network. A deep learning based network embedding method is utilized to automatically learn features from the PPI network. In addition, 89 sequence features were derived from the DNA sequence and protein sequence for each gene. These two types of features are integrated to train a multilayer neural network. A cost-sensitive technique is used to address the imbalanced learning problem when training the deep neural network. The experimental results for predicting human essential genes show that our proposed method, DeepHE, can accurately predict human gene essentiality with an average AUC higher than 94%, an area under the precision-recall curve (AP) higher than 90%, and an accuracy higher than 90%. We also compare DeepHE with several widely used traditional machine learning models (SVM, Naive Bayes, Random Forest, and Adaboost) using the same features and the same cost-sensitive technique to counter the imbalanced learning issue. The experimental results show that DeepHE significantly outperforms the compared machine learning models. We have demonstrated that human essential genes can be accurately predicted by designing an effective machine learning algorithm and integrating representative features captured from available biological data. The proposed deep learning framework is effective for such a task.
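The abstract above refers to a cost-sensitive technique for the class imbalance but does not give its form. One common variant, assumed here only as an illustration, weights each class inversely to its frequency and uses those weights in the training loss. The function names and the toy labels are hypothetical and are not taken from DeepHE.

```python
import math

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so that
    rare classes (e.g. essential genes) contribute as much as common ones."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

def weighted_log_loss(labels, probabilities, weights):
    """Class-weighted binary cross-entropy; `probabilities` are P(label == 1)."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for label, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        w = weights[label]
        total += -w * (math.log(p) if label == 1 else math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum

# Hypothetical labels: 1 = essential gene (rare), 0 = non-essential gene.
labels = [1, 0, 0, 0]
weights = balanced_class_weights(labels)
print(weights)
print(weighted_log_loss(labels, [0.9, 0.2, 0.1, 0.3], weights))
```

With these weights, a misclassified essential gene costs three times as much as a misclassified non-essential gene in this toy set, which counteracts the 1:3 imbalance.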
TL;DR: This review explores how big data has been analyzed to identify the general patterns of biological responses to global changes at scales from gene to ecosystem, and investigates how observational networks and space‐based big data have facilitated the discovery of emergent mechanisms and phenomena.
Abstract: Global change biology has been entering a big data era due to the vast increase in availability of both environmental and biological data. Big data refers to large data volume, complex data sets, and multiple data sources. The recent use of such big data is improving our understanding of interactions between biological systems and global environmental changes. In this review, we first explore how big data has been analyzed to identify the general patterns of biological responses to global changes at scales from gene to ecosystem. After that, we investigate how observational networks and space-based big data have facilitated the discovery of emergent mechanisms and phenomena on the regional and global scales. Then, we evaluate the predictions of terrestrial biosphere under global changes by big modeling data. Finally, we introduce some methods to extract knowledge from big data, such as meta-analysis, machine learning, traceability analysis, and data assimilation. The big data has opened new research opportunities, especially for developing new data-driven theories for improving biological predictions in Earth system models, tracing global change impacts across different organismic levels, and constructing cyberinfrastructure tools to accelerate the pace of model-data integrations. These efforts will uncork the bottleneck of using big data to understand biological responses and adaptations to future global changes.
TL;DR: pyBOAT is a Python-based, fully automatic stand-alone software that integrates multiple steps of non-stationary oscillatory time series analysis into an easy-to-use graphical user interface.
Abstract: Methods for the quantification of rhythmic biological signals have been essential for the discovery of function and design of biological oscillators. Advances in live measurements have allowed recordings of unprecedented resolution, revealing a new world of complex heterogeneous oscillations with multiple noisy non-stationary features. However, our understanding of the underlying mechanisms regulating these oscillations has been lagging behind, partially due to the lack of simple tools to reliably quantify these complex non-stationary features. With this challenge in mind, we have developed pyBOAT, a Python-based, fully automatic stand-alone software that integrates multiple steps of non-stationary oscillatory time series analysis into an easy-to-use graphical user interface. pyBOAT implements continuous wavelet analysis, which is specifically designed to reveal time-dependent features. In this work we illustrate the advantages of our tool by analyzing complex non-stationary time-series profiles. Our approach integrates data visualization, optimized sinc-filter detrending, amplitude envelope removal and a subsequent continuous-wavelet based time-frequency analysis. Finally, using analytical considerations and numerical simulations, we discuss unexpected pitfalls in commonly used smoothing and detrending operations.
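The sinc-filter detrending step named above can be illustrated with a windowed-sinc low-pass filter: the smoothed signal is taken as the trend and subtracted from the data. This is a generic sketch and not pyBOAT's implementation; the function names, the Blackman window, the default kernel width, and the edge handling are assumptions made here.

```python
import math

def sinc_lowpass_kernel(cutoff_period, dt=1.0, half_width=None):
    """Blackman-windowed sinc kernel that passes periods longer than
    `cutoff_period` (same time units as `dt`). Weights are normalised to sum to 1."""
    fc = dt / cutoff_period  # cut-off frequency in cycles per sample
    if half_width is None:
        half_width = int(2 * cutoff_period / dt)
    kernel = []
    for k in range(-half_width, half_width + 1):
        x = 2.0 * fc * k
        sinc = 1.0 if k == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = (0.42 + 0.5 * math.cos(math.pi * k / half_width)
                  + 0.08 * math.cos(2.0 * math.pi * k / half_width))
        kernel.append(2.0 * fc * sinc * window)
    total = sum(kernel)
    return [w / total for w in kernel]

def sinc_detrend(signal, cutoff_period, dt=1.0):
    """Return (detrended, trend). Near the edges only the part of the kernel
    that overlaps the signal is used, renormalised to unit weight."""
    kernel = sinc_lowpass_kernel(cutoff_period, dt)
    half = len(kernel) // 2
    n = len(signal)
    trend = []
    for i in range(n):
        acc, norm = 0.0, 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                acc += w * signal[j]
                norm += w
        trend.append(acc / norm)
    detrended = [s - t for s, t in zip(signal, trend)]
    return detrended, trend

# Hypothetical signal: a slow linear drift plus an oscillation with period 8 samples.
signal = [0.05 * t + math.sin(2.0 * math.pi * t / 8.0) for t in range(200)]
detrended, trend = sinc_detrend(signal, cutoff_period=40.0)
print(round(trend[100], 3), round(detrended[100], 3))
```

The cut-off period has to sit well above the periods of interest. As the abstract warns, smoothing and detrending can introduce artefacts of their own, so the result should be inspected rather than trusted blindly.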
TL;DR: An exhaustive review of methodological alternatives for coupling landscape graphs with biological data reveals a dichotomy between approaches aimed at improving ecological knowledge and approaches with an operational aim, in which biological data are directly integrated into the graph construction and assuming a positive effect of connectivity.
Abstract: Landscape graphs are widely used to model networks of habitat patches. As they require little input data, they are particularly suitable for supporting conservation decisions (and decisions about other issues as e.g. disease spread) taken by land planners. However, it may be problematic to use these methods in operational contexts without validating them with empirical data on species or communities. Since little is known about methodological alternatives for coupling landscape graphs with biological data, we have made an exhaustive review of these methods to analyze links between the main purposes of the studies, the way landscape graphs are constructed and used, the type of field data, and the way these data are integrated into the analysis. We systematically describe a corpus of 71 scientific papers dealing with terrestrial species, with particular emphasis on methodological choices and contexts of the studies. Despite a great variability of types of biological data and coupling strategies, our analyses reveal a dichotomy according to the objective of the studies, between (i) approaches aimed at improving ecological knowledge, mainly based on land-cover maps and using biological data to test the influence of landscape connectivity on biological responses, and (ii) approaches with an operational aim, in which biological data are directly integrated into the graph construction and assuming a positive effect of connectivity. Beyond these main contrasts, the review shows that landscape graphs can benefit from field data of different types at varying scales. The great variability of approaches adopted reveals the flexible nature of these tools.
TL;DR: This chapter is composed of four sections, which highlight the different machine learning techniques in bioinformatics and analyze the various research areas related to bioinformatics that can be explored by academicians and researchers.
Abstract: Machine learning (ML) deals with the automated learning of machines without being programmed explicitly. It focuses on performing data-based predictions and has several applications in the field of bioinformatics. Bioinformatics involves the processing of biological data using approaches based on computation and mathematics. Biological data has grown exponentially in recent times, leading to two issues. The first is efficient information storage, and the second is how useful knowledge can be mined from the data. The second issue can be addressed using machine learning, which can generate knowledge from data that is heterogeneous in nature. Deep learning, a machine learning technique, enables automatic feature learning: new sets of features are constructed by combining multiple features based on the dataset. This approach enables algorithms to perform complex predictions on large datasets. ML is currently being applied in six key subfields of bioinformatics: microarrays, evolution, systems biology, genomics, text mining, and proteomics. This chapter is composed of four sections. The first section provides an outline of ML in bioinformatics. The second section highlights the different machine learning techniques in bioinformatics. The third section describes two case studies using artificial neural networks in bioinformatics. The fourth section analyzes the various research areas related to bioinformatics that can be explored by academicians and researchers. The conclusion of the chapter is presented at the end.
TL;DR: This work proposes a variant of principal component analysis, sparse contrastive principal components analysis, that extracts sparse, stable, interpretable, and relevant biological signal.
Abstract: Motivation: Statistical analyses of high-throughput sequencing data have re-shaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances. However, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously. Results: Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis (PCA), sparse contrastive PCA, that extracts sparse, stable, interpretable and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study and via analyses of several publicly available protein expression, microarray gene expression and single-cell transcriptome sequencing datasets. Availability and implementation: A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in this article is also available via GitHub. Contact: philippe_boileau@berkeley.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
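The idea of using control data to remove unwanted variation can be illustrated with plain contrastive PCA: take the leading eigenvector of the target covariance minus a multiple of the background (control) covariance. The sketch below is not the scPCA package and omits the sparsity penalty that the proposed method adds. The function names, the contrast parameter `alpha`, and the toy data are assumptions made here.

```python
import math

def covariance(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centred = [r[j] - means[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centred[a] * centred[b] / (n - 1)
    return cov

def leading_contrastive_direction(target, background, alpha=1.0, iterations=500):
    """Top eigenvector of C_target - alpha * C_background via power iteration.

    A diagonal shift (a Gershgorin bound on the spectrum) keeps all eigenvalues
    non-negative, so the iteration converges to the largest algebraic eigenvalue.
    """
    ct, cb = covariance(target), covariance(background)
    d = len(ct)
    m = [[ct[a][b] - alpha * cb[a][b] for b in range(d)] for a in range(d)]
    shift = max(sum(abs(v) for v in row) for row in m)
    v = [1.0 / math.sqrt(d)] * d
    if shift == 0.0:
        return v  # contrast matrix is all zeros: no preferred direction
    for _ in range(iterations):
        w = [sum(m[a][b] * v[b] for b in range(d)) + shift * v[a] for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data: feature 0 carries large technical variation present in both sets,
# feature 1 carries smaller variation found only in the target set.
background = [[-3.0, 0.0], [3.0, 0.0], [-3.0, 0.0], [3.0, 0.0]]
target = [[-3.0, -1.0], [3.0, -1.0], [-3.0, 1.0], [3.0, 1.0]]
print(leading_contrastive_direction(target, background))
```

Ordinary PCA on the target alone would pick feature 0 because it has the larger variance. Subtracting the background covariance redirects the component to feature 1, the signal unique to the target.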
TL;DR: This review discusses the challenges related to storage, transfer, access and analysis of unstructured and structured biological big data and discusses the future perspectives of big data bioinformatics.
TL;DR: Three pattern matching algorithms that are specially formulated to speed up searches on large DNA sequences are introduced and the experimental results demonstrate the superiority of the presented algorithms over the other simulated algorithms.
Abstract: The remarkable growth of biological data is a motivation to accelerate the discovery of solutions in many domains of computational bioinformatics. In different phases of the computational pipelines, pattern matching is a very practical operation. For example, pattern matching enables users to find the locations of particular DNA subsequences in a database or DNA sequence. Furthermore, in these expanding biological databases, some patterns are updated over time. To perform faster searches, high-speed pattern matching algorithms are needed. The present paper introduces three pattern matching algorithms that are specially formulated to speed up searches on large DNA sequences. The proposed algorithms raise performance by utilizing word processing (in place of the character processing presented in previous works) and also by searching the least frequent word of the pattern in the sequence. In terms of time cost, the experimental results demonstrate the superiority of the presented algorithms over the other simulated algorithms.
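The two ideas named above, comparing whole words instead of single characters and anchoring the search on the pattern's least frequent word, can be sketched as follows. This is a simplified illustration and not one of the three algorithms from the paper: the function name, the fixed word size, and the use of a word-count table built in one pass over the sequence are assumptions made here.

```python
def find_pattern(sequence, pattern, word_size=4):
    """Return all start positions of `pattern` in `sequence`.

    The pattern's rarest word (substring of length `word_size`) is located in
    the sequence first; the full pattern is compared only at those anchors.
    """
    m = len(pattern)
    if m == 0 or m > len(sequence):
        return []
    word_size = min(word_size, m)

    # Count every word of the sequence once (reusable across many patterns).
    counts = {}
    for i in range(len(sequence) - word_size + 1):
        word = sequence[i:i + word_size]
        counts[word] = counts.get(word, 0) + 1

    # Pick the offset in the pattern whose word is rarest in the sequence.
    offsets = range(m - word_size + 1)
    best = min(offsets, key=lambda o: counts.get(pattern[o:o + word_size], 0))
    anchor = pattern[best:best + word_size]

    hits = []
    position = sequence.find(anchor)
    while position != -1:
        start = position - best
        # Verify the whole pattern around the anchor occurrence.
        if start >= 0 and sequence[start:start + m] == pattern:
            hits.append(start)
        position = sequence.find(anchor, position + 1)
    return hits

# Hypothetical DNA fragment (illustrative only).
dna = "ACGTACGTTTGACGTA"
print(find_pattern(dna, "ACGTA"))
```

The rarer the anchor word, the fewer full comparisons are needed, which is where the speed-up over character-by-character scanning would come from on long sequences.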
TL;DR: Focusing on the use of DL to analyse patterns in data from diverse biological domains, different DL architectures’ applications to these data are investigated and some open research challenges in using DL to mine biological data are outlined and a number of possible future perspectives are put forward.
Abstract: Recent technological advancements in data acquisition tools have allowed life scientists to acquire multimodal data from different biological application domains. Broadly categorized into three types (i.e., sequences, images, and signals), these data are huge in amount and complex in nature. Mining such an enormous amount of data for pattern recognition is a big challenge and requires sophisticated data-intensive machine learning techniques. Artificial neural network-based learning systems are well known for their pattern recognition capabilities, and lately their deep architectures, known as deep learning (DL), have been successfully applied to solve many complex pattern recognition problems. Highlighting the role of DL in recognizing patterns in biological data, this article provides: applications of DL to biological sequence, image, and signal data; an overview of open access sources of these data; a description of open source DL tools applicable to these data; and a comparison of these tools from qualitative and quantitative perspectives. At the end, it outlines some open research challenges in mining biological data and puts forward a number of possible future perspectives.
TL;DR: This study presents the first feasibility study of the extraction of local and large-scale biological phenomena from the Chinese weather radar network data using the advanced deep learning method.
Abstract: Continental coverage and year-round operation of the weather radar networks provide an unprecedented opportunity for studying large-scale airborne migration. The broad and local-scale airborne information collected by these infrastructures can answer many ecological questions. However, extracting and interpreting the biological information from such massive weather radar data remains an intractable problem. Recently, many big-data problems have been solved using the deep learning technology. In this study, the biological information in the weather radar data is identified using the advanced deep learning method. The proposed method consists of two main parts, i.e., a rendering and casting procedure and an image segmentation procedure based on a convolutional neural network. The biological data are automatically extracted by rendering and mapping, image segmentation, and result masking. By analyzing the typical radar data from single and multiple stations, we partly reveal the intensity and speed of the migration pattern. We present the first feasibility study of the extraction of local and large-scale biological phenomena from the Chinese weather radar network data.
TL;DR: Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions that is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types and outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks.
Abstract: High-throughput biological data analysis commonly involves the identification of "interesting" features (e.g., genes, genomic regions, and proteins), whose values differ between two conditions, from numerous features measured simultaneously. To ensure the reliability of such analysis, the most widely-used criterion is the false discovery rate (FDR), the expected proportion of uninteresting features among the identified ones. Existing bioinformatics tools primarily control the FDR based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions, two requirements that are often unmet in biological studies. To address this issue, we propose Clipper, a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper is applicable to identifying both enriched and differential features from high-throughput biological data of diverse types. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including peak calling from ChIP-seq data, differentially expressed gene identification from RNA-seq data, differentially interacting chromatin region identification from Hi-C data, and peptide identification from mass spectrometry data. Notably, our benchmarking results for peptide identification are based on the first mass spectrometry data standard that has a realistic dynamic range. Our results demonstrate Clipper's flexibility and reliability for FDR control, as well as its broad applications in high-throughput data analysis.
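To make the idea of p-value-free FDR control concrete, here is a minimal sketch of one well-known generic approach: Barber-Candès style thresholding of per-feature contrast scores. The abstract does not describe Clipper's internals, so this is an illustration of the general idea rather than Clipper's algorithm. The function names, the toy scores, and the target level `q` are all assumptions made for this example.

```python
def bc_threshold(contrast_scores, q):
    """Return the smallest cutoff t > 0 whose estimated FDR is <= q, or None.

    Assumption: each feature has a contrast score that tends to be large and
    positive for interesting features and is symmetric around zero otherwise.
    The count of scores <= -t then estimates the false positives among
    scores >= t, so no p-values are needed.
    """
    candidates = sorted({abs(c) for c in contrast_scores if c != 0})
    for t in candidates:
        negatives = sum(1 for c in contrast_scores if c <= -t)
        positives = sum(1 for c in contrast_scores if c >= t)
        if (1 + negatives) / max(positives, 1) <= q:
            return t
    return None


def discoveries(contrast_scores, q):
    """Indices of features called interesting at target FDR level q."""
    t = bc_threshold(contrast_scores, q)
    if t is None:
        return []
    return [i for i, c in enumerate(contrast_scores) if c >= t]


# Hypothetical scores: ten clearly positive features and five near zero.
scores = [5.0, 4.5, 4.0, 3.8, 3.5, 3.2, 3.0, 2.9, 2.7, 2.5,
          0.4, -0.3, 0.2, -0.5, 0.1]
cutoff = bc_threshold(scores, q=0.2)
selected = discoveries(scores, q=0.2)
```

On this toy input the cutoff lands just above the cluster of near-zero scores, so one noise feature is still selected. The procedure bounds the expected proportion of false discoveries; it does not guarantee that every individual call is correct.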
TL;DR: The Feature-Viewer is a lightweight library for the visualization of biological data mapped to a protein or nucleotide sequence, designed for ease of use while allowing full customization.
Abstract: Summary: The Feature-Viewer is a lightweight library for the visualization of biological data mapped to a protein or nucleotide sequence. It is designed for ease of use while allowing full customization. The library is already used by several biological data resources and allows intuitive visual mapping of a full spectrum of sequence features for different usages. Availability and implementation: The Feature-Viewer is open source, compatible with state-of-the-art development technologies, and responsive, including for mobile viewing. Documentation and usage examples are available online.
TL;DR: BCCGD, a method for integrating subcellular localization data with PPINs that detects breast cancer candidate genes in protein complexes, is presented; the results show that BCCGD discovers some novel breast cancer candidate genes, which are valuable references for biomedical scientists.
Abstract: Due to technological advances, the quality and availability of biological data have increased dramatically in the last decade. Analysing protein-protein interaction networks (PPINs) in an integrated way, together with subcellular compartment data, provides biological context, helps to fill in the gaps between a single type of biological data and disease-causing genes, and can identify novel genes related to disease. In this study, we present BCCGD, a method for integrating subcellular localization data with PPINs that detects breast cancer candidate genes in protein complexes. We achieve this by defining the significance of the compartment, constructing edge-weighted PPINs, finding protein complexes with a non-negative matrix factorization approach, generating disease-specific networks based on the known disease genes, and prioritizing disease candidate genes with a WDC method. As a case study, we investigate breast cancer, but the techniques described here are applicable to other disorders. For the top genes scored by the BCCGD approach, we use a literature retrieval method to test their correlation with breast cancer. The results show that BCCGD discovers some novel breast cancer candidate genes, which are valuable references for biomedical scientists.
TL;DR: The multi-constrained joint non-negative matrix factorization (MCJNMF) was developed for simultaneous integration of genomic data and image data to identify common modules related to disease, and was successfully applied to module identification for lung metastasis in soft tissue sarcomas (STSs).
Abstract: Objective: Studying pathogenic mechanisms at the genetic level with imaging genetics methods makes it possible to effectively reveal the association between histopathology and genetics. However, there is a lack of effective and accurate tools to establish association models from the macroscopic to the microscopic level. Methods: The multi-constrained joint non-negative matrix factorization (MCJNMF) was developed for simultaneous integration of genomic data and image data to identify common modules related to disease. Two types of data matrices were projected onto a common feature space, in which heterogeneous variables with large coefficients in the same projected direction form a common module. Meanwhile, the correlation between original data features was integrated by using regularization constraints to improve the biological relevance. Sparsity constraints and orthogonal constraints were imposed on decomposition factors to minimize the redundancy between different bases and to reduce algorithm complexity. Results: This algorithm was successfully applied to module identification for lung metastasis in soft tissue sarcomas (STSs) by integrating FDG-PET image and DNA methylation data features. Multilevel analysis of the top extracted modules revealed that these modules were closely related to lung metastasis. In particular, several genes with diagnostic potential for lung metastasis can be discovered from high-scoring modules. Conclusion: This method not only can be applied to the accurate identification of patterns related to the pathogenic mechanisms of diseases, but also has significant implications for discovering protein biomarkers. Significance: This method provides avenues for further studies of identifying complex association patterns of diseases according to different types of biological data.
TL;DR: A novel Three-layer heterogeneous network Combined with unbalanced Random Walk for MiRNA-Disease Association prediction algorithm (TCRWMDA) is proposed, which can effectively integrate multi-source association data and is shown to be an effective tool for predicting potential miRNA-disease associations.
Abstract: miRNAs play an important role in many biological processes, and increasing evidence shows that miRNAs are closely related to human diseases. Most existing miRNA-disease association prediction methods were based only on data related to miRNAs and diseases and failed to effectively use other existing biological data. However, experimentally verified miRNA-disease associations are limited, and there are complex correlations between biological data. Therefore, we propose a novel Three-layer heterogeneous network Combined with unbalanced Random Walk for MiRNA-Disease Association prediction algorithm (TCRWMDA), which can effectively integrate multi-source association data. TCRWMDA not only uses the known miRNA-disease associations but also adds new prior information (lncRNA-miRNA and lncRNA-disease associations) to build a three-layer heterogeneous network; lncRNA is added as an intermediate transition path to mine more effective information between networks. The AUC value obtained by the TCRWMDA algorithm in 5-fold cross validation is 0.9209; compared with other models based on the same similarity calculation method, TCRWMDA obtained better results. TCRWMDA was applied to the analysis of four types of cancer, and the results showed that TCRWMDA is an effective tool for predicting potential miRNA-disease associations. The source code and dataset of TCRWMDA are available at: https://github.com/ylm0505/TCRWMDA.
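The general idea of scoring candidate diseases by walking over a miRNA-lncRNA-disease network can be sketched with a plain random walk with restart. This is a simplified stand-in, not the unbalanced walk used by TCRWMDA, whose details the abstract does not give. The node names, edges, and the `restart` value below are hypothetical.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, iters=200):
    """Approximate the visiting probabilities of a walker that starts at
    `seed`, moves to a uniformly chosen neighbour with probability
    1 - restart, and jumps back to `seed` with probability `restart`.

    Assumption: `adjacency` maps every node to a non-empty neighbour list.
    """
    prob = {node: (1.0 if node == seed else 0.0) for node in adjacency}
    for _ in range(iters):
        nxt = {node: (restart if node == seed else 0.0) for node in adjacency}
        for node, neighbours in adjacency.items():
            share = (1.0 - restart) * prob[node] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        prob = nxt
    return prob


# Hypothetical three-layer toy network (undirected edges listed both ways).
network = {
    "miR-A": ["lnc-1", "dis-X"],
    "miR-B": ["lnc-2", "dis-Z"],
    "lnc-1": ["miR-A", "dis-Y"],
    "lnc-2": ["miR-B", "dis-Y"],
    "dis-X": ["miR-A"],
    "dis-Y": ["lnc-1", "lnc-2"],
    "dis-Z": ["miR-B"],
}
scores = random_walk_with_restart(network, seed="miR-A")
# Rank diseases with no known link to miR-A by their visiting probability.
candidates = sorted(["dis-Y", "dis-Z"], key=lambda d: scores[d], reverse=True)
```

In this toy network, miR-A has no direct edge to dis-Y, but the lncRNA layer provides a short path to it, so dis-Y ranks above the more distant dis-Z. This mirrors the role the abstract gives to lncRNA as an intermediate transition path.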
TL;DR: An intelligent multi-agent-based performance optimization approach that blends state-of-the-art probabilistic, recurrent neural network and long short term memory models to intelligently predict the upcoming behavior and optimization needs of the system.
TL;DR: Using statistically derived bioregions provides a repeatable framework for the spatial representation of biodiversity at multiple spatial scales and results in better-informed management decisions and biodiversity conservation outcomes.
Abstract: Bioregions are important tools for understanding and managing natural resources. Bioregions should describe locations where relatively homogeneous assemblages of species occur, enabling managers to better regulate activities that might affect these assemblages. Many existing bioregionalization approaches, which rely on expert-derived, Delphic comparisons or environmental surrogates, do not explicitly include observed biological data in such analyses. We highlight that, for bioregionalizations to be useful and reliable for systems scientists and managers, the bioregionalizations need to be based on biological data; to include an easily understood assessment of uncertainty, preferably in a spatial format matching the bioregions; and to be scientifically transparent and reproducible. Statistical models provide a scientifically robust, transparent, and interpretable approach for ensuring that bioregions are formed on the basis of observed biological and physical data. Using statistically derived bioregions provides a repeatable framework for the spatial representation of biodiversity at multiple spatial scales. This results in better-informed management decisions and biodiversity conservation outcomes.
TL;DR: An algorithm and a stand-alone application for adaptive filtering of insignificant features in high-throughput data, based on Gaussian mixture decomposition, are presented; the algorithm increases the sensitivity of methods used to find differentially expressed features and the biological validity of the findings.
Abstract: Data filtering based on removing non-informative features, with unchanged signal between compared experimental conditions, can significantly increase the sensitivity of methods used to detect differentially expressed genes or other molecular components measured in high-throughput biological experiments. Criteria for data filtering can be stated on the basis of averages or variances of signal levels across samples. The crucial parts of feature filtering are the selection of filter type and cut-off threshold, which are specific to the particular dataset. In this paper, we present an algorithm and a stand-alone application, GaMRed, for adaptive filtering of insignificant features in high-throughput data, based on Gaussian mixture decomposition. We have tested the performance of our algorithm using datasets from three different high-throughput biological experiments. We estimated the number of differentially expressed features after applying multiple testing correction and performed functional analysis of the obtained features using Gene Ontology terms. We also checked whether the control of the false discovery rate and family-wise error rate after applying feature filtering remains at an appropriate level. GaMRed is fast, automatic, and does not require expert knowledge in parameter tuning. The algorithm increases the sensitivity of methods used to find differentially expressed features and the biological validity of the findings. The program can be downloaded from: http://zaed.aei.polsl.pl/index.php/pl/oprogramowanie-zaed .
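As a rough illustration of filtering by Gaussian mixture decomposition, the sketch below fits a two-component mixture to a per-feature summary (for example, mean log signal across samples) with a basic EM loop, then keeps the features assigned to the higher component. It is not the GaMRed implementation. The number of components, the initialisation, the cut-off rule, and the simulated data are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_gmm(values, iters=200):
    """Fit weights, means and variances of a 1-D two-Gaussian mixture by EM."""
    xs = sorted(values)
    half = len(xs) // 2
    groups = [xs[:half], xs[half:]]  # initialise by splitting at the median
    mu = [sum(g) / len(g) for g in groups]
    var = [max(sum((v - m) ** 2 for v in g) / len(g), 1e-6)
           for g, m in zip(groups, mu)]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each value is from component 1.
        resp = []
        for v in xs:
            p0 = w[0] * normal_pdf(v, mu[0], var[0])
            p1 = w[1] * normal_pdf(v, mu[1], var[1])
            resp.append(p1 / (p0 + p1 + 1e-300))
        # M-step: re-estimate parameters from the soft assignments.
        n1 = sum(resp)
        n0 = len(xs) - n1
        mu = [sum((1 - r) * v for r, v in zip(resp, xs)) / n0,
              sum(r * v for r, v in zip(resp, xs)) / n1]
        var = [max(sum((1 - r) * (v - mu[0]) ** 2 for r, v in zip(resp, xs)) / n0, 1e-6),
               max(sum(r * (v - mu[1]) ** 2 for r, v in zip(resp, xs)) / n1, 1e-6)]
        w = [n0 / len(xs), n1 / len(xs)]
    return w, mu, var


def keep_informative(values, w, mu, var):
    """Indices of features more likely to come from the higher component."""
    kept = []
    for i, v in enumerate(values):
        p0 = w[0] * normal_pdf(v, mu[0], var[0])
        p1 = w[1] * normal_pdf(v, mu[1], var[1])
        # The v > mu[0] guard avoids keeping extreme low values that a wide
        # upper component could otherwise claim in the far left tail.
        if p1 > p0 and v > mu[0]:
            kept.append(i)
    return kept


# Simulated summary signal: 200 low-signal features, then 100 high-signal ones.
rng = random.Random(0)
signal = [rng.gauss(2.0, 0.5) for _ in range(200)] + \
         [rng.gauss(8.0, 1.0) for _ in range(100)]
weights, means, variances = fit_two_component_gmm(signal)
kept = keep_informative(signal, weights, means, variances)
```

The cut-off here is implied by where the two weighted densities cross, so it adapts to each dataset rather than being fixed in advance. That is the property the abstract highlights, although the actual rule used by GaMRed may differ.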