TL;DR: ggtree is an R package for programmable visualization and annotation of phylogenetic trees. It reads more tree file formats than other software and supports visualization of phylo, multiPhylo, phylo4, phylo4d, obkData and phyloseq tree objects defined in other R packages.
Abstract: Summary
We present an R package, ggtree, which provides programmable visualization and annotation of phylogenetic trees.
ggtree can read more tree file formats than other software, including Newick, NEXUS, NHX, PHYLIP and jplace formats, and supports visualization of phylo, multiPhylo, phylo4, phylo4d, obkData and phyloseq tree objects defined in other R packages. It can also extract tree-, branch- and node-specific data, along with other data, from the analysis outputs of BEAST, EPA, HyPhy, PAML, PHYLDOG, pplacer, r8s, RAxML and RevBayes, and allows these data to be used to annotate the tree.
The package allows colouring and annotation of a tree by numerical or categorical node attributes; manipulation of a tree by rotating, collapsing and zooming out clades; highlighting of user-selected clades or operational taxonomic units; and exploration of a large tree by zooming into a selected portion.
A two-dimensional tree can be drawn by scaling the tree width based on an attribute of the nodes. A tree can be annotated with an associated numerical matrix (as a heat map), multiple sequence alignment, subplots or silhouette images.
The package ggtree is released under the Artistic-2.0 license. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/ggtree).
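As an illustration of the simplest of the tree formats listed above, the following Python sketch parses a small subset of Newick (labels and branch lengths only, no quoted labels or comments) into nested dictionaries. It is not ggtree code and says nothing about how ggtree reads trees; the function names `parse_newick` and `tip_labels` are assumptions made for this example.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # consume '(' and read the comma-separated children
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                if text[pos] == ")":
                    pos += 1
                    break
                raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        # Optional node label (tip name or internal node name).
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        node["name"] = text[start:pos].strip()
        # Optional branch length after ':'.
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    root = parse_node()
    if pos >= len(text) or text[pos] != ";":
        raise ValueError("Newick string must end with ';'")
    return root


def tip_labels(node):
    """Collect the names of all leaves, left to right."""
    if not node["children"]:
        return [node["name"]]
    labels = []
    for child in node["children"]:
        labels.extend(tip_labels(child))
    return labels


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4);")
print(tip_labels(tree))  # ['A', 'B', 'C']
```

Node-level attributes such as the branch lengths stored here are the kind of data that a tree annotation layer would map to colour or size.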
TL;DR: The R/Bioconductor package scater is developed to facilitate rigorous pre‐processing, quality control, normalization and visualization of scRNA‐seq data and provides a convenient, flexible workflow to process raw sequencing reads into a high‐quality expression dataset ready for downstream analysis.
Abstract: Single-cell RNA sequencing (scRNA-seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre-processing, quality control (QC) and normalization. We have developed the R/Bioconductor package scater to facilitate rigorous pre-processing, quality control, normalization and visualization of scRNA-seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single-cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development. The open-source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater. Contact: davis@ebi.ac.uk. Supplementary data are available at Bioinformatics online.
TL;DR: A significantly updated and improved version of the Bioconductor package ChAMP is presented, which can be used to analyze EPIC and 450k data. Enhanced functionalities include correction for cell-type heterogeneity, network analysis and a series of interactive graphical user interfaces.
Abstract: Summary: The Illumina Infinium HumanMethylationEPIC BeadChip is the new platform for high-throughput DNA methylation analysis, effectively doubling the coverage compared to the older 450 K array. Here we present a significantly updated and improved version of the Bioconductor package ChAMP, which can be used to analyze EPIC and 450k data. Many enhanced functionalities have been added, including correction for cell-type heterogeneity, network analysis and a series of interactive graphical user interfaces. / Availability and implementation: ChAMP is a BioC package available from https://bioconductor.org/packages/release/bioc/html/ChAMP.html. / Contact: a.teschendorff@ucl.ac.uk or s.beck@ucl.ac.uk or a.feber@ucl.ac.uk / Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: The Splatter Bioconductor package is presented for simple, reproducible, and well-documented simulation of scRNA-seq data. It provides an interface to multiple simulation methods, including Splat, the authors' own simulation, which is based on a gamma-Poisson distribution.
Abstract: As single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed, and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here, we present the Splatter Bioconductor package for simple, reproducible, and well-documented simulation of scRNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types, or differentiation paths.
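The gamma-Poisson hierarchy named in the abstract can be sketched as follows: each gene draws a mean from a gamma distribution, and each cell's count for that gene is a Poisson draw around that mean. This Python sketch shows only that two-level idea. It is not the Splat model or the Splatter interface, and the function names and the `shape`, `scale` and `seed` parameters are assumptions chosen for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's method (adequate for small lam)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(n_genes, n_cells, shape=2.0, scale=1.5, seed=0):
    """Return (gene_means, counts) where counts[g][c] ~ Poisson(gene_means[g])."""
    rng = random.Random(seed)
    # Level 1: one gamma-distributed mean per gene.
    gene_means = [rng.gammavariate(shape, scale) for _ in range(n_genes)]
    # Level 2: Poisson counts for every cell around each gene's mean.
    counts = [[poisson(rng, mu) for _ in range(n_cells)] for mu in gene_means]
    return gene_means, counts


means, counts = simulate_counts(n_genes=5, n_cells=10, seed=1)
for mu, row in zip(means, counts):
    print(f"mean={mu:.2f} counts={row}")
```

Fixing the seed makes the simulated matrix reproducible, which is the property the abstract finds missing from many published simulations.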
TL;DR: karyoploteR is an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them. It allows the creation of highly customizable plots from arbitrary data with complete freedom on data positioning and representation.
Abstract: Motivation: Data visualization is a crucial tool for data exploration, analysis and interpretation. For the visualization of genomic data, there is no tool to create customizable non-circular plots of whole genomes from any species. Results: We have developed karyoploteR, an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them. The plot creation process is inspired by R base graphics, with a main function creating karyoplots with no data and multiple additional functions, including custom functions written by the end-user, adding data and other graphical elements. This approach allows the creation of highly customizable plots from arbitrary data with complete freedom on data positioning and representation. Availability and implementation: karyoploteR is released under the Artistic-2.0 License. Source code and documentation are freely available through Bioconductor (http://www.bioconductor.org/packages/karyoploteR) and at the examples and tutorial page at https://bernatgel.github.io/karyoploter_tutorial. Contact: bgel@igtp.cat.
TL;DR: A web application is implemented that uses key functions of R‐package SynergyFinder, and provides not only the flexibility of using multiple synergy scoring models, but also a user‐friendly interface for visualizing the drug combination landscapes in an interactive manner.
Abstract: Summary: Rational design of drug combinations has become a promising strategy to tackle the drug sensitivity and resistance problem in cancer treatment. To systematically evaluate the pre-clinical significance of pairwise drug combinations, functional screening assays that probe combination effects in a dose-response matrix assay are commonly used. To facilitate the analysis of such drug combination experiments, we implemented a web application that uses key functions of R-package SynergyFinder, and provides not only the flexibility of using multiple synergy scoring models, but also a user-friendly interface for visualizing the drug combination landscapes in an interactive manner. Availability and implementation: The SynergyFinder web application is freely accessible at https://synergyfinder.fimm.fi; the R-package and its source code are freely available at http://bioconductor.org/packages/release/bioc/html/synergyfinder.html. Contact: jing.tang@helsinki.fi.
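TL;DR: curatedMetagenomicData is a Bioconductor and command-line resource that provides thousands of metagenomic profiles from the Human Microbiome Project and other publicly available datasets, with standardized per-participant metadata, distributed through ExperimentHub and ready for analysis without preprocessing.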
Abstract: We present curatedMetagenomicData, a Bioconductor and command-line resource providing thousands of metagenomic profiles from the Human Microbiome Project and other publicly available datasets, and ExperimentHub, for convenient cloud-based distribution of data to the R desktop. curatedMetagenomicData provides standardized per-participant metadata linked to bacterial, fungal, archaeal, and viral taxonomic abundances, as well as quantitative metabolic functional profiles, generated by the HUMAnN2 and MetaPhlAn2 pipelines. The resulting datasets can be immediately analyzed with a wide range of statistical methods, requiring a minimum of bioinformatic expertise and no preprocessing of data. We demonstrate exploratory data analysis, an investigation of gut "enterotypes", and a comparison of the accuracy of disease classification from different data types. These documented analyses can be reproduced efficiently on a laptop, without the barriers of working with large-scale, raw sequencing data. The development of curatedMetagenomicData will continue with the addition, curation, and analysis of further microbiome datasets.
TL;DR: The annotatr Bioconductor package is developed to flexibly and quickly summarize and plot annotations of genomic regions, giving a better understanding of the genomic context of the regions.
Abstract: Motivation: Analysis of next-generation sequencing data often results in a list of genomic regions. These may include differentially methylated CpGs/regions, transcription factor binding sites, interacting chromatin regions, or GWAS-associated SNPs, among others. A common analysis step is to annotate such genomic regions to genomic annotations (promoters, exons, enhancers, etc.). Existing tools are limited by a lack of annotation sources and flexible options, the time it takes to annotate regions, an artificial one-to-one region-to-annotation mapping, a lack of visualization options to easily summarize data, or some combination thereof. Results: We developed the annotatr Bioconductor package to flexibly and quickly summarize and plot annotations of genomic regions. The annotatr package reports all intersections of regions and annotations, giving a better understanding of the genomic context of the regions. A variety of graphics functions are implemented to easily plot numerical or categorical data associated with the regions across the annotations, and across annotation intersections, providing insight into how characteristics of the regions differ across the annotations. We demonstrate that annotatr is up to 27× faster than comparable R packages. Overall, annotatr enables a richer biological interpretation of experiments. Availability and implementation: http://bioconductor.org/packages/annotatr/ and https://github.com/rcavalcante/annotatr. Contact: rcavalca@umich.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
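The difference between a one-to-one mapping and reporting all intersections can be made concrete with a small Python sketch. It is a naive quadratic overlap scan over half-open intervals, not annotatr's implementation or API; the tuple layout, the function name `annotate_regions` and the example coordinates are all assumptions made for illustration.

```python
def annotate_regions(regions, annotations):
    """Return every (region, annotation) pair that overlaps, not only a single best hit.

    regions:     iterable of (chrom, start, end, region_name)
    annotations: iterable of (chrom, start, end, annotation_type)
    Intervals are half-open: [start, end).
    """
    hits = []
    for r_chrom, r_start, r_end, r_name in regions:
        for a_chrom, a_start, a_end, a_type in annotations:
            # Two half-open intervals overlap when each starts before the other ends.
            if r_chrom == a_chrom and r_start < a_end and a_start < r_end:
                hits.append((r_name, a_type))
    return hits


regions = [("chr1", 100, 200, "dmr1"), ("chr1", 500, 600, "dmr2")]
annotations = [
    ("chr1", 50, 150, "promoter"),
    ("chr1", 120, 400, "exon"),
    ("chr2", 100, 200, "enhancer"),
]
print(annotate_regions(regions, annotations))
# [('dmr1', 'promoter'), ('dmr1', 'exon')]
```

Here `dmr1` is reported against both the promoter and the exon, whereas a one-to-one mapping would keep only one of them.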
TL;DR: An updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages is presented, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled.
Abstract: High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signalling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models or linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals).
TL;DR: DAPAR and ProStaR are software tools to perform the statistical analysis of label-free XIC-based quantitative discovery proteomics experiments. They contain procedures to filter, normalize, impute missing values, aggregate peptide intensities, perform null hypothesis significance tests and select the most likely differentially abundant proteins with a corresponding false discovery rate.
Abstract: DAPAR and ProStaR are software tools to perform the statistical analysis of label-free XIC-based quantitative discovery proteomics experiments. DAPAR contains procedures to filter, normalize, impute missing values, aggregate peptide intensities, perform null hypothesis significance tests and select the most likely differentially abundant proteins with a corresponding false discovery rate. ProStaR is a graphical user interface that allows friendly access to the DAPAR functionalities through a web browser. Availability and implementation: DAPAR and ProStaR are implemented in the R language and are available on the website of the Bioconductor project (http://www.bioconductor.org/). A complete tutorial and a toy dataset accompany the packages. Contact: samuel.wieczorek@cea.fr, florence.combes@cea.fr, thomas.burger@cea.fr.
TL;DR: CancerSubtypes is an R package for identifying cancer subtypes using multi‐omics data, including gene expression, miRNA expression and DNA methylation data that provides a standardized framework for data pre‐processing, feature selection, and result follow‐up analyses.
Abstract: Summary: Identifying molecular cancer subtypes from multi-omics data is an important step in personalized medicine. We introduce CancerSubtypes, an R package for identifying cancer subtypes using multi-omics data, including gene expression, miRNA expression and DNA methylation data. CancerSubtypes integrates four main computational methods which are highly cited for cancer subtype identification and provides a standardized framework for data pre-processing, feature selection, and result follow-up analyses, including results computing, biology validation and visualization. The input and output of each step in the framework are packaged in the same data format, making it convenient to compare different methods. The package is useful for inferring cancer subtypes from an input genomic dataset, comparing the predictions from different well-known methods and testing new subtype discovery methods, as shown with different application scenarios in the Supplementary Material. Availability and implementation: The package is implemented in R and available under the GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/CancerSubtypes/). Contact: thuc.le@unisa.edu.au or jiuyong.li@unisa.edu.au. Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: The open‐source Glimma package creates interactive graphics for exploring gene expression analysis with a few simple R commands, and extends popular plots found in the limma package, to allow individual data points to be queried and additional annotation information to be displayed upon hovering or selecting particular points.
Abstract: Motivation: Summary graphics for RNA-sequencing and microarray gene expression analyses may contain upwards of tens of thousands of points. Details about certain genes or samples of interest are easily obscured in such dense summary displays. Incorporating interactivity into summary plots would enable additional information to be displayed on demand and facilitate intuitive data exploration. Results: The open-source Glimma package creates interactive graphics for exploring gene expression analysis with a few simple R commands. It extends popular plots found in the limma package, such as multi-dimensional scaling plots and mean-difference plots, to allow individual data points to be queried and additional annotation information to be displayed upon hovering or selecting particular points. It also offers links between plots so that more information can be revealed on demand. Glimma is widely applicable, supporting data analyses from a number of well-established Bioconductor workflows (limma, edgeR and DESeq2) and uses D3/JavaScript to produce HTML pages with interactive displays that enable more effective data exploration by end-users. Results from Glimma can be easily shared between bioinformaticians and biologists, enhancing reporting capabilities while maintaining reproducibility. Availability and implementation: The Glimma R package is available from http://bioconductor.org/packages/Glimma/. Contact: su.s@wehi.edu.au, law@wehi.edu.au or mritchie@wehi.edu.au.
TL;DR: A newly developed Bioconductor package for identifying potential quadruplex-forming sequences (PQS) is described. It allows for sequence searches that accommodate possible divergences from the optimal G4 base composition, and the algorithm behind the searches is shown to have 96% accuracy on known G4 structures.
Abstract: Motivation: G-quadruplexes (G4s) are one of the non-B DNA structures easily observed in vitro and assumed to form in vivo. The latest experiments with G4-specific antibodies and G4-unwinding helicase mutants confirm this conjecture. These four-stranded structures have also been shown to influence a range of molecular processes in cells. As G4s are intensively studied, it is often desirable to screen DNA sequences and pinpoint the precise locations where they might form. Results: We describe and have tested a newly developed Bioconductor package for identifying potential quadruplex-forming sequences (PQS). The package is easy to use, flexible and customizable. It allows for sequence searches that accommodate possible divergences from the optimal G4 base composition. A novel aspect of our research was the creation and training (parametrization) of an advanced scoring model which resulted in increased precision compared to similar tools. We demonstrate that the algorithm behind the searches has a 96% accuracy on 392 currently known and experimentally observed G4 structures. We also carried out searches against the recent G4-seq data to verify how well we can identify the structures detected by that technology. The correlation with pqsfinder predictions was 0.622, higher than the correlation of 0.491 obtained with the second-best tool, G4Hunter. Availability: http://bioconductor.org/packages/pqsfinder/. This paper is based on pqsfinder-1.4.1.
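For contrast with the divergence-tolerant search described above, the strict textbook G4 motif (four runs of at least three guanines separated by loops of one to seven bases) can be matched with a plain regular expression. The Python sketch below implements only that strict motif on the given strand. It has no scoring model, tolerates no mismatches or bulges, and is not the pqsfinder algorithm; the function name and the loop-length bounds are assumptions for this example.

```python
import re

# Four G-runs (>= 3 Gs each) separated by three loops of 1-7 bases.
CANONICAL_G4 = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}"
)


def find_canonical_g4(sequence):
    """Return (start, end, motif) for each non-overlapping strict G4 match."""
    sequence = sequence.upper()
    return [
        (m.start(), m.end(), m.group())
        for m in CANONICAL_G4.finditer(sequence)
    ]


print(find_canonical_g4("TTGGGAGGGTGGGAGGGTT"))
# [(2, 17, 'GGGAGGGTGGGAGGG')]
```

A sequence with a single imperfect G-run, such as `GGGAGGGTGAGAGGG`, is missed by this pattern, which is the kind of case an imperfection-aware search is meant to recover.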
TL;DR: A Bioconductor R package for performing ROTS analysis conveniently on different types of omics data is introduced, and three case studies, involving proteomics and RNA-seq data from public repositories, are presented.
Abstract: Differential expression analysis is one of the most common types of analyses performed on various biological data (e.g. RNA-seq or mass spectrometry proteomics). It is the process that detects features, such as genes or proteins, showing statistically significant differences between the sample groups under comparison. A major challenge in the analysis is the choice of an appropriate test statistic, as different statistics have been shown to perform well in different datasets. To this end, the reproducibility-optimized test statistic (ROTS) adjusts a modified t-statistic according to the inherent properties of the data and provides a ranking of the features based on their statistical evidence for differential expression between two groups. ROTS has already been successfully applied in a range of different studies from transcriptomics to proteomics, showing competitive performance against other state-of-the-art methods. To promote its widespread use, we introduce here a Bioconductor R package for performing ROTS analysis conveniently on different types of omics data. To illustrate the benefits of ROTS in various applications, we present three case studies, involving proteomics and RNA-seq data from public repositories, including both bulk and single cell data. The package is freely available from Bioconductor (https://www.bioconductor.org/packages/ROTS).
TL;DR: GRcalculator is a powerful, user-friendly, and free tool that provides a unified platform for investigators to analyze dose–response data across diverse cell types and perturbagens and facilitates inclusion of GR metrics calculations within existing R analysis pipelines.
Abstract: Quantifying the response of cell lines to drugs or other perturbagens is the cornerstone of pre-clinical drug development and pharmacogenomics as well as a means to study factors that contribute to sensitivity and resistance. In dividing cells, traditional metrics derived from dose–response curves, such as IC50, AUC, and Emax, are confounded by the number of cell divisions taking place during the assay, which varies widely for biological and experimental reasons. Hafner et al. (Nat Meth 13:521–627, 2016) recently proposed an alternative way to quantify drug response, normalized growth rate (GR) inhibition, that is robust to such confounders. Adoption of the GR method is expected to improve the reproducibility of dose–response assays and the reliability of pharmacogenomic associations (Hafner et al. 500–502, 2017). We describe here an interactive website (www.grcalculator.org) for calculation, analysis, and visualization of dose–response data using the GR approach and for comparison of GR and traditional metrics. Data can be user-supplied or derived from published datasets. The web tools are implemented in the form of three integrated Shiny applications (grcalculator, grbrowser, and grtutorial) deployed through a Shiny server. Intuitive graphical user interfaces (GUIs) allow for interactive analysis and visualization of data. The Shiny applications make use of two R packages (shinyLi and GRmetrics) specifically developed for this purpose. The GRmetrics R package is also available via Bioconductor and can be used for offline data analysis and visualization. Source code for the Shiny applications and associated packages (shinyLi and GRmetrics) can be accessed at www.github.com/uc-bd2k/grcalculator and www.github.com/datarail/gr_metrics. GRcalculator is a powerful, user-friendly, and free tool to facilitate analysis of dose–response data. It generates publication-ready figures and provides a unified platform for investigators to analyze dose–response data across diverse cell types and perturbagens (including drugs, biological ligands, RNAi, etc.). GRcalculator also provides access to data collected by the NIH LINCS Program (http://www.lincsproject.org/) and other public domain datasets. The GRmetrics Bioconductor package provides computationally trained users with a platform for offline analysis of dose–response data and facilitates inclusion of GR metrics calculations within existing R analysis pipelines. These tools are therefore well suited to users in academia as well as industry.
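The abstract does not spell out the GR calculation. As commonly stated for the method of Hafner et al., the GR value at a given concentration compares the growth rate of treated cells to that of untreated controls: GR = 2^(log2(x_treated / x_initial) / log2(x_control / x_initial)) - 1, where x_initial is the cell count at the time of treatment. The Python sketch below implements that single formula under this assumption. It is not the GRmetrics package interface, and the function and argument names are hypothetical.

```python
import math


def gr_value(x_treated, x_control, x_initial):
    """Normalized growth rate inhibition for one condition.

    x_treated: cell count at the end of the assay, with drug
    x_control: cell count at the end of the assay, untreated
    x_initial: cell count at the time of treatment
    """
    if min(x_treated, x_control, x_initial) <= 0:
        raise ValueError("cell counts must be positive")
    if x_control == x_initial:
        raise ValueError("control did not grow; GR is undefined")
    treated_doublings = math.log2(x_treated / x_initial)
    control_doublings = math.log2(x_control / x_initial)
    return 2 ** (treated_doublings / control_doublings) - 1


# Control grows from 100 to 400 cells (two doublings) during the assay.
print(gr_value(400, 400, 100))            # 1.0: no effect
print(round(gr_value(200, 400, 100), 3))  # 0.414: partial growth inhibition
print(gr_value(100, 400, 100))            # 0.0: complete cytostasis
print(round(gr_value(50, 400, 100), 3))   # -0.293: net cell death
```

Because both numerator and denominator are expressed in doublings over the same assay, the value does not depend on how many divisions the control cells undergo, which is the confounder the abstract describes for IC50, AUC and Emax.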
TL;DR: The MultiAssayExperiment Bioconductor package reduces major obstacles to efficient, scalable, and reproducible statistical analysis of multiomics data and enhances data science applications of multiple omics datasets.
TL;DR: The derfinder software is presented, implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, and introducing a flexible statistical modeling framework, including multi-group and time-course analyses.
Abstract: Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
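The general idea of turning a base-resolution signal into contiguous regions without any gene annotation can be sketched in a few lines of Python. The example below segments a per-base coverage vector into runs that exceed a cutoff. It illustrates only that segmentation step, not derfinder's bump-hunting statistics or its functions; the name `expressed_regions` and the cutoff value are assumptions for illustration.

```python
def expressed_regions(coverage, cutoff):
    """Return half-open (start, end) runs of consecutive bases with coverage > cutoff."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > cutoff and start is None:
            start = pos  # a new region opens at this base
        elif depth <= cutoff and start is not None:
            regions.append((start, pos))  # the region closed at the previous base
            start = None
    if start is not None:
        # The signal was still above the cutoff at the last base.
        regions.append((start, len(coverage)))
    return regions


coverage = [0, 0, 5, 6, 7, 0, 1, 8, 9, 0]
print(expressed_regions(coverage, cutoff=2))  # [(2, 5), (7, 9)]
```

Regions found this way are defined by the data alone, so they can fall outside known gene boundaries, as the abstract reports for the GTEx and BrainSpan analyses.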
TL;DR: This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data using experiments that profile 3 distinct cell populations important for studying the origins of breast cancer.
Abstract: Gene set enrichment analysis is a popular approach for prioritising the biological processes perturbed in genomic datasets. The Bioconductor project hosts over 80 software packages capable of gene set analysis. Most of these packages search for enriched signatures amongst differentially regulated genes to reveal higher level biological themes that may be missed when focusing only on evidence from individual genes. With so many different methods on offer, choosing the best algorithm and visualization approach can be challenging. The EGSEA package solves this problem by combining results from up to 12 prominent gene set testing algorithms to obtain a consensus ranking of biologically relevant results. This workflow demonstrates how EGSEA can extend limma-based differential expression analyses for RNA-seq and microarray data using experiments that profile 3 distinct cell populations important for studying the origins of breast cancer. Following data normalization and set-up of an appropriate linear model for differential expression analysis, EGSEA builds gene signature specific indexes that link a wide range of mouse or human gene set collections obtained from MSigDB, GeneSetDB and KEGG to the gene expression data being investigated. EGSEA is then configured and the ensemble enrichment analysis run, returning an object that can be queried using several S4 methods for ranking gene sets and visualizing results via heatmaps, KEGG pathway views, GO graphs, scatter plots and bar plots. Finally, an HTML report that combines these displays can fast-track the sharing of results with collaborators, and thus expedite downstream biological validation. EGSEA is simple to use and can be easily integrated with existing gene expression analysis pipelines for both human and mouse data.
TL;DR: QSEA is a workflow for modelling and quantifying MeDIP-seq data. Its central part is a Bayesian statistical model that transforms the enrichment read counts to absolute levels of methylation and thus enhances interpretability and facilitates comparison with other methylation assays.
Abstract: Genome-wide enrichment of methylated DNA followed by sequencing (MeDIP-seq) offers a reasonable compromise between experimental costs and genomic coverage. However, the computational analysis of these experiments is complex, and quantification of the enrichment signals in terms of absolute levels of methylation requires specific transformation. In this work, we present QSEA, Quantitative Sequence Enrichment Analysis, a comprehensive workflow for the modelling and subsequent quantification of MeDIP-seq data. As the central part of the workflow we have developed a Bayesian statistical model that transforms the enrichment read counts to absolute levels of methylation and, thus, enhances interpretability and facilitates comparison with other methylation assays. We suggest several calibration strategies for the critical parameters of the model, either using additional data or fairly general assumptions. By comparing the results with bisulfite sequencing (BS) validation data, we show the improvement of QSEA over existing methods. Additionally, we generated a clinically relevant benchmark data set consisting of methylation enrichment experiments (MeDIP-seq), BS-based validation experiments (Methyl-seq) as well as gene expression experiments (RNA-seq) derived from non-small cell lung cancer patients, and show that the workflow retrieves well-known lung tumour methylation markers that are causative for gene expression changes, demonstrating the applicability of QSEA for clinical studies. QSEA is implemented in R and available from the Bioconductor repository 3.4 (www.bioconductor.org/packages/qsea).
TL;DR: This work investigates the use of data transformations in conjunction with Gaussian mixture models for RNA-seq co-expression analyses, as well as a penalized model selection criterion to select both an appropriate transformation and number of clusters present in the data.
Abstract: Although a large number of clustering algorithms have been proposed to identify groups of co-expressed genes from microarray data, the question of if and how such methods may be applied to RNA sequencing (RNA-seq) data remains unaddressed. In this work, we investigate the use of data transformations in conjunction with Gaussian mixture models for RNA-seq co-expression analyses, as well as a penalized model selection criterion to select both an appropriate transformation and number of clusters present in the data. This approach has the advantage of accounting for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In addition, it provides a rigorous statistical framework for parameter estimation, an objective assessment of data transformations and number of clusters and the possibility of performing diagnostic checks on the quality and homogeneity of the identified clusters. We analyze four varied RNA-seq data sets to illustrate the use of transformations and model selection in conjunction with Gaussian mixture models. Finally, we propose a Bioconductor package coseq (co-expression of RNA-seq data) to facilitate implementation and visualization of the recommended RNA-seq co-expression analyses.
TL;DR: It is shown that scone is able to correctly rank normalization methods according to their performance in a given dataset and that selecting the best performing normalization leads to higher agreement with independent validation data than lowly-ranked methods.
Abstract: Systematic measurement biases make data normalization an essential preprocessing step in single-cell RNA sequencing (scRNA-seq) analysis. There may be multiple, competing considerations behind the assessment of normalization performance, some of them study-specific. Because normalization can have a large impact on downstream results (e.g., clustering and differential expression), it is critically important that practitioners assess the performance of competing methods. We have developed scone - a flexible framework for assessing normalization performance based on a comprehensive panel of data-driven metrics. Through graphical summaries and quantitative reports, scone summarizes performance trade-offs and ranks large numbers of normalization methods by aggregate panel performance. The method is implemented in the open-source Bioconductor R software package scone. We demonstrate the effectiveness of scone on a collection of scRNA-seq datasets, generated with different protocols, including Fluidigm C1 and 10x platforms. We show that top-performing normalization methods lead to better agreement with independent validation data.
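The idea of ranking normalization methods by aggregate panel performance can be illustrated with a minimal sketch. This is not scone's actual scoring procedure: the function `rank_methods`, the method and metric names, and the toy scores are all hypothetical, and the sketch assumes that a higher score is better on every metric and that there are no ties.

```python
def rank_methods(scores):
    """Order methods by their mean rank across all metrics (best first).

    `scores` maps method name -> {metric name -> score}, higher is better.
    Hypothetical helper for illustration; not part of the scone package.
    """
    methods = list(scores)
    metrics = list(next(iter(scores.values())))
    mean_rank = {m: 0.0 for m in methods}
    for metric in metrics:
        # Rank 1 goes to the method with the highest score on this metric.
        ordered = sorted(methods, key=lambda m: scores[m][metric], reverse=True)
        for position, method in enumerate(ordered, start=1):
            mean_rank[method] += position / len(metrics)
    # A lower mean rank means better aggregate performance.
    return sorted(methods, key=lambda m: mean_rank[m])


# Toy, made-up scores for three placeholder methods on three placeholder metrics.
toy_scores = {
    "method_a": {"metric_1": 0.9, "metric_2": 0.4, "metric_3": 0.5},
    "method_b": {"metric_1": 0.7, "metric_2": 0.8, "metric_3": 0.6},
    "method_c": {"metric_1": 0.2, "metric_2": 0.1, "metric_3": 0.3},
}
ranking = rank_methods(toy_scores)
print(ranking)
```

Averaging ranks rather than raw scores keeps metrics on different scales from dominating the aggregate, which is one simple way to combine a heterogeneous panel.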
TL;DR: The GUIDEseq package enables analysis of GUIDE-seq data from various nuclease platforms for any species with a defined genomic sequence and annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization.
Abstract: Genome editing technologies developed around the CRISPR-Cas9 nuclease system have facilitated the investigation of a broad range of biological questions. These nucleases also hold tremendous promise for treating a variety of genetic disorders. In the context of their therapeutic application, it is important to identify the spectrum of genomic sequences that are cleaved by a candidate nuclease when programmed with a particular guide RNA, as well as the cleavage efficiency of these sites. Powerful new experimental approaches, such as GUIDE-seq, facilitate the sensitive, unbiased genome-wide detection of nuclease cleavage sites within the genome. Flexible bioinformatics analysis tools for processing GUIDE-seq data are needed. Here, we describe an open source, open development software suite, GUIDEseq, for GUIDE-seq data analysis and annotation as a Bioconductor package in R. The GUIDEseq package provides a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to different nuclease platforms with different length and complexity in their guide and PAM recognition sequences or their DNA cleavage position. They also enable users to customize sequence aggregation criteria, and vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions. For each identified off-target, the GUIDEseq package outputs mapped GUIDE-seq read count as well as cleavage score from a user specified off-target cleavage score prediction algorithm permitting the identification of genomic sequences with unexpected cleavage activity. The GUIDEseq package enables analysis of GUIDE-seq data from various nuclease platforms for any species with a defined genomic sequence. This software package has been used successfully to analyze several GUIDE-seq datasets. The software, source code and documentation are freely available at http://www.bioconductor.org/packages/release/bioc/html/GUIDEseq.html.
TL;DR: Package GSAR provides a set of multivariate non-parametric statistical methods that test a complex null hypothesis against specific alternatives, applicable to any type of omics data that can be represented in a matrix format.
Abstract: Gene set analysis (in the form of functionally related genes or pathways) has become the method of choice for analyzing omics data in general and gene expression data in particular. There are many statistical methods that either summarize gene-level statistics for a gene set or apply a multivariate statistic that accounts for intergene correlations. Most available methods detect complex departures from the null hypothesis but lack the ability to identify the specific alternative hypothesis that rejects the null. GSAR (Gene Set Analysis in R) is an open-source R/Bioconductor software package for gene set analysis (GSA). It implements self-contained multivariate non-parametric statistical methods testing a complex null hypothesis against specific alternatives, such as differences in mean (shift), variance (scale), or net correlation structure. The package also provides a graphical visualization tool, based on the union of two minimum spanning trees, for correlation networks to examine the change in the correlation structures of a gene set between two conditions and highlight influential genes (hubs). Package GSAR provides a set of multivariate non-parametric statistical methods that test a complex null hypothesis against specific alternatives. The methods in package GSAR are applicable to any type of omics data that can be represented in a matrix format. The package, with detailed instructions and examples, is freely available under the GPL (>= 2) license from the Bioconductor web site.
TL;DR: An R package, designed for speed, versatility, provable correctness and large set sizes, combines known algorithms and innovative methods for the efficient, flexible and near‐optimal generation of robust barcode sets.
Abstract: Motivation: DNA barcodes are commonly used for counting and discriminating purposes in molecular and cell biology. Not every set of DNA sequences is equally suitable for this goal. There is a growing demand for more sophisticated barcode designs, with only few tools available. We prepared an R package that combines known algorithms and innovative methods for the efficient, flexible and near-optimal generation of robust barcode sets. Results: Our R-software package 'DNABarcodes' generates sets of DNA barcodes from a few basic input parameters (e.g. length, distance metric, minimum distance, chemical properties). It satisfies the specifics of most particular experimental demands in de novo design of barcodes. Additionally, the package allows analysing existing sets of DNA barcodes as well as the generation of subsets of those existing sets to improve their error correction and detection properties. 'DNABarcodes' was designed for speed, versatility, provable correctness and large set sizes. Availability and implementation: The DNABarcodes R package is available from Bioconductor at http://bioconductor.org/packages/DNABarcodes under the GPL-2 license. Contact: tilo.buschmann@izi.fraunhofer.de. Supplementary information: Supplementary data are available at Bioinformatics online.
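To make the role of the input parameters (length, distance metric, minimum distance) concrete, here is a minimal sketch of one naive way to build a barcode set. It is not the DNABarcodes implementation and does not use its API: it assumes the Hamming distance as the metric, ignores chemical properties, and uses a simple greedy scan in lexicographic order. The names `hamming` and `greedy_barcodes` are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_barcodes(length, min_dist, alphabet="ACGT"):
    """Greedily collect sequences that keep at least `min_dist` from all kept ones.

    Candidates are visited in lexicographic order. The result is valid and
    cannot be extended, but it is not guaranteed to be the largest possible set.
    """
    chosen = []
    for candidate in product(alphabet, repeat=length):
        seq = "".join(candidate)
        if all(hamming(seq, kept) >= min_dist for kept in chosen):
            chosen.append(seq)
    return chosen


# Barcodes of length 4 with pairwise Hamming distance of at least 3.
barcodes = greedy_barcodes(4, 3)
print(len(barcodes), barcodes[:3])
```

A minimum distance of 3 means any single substitution error leaves a read closer to its original barcode than to any other, which is why the minimum distance governs error detection and correction.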
TL;DR: bcbioRNASeq, a Bioconductor package that provides ready-to-render templates, objects and wrapper functions to post-process bcbio RNA sequencing output data, helps automate the generation of high-level RNA-seq reports, facilitating the quality control analyses, identification of differentially expressed genes and functional enrichment analyses.
Abstract: RNA-seq analysis involves multiple steps, from processing raw sequencing data to identifying, organizing, annotating, and reporting differentially expressed genes. bcbio is an open source, community-maintained framework providing automated and scalable RNA-seq methods for identifying gene abundance counts. We have developed bcbioRNASeq, a Bioconductor package that provides ready-to-render templates, objects and wrapper functions to post-process bcbio RNA sequencing output data. bcbioRNASeq helps automate the generation of high-level RNA-seq reports, facilitating the quality control analyses, identification of differentially expressed genes and functional enrichment analyses.
TL;DR: CuratedMetagenomicData provides standardized per-participant metadata linked to bacterial, fungal, archaeal, and viral taxonomic abundances, as well as quantitative metabolic functional profiles, generated by the HUMAnN2 and MetaPhlAn2 pipelines.
Abstract: We present curatedMetagenomicData, a Bioconductor and command-line resource providing thousands of metagenomic profiles from the Human Microbiome Project and other publicly available datasets, and ExperimentHub, for convenient cloud-based distribution of data to the R desktop. curatedMetagenomicData provides standardized per-participant metadata linked to bacterial, fungal, archaeal, and viral taxonomic abundances, as well as quantitative metabolic functional profiles, generated by the HUMAnN2 and MetaPhlAn2 pipelines. The resulting datasets can be immediately analyzed with a wide range of statistical methods, requiring a minimum of bioinformatic expertise and no preprocessing of data. We demonstrate exploratory data analysis, an investigation of gut "enterotypes", and a comparison of the accuracy of disease classification from different data types. These documented analyses can be reproduced efficiently on a laptop, without the barriers of working with large-scale, raw sequencing data. The development of curatedMetagenomicData will continue with the addition, curation, and analysis of further microbiome datasets.
TL;DR: MutationalPatterns is an R/Bioconductor package that characterizes a broad range of mutational patterns and potential relations with (epi-)genomic features and offers an efficient method to quantify the contribution of known mutational signatures.
Abstract: Base substitution catalogs represent historical records of mutational processes that have been active in a system. Such processes can be distinguished by typical characteristics, like mutation type, sequence context, transcriptional and replicative strand bias, and distribution throughout the genome. MutationalPatterns is an R/Bioconductor package that characterizes this broad range of mutational patterns and potential relations with (epi-)genomic features. Furthermore, it offers an efficient method to quantify the contribution of known mutational signatures. Such analyses can be used to determine whether certain DNA repair mechanisms are perturbed and to further characterize the processes underlying known mutational signatures.
Keywords: R, Base substitutions, Somatic mutations, Mutational signatures, Mutational processes, Transcriptional strand bias.
Availability and implementation: The MutationalPatterns R package is freely available for download at https://www.bioconductor.org/packages/release/bioc/html/MutationalPatterns.html. The package documentation provides a detailed description of typical analysis workflows.
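As an illustration of what characterizing mutations by type involves, the sketch below counts base substitutions after collapsing the twelve possible changes into six classes by reporting each one relative to the pyrimidine (C or T) of the reference base pair, a widely used convention. This is not the MutationalPatterns API: the function `substitution_type` and the toy variant list are hypothetical, and sequence context and strand bias are left out.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_type(ref, alt):
    """Return the substitution class, e.g. 'C>T', with a pyrimidine reference base."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in "AG":
        # Purine reference: report the change as seen on the opposite strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


# Toy, made-up (reference, alternative) base pairs.
variants = [("C", "T"), ("G", "A"), ("A", "G"), ("T", "C"), ("C", "A")]
spectrum = Counter(substitution_type(ref, alt) for ref, alt in variants)
print(dict(spectrum))
```

Here G>A is counted as C>T and A>G as T>C, because each pair describes the same change on opposite strands.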
TL;DR: MultiDataSet is a suitable class for data integration under the R and Bioconductor framework that deals with the usual difficulties of managing multiple and non-complete data sets while offering a simple and general way of subsetting features and selecting samples.
Abstract: Reduction in the cost of genomic assays has generated large amounts of biomedical-related data. As a result, current studies perform multiple experiments in the same subjects. While Bioconductor’s methods and classes implemented in different packages manage individual experiments, there is not a standard class to properly manage different omic datasets from the same subjects. In addition, most R/Bioconductor packages that have been designed to integrate and visualize biological data often use basic data structures with no clear general methods, such as subsetting or selecting samples. To cover this need, we have developed MultiDataSet, a new R class based on Bioconductor standards, designed to encapsulate multiple data sets. MultiDataSet deals with the usual difficulties of managing multiple and non-complete data sets while offering a simple and general way of subsetting features and selecting samples. We illustrate the use of MultiDataSet in three common situations: 1) performing integration analysis with third party packages; 2) creating new methods and functions for omic data integration; 3) encapsulating new unimplemented data from any biological experiment. MultiDataSet is a suitable class for data integration under R and Bioconductor framework.
TL;DR: PanViz allows visualization of changes in gene group classification as different subsets of pangenomes are selected, as well as comparisons of individual genomes to pangenomes with gene ontology based navigation of gene groups.
Abstract: Summary: PanViz is a novel, interactive, visualization tool for pangenome analysis. PanViz allows visualization of changes in gene group (groups of similar genes across genomes) classification as different subsets of pangenomes are selected, as well as comparisons of individual genomes to pangenomes with gene ontology based navigation of gene groups. Furthermore, it allows for rich and complex visual querying of gene groups in the pangenome. PanViz visualizations require no external programs and are easily sharable, allowing for rapid pangenome analyses. Availability and implementation: PanViz is written entirely in JavaScript and is available on https://github.com/thomasp85/PanViz . A companion R package that facilitates the creation of PanViz visualizations from a range of data formats is released through Bioconductor and is available at https://bioconductor.org/packages/PanVizGenerator . Contact: thomasp85@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.