TL;DR: An R Bioconductor package, Maftools, is described, which offers a multitude of analysis and visualization modules that are commonly used in cancer genomic studies, including driver gene identification, pathway, signature, enrichment, and association analyses, and is independent of larger alignment files.
Abstract: Numerous large-scale genomic studies of matched tumor-normal samples have established the somatic landscapes of most cancer types. However, the downstream analysis of data from somatic mutations entails a number of computational and statistical approaches, requiring usage of independent software and numerous tools. Here, we describe an R Bioconductor package, Maftools, which offers a multitude of analysis and visualization modules that are commonly used in cancer genomic studies, including driver gene identification, pathway, signature, enrichment, and association analyses. Maftools only requires somatic variants in Mutation Annotation Format (MAF) and is independent of larger alignment files. With the implementation of well-established statistical and computational methods, Maftools facilitates data-driven research and comparative analysis to discover novel results from publicly available data sets. In the present study, using three of the well-annotated cohorts from The Cancer Genome Atlas (TCGA), we describe the application of Maftools to reproduce known results. More importantly, we show that Maftools can also be used to uncover novel findings through integrative analysis.
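To make concrete what "only requires somatic variants in MAF" implies, here is a minimal Python sketch that summarises a MAF-style table per gene. It is not the Maftools API (Maftools is an R package). The example rows are invented, and only three standard MAF columns are used (`Hugo_Symbol`, `Variant_Classification`, `Tumor_Sample_Barcode`). The function name `mutated_samples_per_gene` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Tiny invented MAF excerpt (tab-separated); real MAF files carry many more columns.
EXAMPLE_MAF = (
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tS1\n"
    "TP53\tNonsense_Mutation\tS2\n"
    "KRAS\tMissense_Mutation\tS1\n"
    "TP53\tMissense_Mutation\tS1\n"
)


def mutated_samples_per_gene(maf_text):
    """Return {gene: number of distinct samples carrying at least one variant}."""
    samples = defaultdict(set)
    reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t")
    for row in reader:
        samples[row["Hugo_Symbol"]].add(row["Tumor_Sample_Barcode"])
    return {gene: len(sample_set) for gene, sample_set in samples.items()}


counts = mutated_samples_per_gene(EXAMPLE_MAF)
# Print genes from most to least frequently mutated.
for gene, n_samples in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(gene, n_samples)
```

Because the summary is computed from the variant table alone, no alignment files are touched at any point.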
TL;DR: iDEP, a web application that connects 63 R/Bioconductor packages, 2 web services, and comprehensive annotation and pathway databases for 220 plant and animal species, helps unveil the multifaceted functions of p53 and the possible involvement of several microRNAs such as miR-92a, miR-504, and miR-30a.
Abstract: RNA-seq is widely used for transcriptomic profiling, but the bioinformatics analysis of resultant data can be time-consuming and challenging, especially for biologists. We aim to streamline the bioinformatic analyses of gene-level data by developing a user-friendly, interactive web application for exploratory data analysis, differential expression, and pathway analysis. iDEP (integrated Differential Expression and Pathway analysis) seamlessly connects 63 R/Bioconductor packages, 2 web services, and comprehensive annotation and pathway databases for 220 plant and animal species. The workflow can be reproduced by downloading customized R code and related pathway files. As an example, we analyzed an RNA-Seq dataset of lung fibroblasts with Hoxa1 knockdown and revealed the possible roles of SP1 and E2F1 and their target genes, including microRNAs, in blocking G1/S transition. In another example, our analysis shows that in mouse B cells without functional p53, ionizing radiation activates the MYC pathway and its downstream genes involved in cell proliferation, ribosome biogenesis, and non-coding RNA metabolism. In wildtype B cells, radiation induces p53-mediated apoptosis and DNA repair while suppressing the target genes of MYC and E2F1, and leads to growth and cell cycle arrest. iDEP helps unveil the multifaceted functions of p53 and the possible involvement of several microRNAs such as miR-92a, miR-504, and miR-30a. In both examples, we validated known molecular pathways and generated novel, testable hypotheses. Combining comprehensive analytic functionalities with massive annotation databases, iDEP (http://ge-lab.org/idep/) enables biologists to easily translate transcriptomic and proteomic data into actionable insights.
TL;DR: This work proposes apeglm, which uses a heavy-tailed Cauchy prior distribution for effect sizes, resulting in lower bias than previous shrinkage estimators, while still reducing variance.
Abstract: In RNA-seq differential expression analysis, investigators aim to detect those genes with changes in expression level across experimental conditions, despite technical and biological variability in the observations. A common task is to accurately estimate the effect size, often in terms of a logarithmic fold change (LFC) in expression levels. When the counts of reads are low or highly variable in either or both conditions, the maximum likelihood estimates for the LFCs have high variance, leading to large estimates not representative of true differences, and poor ranking of genes by effect size. One approach is to introduce filtering thresholds and pseudocounts to exclude or moderate estimated LFCs. Filtering may result in a loss of genes from the analysis with true differences in expression across conditions, while pseudocounts provide a limited solution that needs to be adapted per dataset. Here, we propose the use of a heavy-tailed Cauchy prior distribution for effect sizes, which avoids the use of filter thresholds or pseudocounts. The proposed method, Approximate Posterior Estimation for GLM, apeglm, has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference. The apeglm package is available as an R/Bioconductor package at http://bioconductor.org/packages/apeglm, and the methods can be called from within the DESeq2 software.
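The shrinkage idea can be illustrated with a deliberately simplified Python sketch. It assumes a normal likelihood around the maximum likelihood LFC with a known standard error, whereas apeglm works with a GLM likelihood, so this is not the package's algorithm. The function name `map_lfc`, the prior scale of 1, and the grid search are all assumptions made for illustration.

```python
import math


def map_lfc(mle, se, prior_scale=1.0, grid_step=0.001):
    """Posterior mode of an LFC under a Cauchy(0, prior_scale) prior.

    Simplifying assumption: the likelihood is normal, centred on `mle`
    with standard deviation `se`. The mode then lies between 0 and `mle`,
    so a grid search over that interval is sufficient.
    """
    lo, hi = min(0.0, mle), max(0.0, mle)
    n_steps = int(round((hi - lo) / grid_step))
    best_beta, best_log_post = lo, -math.inf
    for i in range(n_steps + 1):
        beta = lo + i * grid_step
        log_lik = -((mle - beta) ** 2) / (2 * se ** 2)
        log_prior = -math.log(1 + (beta / prior_scale) ** 2)
        log_post = log_lik + log_prior
        if log_post > best_log_post:
            best_beta, best_log_post = beta, log_post
    return best_beta


# Same raw estimate, different amounts of information.
noisy = map_lfc(mle=3.0, se=2.0)    # little information: pulled strongly toward 0
precise = map_lfc(mle=3.0, se=0.2)  # much information: left almost unchanged
print(round(noisy, 3), round(precise, 3))
```

The heavy tails of the Cauchy prior are what allow the well-supported estimate to stay close to its raw value while the noisy one is moderated, without any filter threshold or pseudocount.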
TL;DR: MutationalPatterns, an R/Bioconductor package that allows researchers to characterize a broad range of patterns in base substitution catalogues to dissect the underlying molecular mechanisms, offers an efficient method to quantify the contribution of known mutational signatures within single samples.
Abstract: Base substitution catalogues represent historical records of mutational processes that have been active in a cell. Such processes can be distinguished by various characteristics, like mutation type, sequence context, transcriptional and replicative strand bias, genomic distribution and association with (epi)-genomic features. We have created MutationalPatterns, an R/Bioconductor package that allows researchers to characterize a broad range of patterns in base substitution catalogues to dissect the underlying molecular mechanisms. Furthermore, it offers an efficient method to quantify the contribution of known mutational signatures within single samples. This analysis can be used to determine whether certain DNA repair mechanisms are perturbed and to further characterize the processes underlying known mutational signatures. MutationalPatterns allows for easy characterization and visualization of mutational patterns. These analyses will support fundamental research into mutational mechanisms and may ultimately improve cancer diagnosis and treatment strategies. MutationalPatterns is freely available at http://bioconductor.org/packages/MutationalPatterns.
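As a rough illustration of what a base substitution catalogue contains, the following Python sketch counts substitutions by mutation type. It assumes the common convention of reporting each substitution relative to the pyrimidine (C or T) of the base pair, which yields six types. This is not the MutationalPatterns API, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTION_TYPES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def substitution_type(ref, alt):
    """Classify a single-base substitution relative to the pyrimidine strand."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def substitution_catalogue(variants):
    """Count (ref, alt) pairs per substitution type, including zero counts."""
    counts = Counter(substitution_type(ref, alt) for ref, alt in variants)
    return {sub_type: counts.get(sub_type, 0) for sub_type in SUBSTITUTION_TYPES}


catalogue = substitution_catalogue([("G", "A"), ("C", "T"), ("A", "G"), ("G", "T")])
print(catalogue)
```

Sequence context, strand bias, and signature contributions would be layered on top of counts like these; they are outside the scope of this sketch.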
TL;DR: The singscore method functions independent of sample composition in gene expression data and thus it provides stable scores, which are particularly useful for small data sets or data integration, and includes a suite of powerful visualization functions to assist in the interpretation of results.
Abstract: Gene set scoring provides a useful approach for quantifying concordance between sample transcriptomes and selected molecular signatures. Most methods use information from all samples to score an individual sample, leading to unstable scores in small data sets and introducing biases from sample composition (e.g. varying numbers of samples for different cancer subtypes). To address these issues, we have developed a truly single sample scoring method, and associated R/Bioconductor package singscore (https://bioconductor.org/packages/singscore). We use multiple cancer data sets to compare singscore against widely-used methods, including GSVA, z-score, PLAGE, and ssGSEA. Our approach does not depend upon background samples and scores are thus stable regardless of the composition and number of samples being scored. In contrast, scores obtained by GSVA, z-score, PLAGE and ssGSEA can be unstable when less data are available (NS < 25). The singscore method performs as well as the best performing methods in terms of power, recall, false positive rate and computational time, and provides consistently high and balanced performance across all these criteria. To enhance the impact and utility of our method, we have also included a set of functions implementing visual analysis and diagnostics to support the exploration of molecular phenotypes in single samples and across populations of data. The singscore method described here functions independent of sample composition in gene expression data and thus it provides stable scores, which are particularly useful for small data sets or data integration. Singscore performs well across all performance criteria, and includes a suite of powerful visualization functions to assist in the interpretation of results. This method performs as well as or better than other scoring approaches in terms of its power to distinguish samples with distinct biology and its ability to call true differential gene sets between two conditions. These scores can be used for dimensional reduction of transcriptomic data and the phenotypic landscapes obtained by scoring samples against multiple molecular signatures may provide insights for sample stratification.
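The property that a score "does not depend upon background samples" can be shown with a small rank-based sketch in Python. It is a simplified illustration rather than the singscore formula: genes are ranked within one sample, and the score is the mean rank of the gene-set members divided by the number of genes. The function name `rank_score` and the toy expression values are assumptions, and ties are resolved arbitrarily by sort order.

```python
def rank_score(expression, gene_set):
    """Score one sample from its own gene ranks only.

    expression: {gene: expression value} for a single sample.
    gene_set:   iterable of gene names; members absent from the sample are ignored.
    """
    ordered = sorted(expression, key=expression.get)          # lowest expression first
    ranks = {gene: position + 1 for position, gene in enumerate(ordered)}
    members = [gene for gene in gene_set if gene in ranks]
    if not members:
        raise ValueError("no gene-set members found in this sample")
    mean_rank = sum(ranks[gene] for gene in members) / len(members)
    return mean_rank / len(ranks)                             # scaled to (0, 1]


sample = {"A": 1.0, "B": 5.0, "C": 3.0, "D": 9.0}
score = rank_score(sample, ["B", "D"])
print(score)
```

Since nothing outside `sample` enters the calculation, adding or removing other samples from a cohort cannot change this sample's score.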
TL;DR: A systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods, found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering.
Abstract: Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degrees of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability to recover known subpopulations, the stability, and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor (https://bioconductor.org/packages/DuoClustering2018).
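One common way to quantify how well a clustering recovers known subpopulations is the adjusted Rand index (ARI). The abstract does not name the metric used, so treating ARI as the measure here is an assumption. The pure-Python sketch below computes it from two label vectors.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(truth) != len(predicted):
        raise ValueError("label vectors must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells that share a cluster in both, in truth, and in predicted.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_truth * sum_pred / comb(n, 2)   # agreement expected by chance
    maximum = (sum_truth + sum_pred) / 2
    if maximum == expected:
        return 1.0                                  # degenerate case: identical trivial partitions
    return (sum_cells - expected) / (maximum - expected)


known = [0, 0, 1, 1]
print(adjusted_rand_index(known, ["a", "a", "b", "b"]))  # same partition, different names
print(adjusted_rand_index(known, [0, 1, 0, 1]))          # partition that splits every known group
```

The index depends only on which cells are grouped together, so cluster names assigned by different methods do not need to match.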
TL;DR: The CEMiTool R package provides users with an easy-to-use method to automatically implement gene co-expression network analyses, obtain key information about the discovered gene modules using additional downstream analyses and retrieve publication-ready results via a high-quality interactive report.
Abstract: The analysis of modular gene co-expression networks is a well-established method commonly used for discovering the systems-level functionality of genes. In addition, these studies provide a basis for the discovery of clinically relevant molecular pathways underlying different diseases and conditions. In this paper, we present a fast and easy-to-use Bioconductor package named CEMiTool that unifies the discovery and the analysis of co-expression modules. Using the same real datasets, we demonstrate that CEMiTool outperforms existing tools, and provides unique results in a user-friendly HTML report with high-quality graphs. Among its features, our tool evaluates whether modules contain genes that are over-represented by specific pathways or that are altered in a specific sample group, and it integrates transcriptomic data with interactome information, identifying the potential hubs on each network. We successfully applied CEMiTool to over 1000 transcriptome datasets, and to a new RNA-seq dataset of patients infected with Leishmania, revealing novel insights into the disease’s physiopathology. The CEMiTool R package provides users with an easy-to-use method to automatically implement gene co-expression network analyses, obtain key information about the discovered gene modules using additional downstream analyses and retrieve publication-ready results via a high-quality interactive report.
TL;DR: An R package, DEsingle, was developed which employs a Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect three types of DE genes in scRNA-seq data with higher accuracy.
Abstract: Summary: The excessive amount of zeros in single-cell RNA-seq (scRNA-seq) data includes 'real' zeros due to the on-off nature of gene transcription in single cells and 'dropout' zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect three types of DE genes in scRNA-seq data with higher accuracy. Availability and implementation: The R package DEsingle is freely available at Bioconductor (https://bioconductor.org/packages/DEsingle). Supplementary information: Supplementary data are available at Bioinformatics online.
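A zero-inflated negative binomial (ZINB) model separates observed zeros into two sources, which the Python sketch below makes explicit. It uses the standard textbook parameterisation (mean `mu`, dispersion `size`, zero-inflation weight `zero_inflation`). The parameter values are invented, and the sketch does not reproduce how DEsingle fits the model or which component it labels as 'real' versus 'dropout' zeros.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial probability of count k (mean mu > 0, dispersion size > 0)."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, zero_inflation):
    """ZINB probability: a point mass at zero mixed with a negative binomial."""
    from_counts = (1 - zero_inflation) * nb_pmf(k, mu, size)
    return zero_inflation + from_counts if k == 0 else from_counts


def zero_sources(mu, size, zero_inflation):
    """Fractions of all zeros from the point-mass and the count component."""
    from_counts = (1 - zero_inflation) * nb_pmf(0, mu, size)
    total = zero_inflation + from_counts
    return zero_inflation / total, from_counts / total


p_zero = zinb_pmf(0, mu=2.0, size=1.0, zero_inflation=0.3)
point_mass_share, count_share = zero_sources(mu=2.0, size=1.0, zero_inflation=0.3)
print(p_zero, point_mass_share, count_share)
```

A method that models only the count component has no parameter for the point mass, which is why it cannot tell the two kinds of zeros apart.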
TL;DR: A new R package named EnrichedHeatmap is presented that efficiently visualizes genomic signal enrichment and provides advanced solutions for normalizing genomic signals within target regions as well as offering highly customizable visualizations.
Abstract: High-throughput sequencing data are dramatically increasing in volume. Thus, there is urgent need for efficient tools to perform fast and integrative analysis of multiple data types. Enriched heatmap is a specific form of heatmap that visualizes how genomic signals are enriched over specific target regions. It is commonly used and efficient at revealing enrichment patterns especially for high dimensional genomic and epigenomic datasets. We present a new R package named EnrichedHeatmap that efficiently visualizes genomic signal enrichment. It provides advanced solutions for normalizing genomic signals within target regions as well as offering highly customizable visualizations. The major advantage of EnrichedHeatmap is the ability to conveniently generate parallel heatmaps as well as complex annotations, which makes it easy to integrate and visualize comprehensive overviews of the patterns and associations within and between complex datasets. EnrichedHeatmap facilitates comprehensive understanding of high dimensional genomic and epigenomic data. The power of EnrichedHeatmap is demonstrated by visualization of the complex associations between DNA methylation, gene expression and various histone modifications.
TL;DR: A Bioconductor package, ATACseqQC, is presented for easily generating various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data; it has been used successfully for preprocessing and assessing several in-house and public ATAC-seq datasets.
Abstract: ATAC-seq (Assays for Transposase-Accessible Chromatin using sequencing) is a recently developed technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has higher signal to noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. While several tools have been developed or adopted for assessing read quality, identifying nucleosome occupancy and accessible regions from ATAC-seq data, none of the tools provide a comprehensive set of functionalities for preprocessing and quality assessment of aligned ATAC-seq datasets. We have developed a Bioconductor package, ATACseqQC, for easily generating various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, this package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling. Here we demonstrate the utilities of our package using 25 publicly available ATAC-seq datasets from four studies. We also provide guidelines on what the diagnostic plots should look like for an ideal ATAC-seq dataset. This software package has been used successfully for preprocessing and assessing several in-house and public ATAC-seq datasets. Diagnostic plots generated by this package will facilitate the quality assessment of ATAC-seq data, and help researchers to evaluate their own ATAC-seq experiments as well as select high-quality ATAC-seq datasets from public repositories such as GEO to avoid generating hypotheses or drawing conclusions from low-quality ATAC-seq experiments. The software, source code, and documentation are freely available as a Bioconductor package at https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html.
TL;DR: A user‐friendly R/Bioconductor package, named GDCRNATools, for downloading, organizing and analyzing RNA data in GDC with an emphasis on deciphering the lncRNA‐mRNA related competing endogenous RNAs regulatory network in cancers.
Abstract: Motivation: The large-scale multidimensional omics data in the Genomic Data Commons (GDC) provide opportunities to investigate the crosstalk among different RNA species and their regulatory mechanisms in cancers. Easy-to-use bioinformatics pipelines are needed to facilitate such studies. Results: We have developed a user-friendly R/Bioconductor package, named GDCRNATools, for downloading, organizing and analyzing RNA data in GDC with an emphasis on deciphering the lncRNA-mRNA related competing endogenous RNAs regulatory network in cancers. Many widely used bioinformatics tools and databases are utilized in our package. Users can easily pack preferred downstream analysis pipelines or integrate their own pipelines into the workflow. Interactive shiny web apps built in GDCRNATools greatly improve visualization of results from the analysis. Availability and implementation: GDCRNATools is an R/Bioconductor package that is freely available at Bioconductor (http://bioconductor.org/packages/devel/bioc/html/GDCRNATools.html). Detailed instructions, manual and example code are also available on GitHub (https://github.com/Jialab-UCR/GDCRNATools).
TL;DR: The method introduces a distance-centric analysis and visualization of the differences between two Hi-C datasets on a single plot that allows for a data-driven normalization of biases using locally weighted linear regression (loess), and is able to remove between-dataset bias present in Hi-C matrices.
Abstract: Changes in spatial chromatin interactions are now emerging as a unifying mechanism orchestrating the regulation of gene expression. Hi-C sequencing technology allows insight into chromatin interactions on a genome-wide scale. However, Hi-C data contains many DNA sequence- and technology-driven biases. These biases prevent effective comparison of chromatin interactions aimed at identifying genomic regions differentially interacting between, e.g., disease-normal states or different cell types. Several methods have been developed for normalizing individual Hi-C datasets. However, they fail to account for biases between two or more Hi-C datasets, hindering comparative analysis of chromatin interactions. We developed a simple and effective method, HiCcompare, for the joint normalization and differential analysis of multiple Hi-C datasets. The method introduces a distance-centric analysis and visualization of the differences between two Hi-C datasets on a single plot that allows for a data-driven normalization of biases using locally weighted linear regression (loess). HiCcompare outperforms methods for normalizing individual Hi-C datasets and methods for differential analysis (diffHiC, FIND) in detecting a priori known chromatin interaction differences while preserving the detection of genomic structures, such as A/B compartments. HiCcompare is able to remove between-dataset bias present in Hi-C matrices. It also provides a user-friendly tool to allow the scientific community to perform direct comparisons between the growing number of pre-processed Hi-C datasets available at online repositories. HiCcompare is freely available as a Bioconductor R package at https://bioconductor.org/packages/HiCcompare/.
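The distance-centric view can be sketched in Python as follows. Each pair of interacting bins contributes one point: the distance between the bins and the log2 ratio of the two datasets' interaction frequencies. The sketch then removes the distance-dependent offset using a per-distance mean, which is a deliberately crude substitute for the loess fit named in the abstract. The function names and the toy contact maps are assumptions, not the HiCcompare API.

```python
import math
from collections import defaultdict


def distance_ratio_points(contacts_1, contacts_2):
    """Return (distance in bins, log2 ratio) for bin pairs observed in both datasets.

    contacts_*: {(bin_i, bin_j): interaction frequency}.
    """
    points = []
    for (i, j), freq_1 in contacts_1.items():
        freq_2 = contacts_2.get((i, j), 0)
        if freq_1 > 0 and freq_2 > 0:          # a log ratio needs two positive counts
            points.append((abs(j - i), math.log2(freq_2 / freq_1)))
    return points


def center_by_distance(points):
    """Subtract the mean log ratio at each distance (stand-in for a loess curve)."""
    by_distance = defaultdict(list)
    for distance, log_ratio in points:
        by_distance[distance].append(log_ratio)
    mean_at = {d: sum(values) / len(values) for d, values in by_distance.items()}
    return [(d, log_ratio - mean_at[d]) for d, log_ratio in points]


dataset_1 = {(0, 1): 10, (1, 2): 10, (0, 2): 4}
dataset_2 = {(0, 1): 20, (1, 2): 20, (0, 2): 4}   # twofold excess at distance 1 only
raw = distance_ratio_points(dataset_1, dataset_2)
normalized = center_by_distance(raw)
print(raw)
print(normalized)
```

In this toy case the bias affects only one distance, so a single global scaling factor could not remove it, whereas a distance-aware correction does.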
TL;DR: scPipe is an R/Bioconductor package that integrates barcode demultiplexing, read alignment, UMI-aware gene-level quantification and quality control of raw sequencing data generated by multiple protocols.
Abstract: Single-cell RNA sequencing (scRNA-seq) technology allows researchers to profile the transcriptomes of thousands of cells simultaneously. Protocols that incorporate both designed and random barcodes have greatly increased the throughput of scRNA-seq, but give rise to a more complex data structure. There is a need for new tools that can handle the various barcoding strategies used by different protocols and exploit this information for quality assessment at the sample-level and provide effective visualization of these results in preparation for higher-level analyses. To this end, we developed scPipe, an R/Bioconductor package that integrates barcode demultiplexing, read alignment, UMI-aware gene-level quantification and quality control of raw sequencing data generated by multiple protocols that include CEL-seq, MARS-seq, Chromium 10X, Drop-seq and Smart-seq. scPipe produces a count matrix that is essential for downstream analysis along with an HTML report that summarises data quality. These results can be used as input for downstream analyses including normalization, visualization and statistical testing. scPipe performs this processing in a few simple R commands, promoting reproducible analysis of single-cell data that is compatible with the emerging suite of open-source scRNA-seq analysis tools available in R/Bioconductor and beyond. The scPipe R package is available for download from https://www.bioconductor.org/packages/scPipe.
TL;DR: The iSEE (Interactive SummarizedExperiment Explorer) software package is presented, which provides a general visual interface for exploring data in a SummarizedExperiment object, and provides useful features such as simultaneous examination of (meta)data and analysis results, dynamic linking between plots and code tracking for reproducibility.
Abstract: Data exploration is critical to the comprehension of large biological data sets generated by high-throughput assays such as sequencing. However, most existing tools for interactive visualisation are limited to specific assays or analyses. Here, we present the iSEE (Interactive SummarizedExperiment Explorer) software package, which provides a general visual interface for exploring data in a SummarizedExperiment object. iSEE is directly compatible with many existing R/Bioconductor packages for analysing high-throughput biological data, and provides useful features such as simultaneous examination of (meta)data and analysis results, dynamic linking between plots and code tracking for reproducibility. We demonstrate the utility and flexibility of iSEE by applying it to explore a range of real transcriptomics and proteomics data sets.
TL;DR: DMRcaller is a comprehensive tool for differential methylation analysis which displays high sensitivity and specificity for the detection of DMRs and performs entire genome wide analysis within a few hours.
Abstract: DNA methylation has been associated with transcriptional repression and detection of differential methylation is important in understanding the underlying causes of differential gene expression. Bisulfite-converted genomic DNA sequencing is the current gold standard in the field for building genome-wide maps at a base pair resolution of DNA methylation. Here we systematically investigate the underlying features of detecting differential DNA methylation in CpG and non-CpG contexts, considering both the case of mammalian systems and plants. In particular, we introduce DMRcaller, a highly efficient R/Bioconductor package, which implements several methods to detect differentially methylated regions (DMRs) between two samples. Most importantly, we show that different algorithms are required to compute DMRs and the most appropriate algorithm in each case depends on the sequence context and levels of methylation. Furthermore, we show that DMRcaller outperforms other available packages and we propose a new method to select the parameters for this tool and for other available tools. DMRcaller is a comprehensive tool for differential methylation analysis which displays high sensitivity and specificity for the detection of DMRs and performs entire genome wide analysis within a few hours.
TL;DR: This work presents a simple workflow using a set of existing R/Bioconductor packages for analysis of DTU, and shows how these packages can be used downstream of RNA-seq quantification using the Salmon software package.
Abstract: Detection of differential transcript usage (DTU) from RNA-seq data is an important bioinformatic analysis that complements differential gene expression analysis. Here we present a simple workflow using a set of existing R/Bioconductor packages for analysis of DTU. We show how these packages can be used downstream of RNA-seq quantification using the Salmon software package. The entire pipeline is fast, benefiting from inference steps by Salmon to quantify expression at the transcript level. The workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing two-stage testing of DTU using the stageR package, a statistical framework to screen at the gene level and then confirm which transcripts within the significant genes show evidence of DTU. We evaluate these packages and other related packages on a simulated dataset with parameters estimated from real data.
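The quantity behind DTU is the share of a gene's expression contributed by each transcript. The Python sketch below computes those proportions from transcript-level counts and reports the largest shift between two conditions. It is descriptive only: it performs no statistical test and is not the DRIMSeq, DEXSeq, or stageR procedure. All names and counts are invented for illustration.

```python
def usage_proportions(counts_by_gene):
    """Convert {gene: {transcript: count}} into per-gene transcript proportions."""
    usage = {}
    for gene, transcript_counts in counts_by_gene.items():
        total = sum(transcript_counts.values())
        if total == 0:
            continue                      # proportions are undefined for unexpressed genes
        usage[gene] = {tx: count / total for tx, count in transcript_counts.items()}
    return usage


def largest_usage_shift(usage_a, usage_b):
    """For genes present in both conditions, the largest absolute change in proportion."""
    shifts = {}
    for gene in usage_a.keys() & usage_b.keys():
        transcripts = usage_a[gene].keys() | usage_b[gene].keys()
        shifts[gene] = max(
            abs(usage_a[gene].get(tx, 0.0) - usage_b[gene].get(tx, 0.0))
            for tx in transcripts
        )
    return shifts


condition_a = {"geneX": {"tx1": 80, "tx2": 20}, "geneY": {"tx1": 50, "tx2": 50}}
condition_b = {"geneX": {"tx1": 60, "tx2": 140}, "geneY": {"tx1": 100, "tx2": 100}}
shifts = largest_usage_shift(usage_proportions(condition_a), usage_proportions(condition_b))
print(shifts)
```

Here geneY doubles in total expression without any change in usage, while geneX switches its dominant transcript. That contrast is why DTU complements differential gene expression analysis.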
TL;DR: Diffloop, an R/Bioconductor package that provides a suite of functions for the quality control, statistical testing, annotation, and visualization of DNA loops, is introduced and demonstrated by detecting differences between ENCODE ChIA-PET samples and relating looping to variability in epigenetic state.
Abstract: Summary: The 3D architecture of DNA within the nucleus is a key determinant of interactions between genes, regulatory elements, and transcriptional machinery. As a result, differences in DNA looping structure are associated with variation in gene expression and cell state. To systematically assess changes in DNA looping architecture between samples, we introduce diffloop, an R/Bioconductor package that provides a suite of functions for the quality control, statistical testing, annotation, and visualization of DNA loops. We demonstrate this functionality by detecting differences between ENCODE ChIA-PET samples and relate looping to variability in epigenetic state. Availability and implementation: Diffloop is implemented as an R/Bioconductor package available at https://bioconductor.org/packages/release/bioc/html/diffloop.html. Contact: aryee.martin@mgh.harvard.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: The Bioconductor package miRBaseConverter and the Shiny-based web application are presented to provide a suite of functions for converting and retrieving miRNA name, accession, sequence, species, version and family information in different versions of miRBase.
Abstract: miRBase is the primary repository for published miRNA sequence and annotation data, and serves as the “go-to” place for miRNA research. However, the definition and annotation of miRNAs have been changed significantly across different versions of miRBase. The changes cause inconsistency in miRNA related data between different databases and articles published at different times. Several tools have been developed for different purposes of querying and converting the information of miRNAs between different miRBase versions, but none of them individually can provide the comprehensive information about miRNAs in miRBase and users will need to use a number of different tools in their analyses. We introduce miRBaseConverter, an R package integrating the latest miRBase version 22 available in Bioconductor to provide a suite of functions for converting and retrieving miRNA name (ID), accession, sequence, species, version and family information in different versions of miRBase. The package is implemented in R and available under the GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/miRBaseConverter/). A Shiny-based GUI suitable for non-R users is also available as a standalone application from the package and also as a web application at http://nugget.unisa.edu.au:3838/miRBaseConverter. miRBaseConverter has a built-in database for querying miRNA information in all species and for both pre-mature and mature miRNAs defined by miRBase. In addition, it is the first tool for batch querying the miRNA family information. The package aims to provide a comprehensive and easy-to-use tool for miRNA research community where researchers often utilize published miRNA data from different sources. The Bioconductor package miRBaseConverter and the Shiny-based web application are presented to provide a suite of functions for converting and retrieving miRNA name, accession, sequence, species, version and family information in different versions of miRBase. The package will serve a wide range of applications in miRNA research and could provide a full view of the miRNAs of interest.
TL;DR: SIMLR (Single‐cell Interpretation via Multi‐kernel LeaRning), an open‐source tool that implements a novel framework to learn a sample‐to‐sample similarity measure from expression data observed for heterogenous samples, is presented here.
Abstract: SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a sample-to-sample similarity measure from expression data observed for heterogenous samples, is presented here. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of samples. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via a better visualization. SIMLR is available on GitHub (https://github.com/BatzoglouLabSU/SIMLR) in both R and MATLAB implementations. Furthermore, it is also available as an R package on http://bioconductor.org.
TL;DR: DaMiRseq offers an organized, flexible and convenient framework to remove noise and bias, select the most informative features and perform accurate classification for high-dimensional genomic data analysis.
Abstract: Summary: RNA-Seq is becoming the technique of choice for high-throughput transcriptome profiling, which, besides class comparison for differential expression, promises to be an effective and powerful tool for biomarker discovery. However, a systematic analysis of high-dimensional genomic data is a demanding task for such a purpose. DaMiRseq offers an organized, flexible and convenient framework to remove noise and bias, select the most informative features and perform accurate classification. Availability and implementation: DaMiRseq is developed for the R environment (R ≥ 3.4) and is released under GPL (≥2) License. The package runs on Windows, Linux and Macintosh operating systems and is freely available to non-commercial users at the Bioconductor open-source, open-development software project repository (https://bioconductor.org/packages/DaMiRseq/). In compliance with Bioconductor standards, the authors ensure stable package maintenance through software and documentation updates. Contact: luca.piacentini@ccfm.it. Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: esATAC is a highly integrated, easy-to-use R/Bioconductor package for systematic ATAC-seq data analysis that covers the essential steps of a full analysis procedure, including raw data processing, quality control and downstream statistical analysis such as peak calling, enrichment analysis and transcription factor footprinting.
Abstract: Summary: ATAC-seq is rapidly emerging as one of the major experimental approaches to probe chromatin accessibility genome-wide. Here, we present 'esATAC', a highly integrated, easy-to-use R/Bioconductor package for systematic ATAC-seq data analysis. It covers the essential steps of a full analysis procedure, including raw data processing, quality control and downstream statistical analysis such as peak calling, enrichment analysis and transcription factor footprinting. esATAC supports single-command execution of preset pipelines and provides flexible interfaces for building customized pipelines. Availability and implementation: The esATAC package is open source under the GPL-3.0 license. It is implemented in R and C++. Source code and binaries for Linux, Mac OS X and Windows are available through Bioconductor (https://www.bioconductor.org/packages/release/bioc/html/esATAC.html). Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: The drawProteins package enables the automated generation of protein schematics and integrates with the Bioconductor/R suite of tools for bioinformatics and statistical analysis.
Abstract: Protein schematics are valuable for research, teaching and knowledge communication. However, the tools used to automate the process are challenging. The purpose of the drawProteins package is to enable the generation of schematics of proteins in an automated fashion that can integrate with the Bioconductor/R suite of tools for bioinformatics and statistical analysis. Given UniProt accession numbers, the package uses the UniProt API to get the features of the protein from the UniProt database. The features are assembled into a data frame and visualized using adaptations of the ggplot2 package. Visualizations can be customised in many ways, including adding protein feature information from other data frames, altering colors and protein names, and adding extra layers using other ggplot2 functions. This can be completed within a script that makes the workflow reproducible and sharable.
TL;DR: Gwasurvivr, an R/Bioconductor package with a simple interface for conducting genome wide survival analyses using VCF (outputted from Michigan or Sanger imputation servers) and IMPUTE2 files, is developed and shows better scalability as sample size and number of SNPs increase.
Abstract: Researchers are increasingly interested in evaluating time-to-event outcomes such as survival in the context of genetic variation. However, there are limited software options for performing survival analyses with millions of SNPs. To address this, we developed gwasurvivr, an R/Bioconductor package to conduct fast and efficient genome wide survival analyses. gwasurvivr accepts data in VCF (as outputted from Michigan or Sanger imputation servers) and IMPUTE2 format, and provides a simple interface to run large-scale analyses. We benchmarked gwasurvivr against other GWAS software capable of conducting genome wide survival analysis (genipe, SurvivalGWAS_SV, and GWASTools) and demonstrated improved scalability, with shorter run times for large sample sizes and larger numbers of SNPs.
TL;DR: The Bioconductor package miRBaseConverter and the Shiny-based web application are presented to provide a suite of functions for converting and retrieving miRNA name, accession, sequence, species, version and family information in different versions of miRBase.
Abstract: Background: miRBase is the primary repository for published miRNA sequence and annotation data, and serves as the "go-to" place for miRNA research. However, the definition and annotation of miRNAs have changed significantly across different versions of miRBase. The changes cause inconsistency in miRNA-related data between different databases and articles published at different times. Several tools have been developed for different purposes of querying and converting the information of miRNAs between different miRBase versions, but none of them individually can provide comprehensive information about miRNAs in miRBase, so users need to use a number of different tools in their analyses. Results: We introduce miRBaseConverter, an R package integrating the latest miRBase version 22 available in Bioconductor to provide a suite of functions for converting and retrieving miRNA name (ID), accession, sequence, species, version and family information in different versions of miRBase. The package is implemented in R and available under the GPL-2 license from the Bioconductor website (http://bioconductor.org/packages/miRBaseConverter/). A Shiny-based GUI suitable for non-R users is also available as a standalone application from the package and as a web application at http://nugget.unisa.edu.au:3838/miRBaseConverter. miRBaseConverter has a built-in database for querying miRNA information in all species and for both pre-mature and mature miRNAs defined by miRBase. In addition, it is the first tool for batch querying the miRNA family information. The package aims to provide a comprehensive and easy-to-use tool for the miRNA research community, where researchers often utilize published miRNA data from different sources.
Conclusions: The Bioconductor package miRBaseConverter and the Shiny-based web application are presented to provide a suite of functions for converting and retrieving miRNA name, accession, sequence, species, version and family information in different versions of miRBase. The package will serve a wide range of applications in miRNA research and could provide a full view of the miRNAs of interest.
TL;DR: HPAanalyze is an R package for retrieving and performing exploratory data analysis from HPA that integrates into the R workflow via the tidyverse philosophy and data structures, and can be used in combination with Bioconductor packages for easy analysis of HPA data.
Abstract: Background: The Human Protein Atlas (HPA) aims to map human proteins via multiple technologies including imaging, proteomics and transcriptomics. Access to HPA data is mainly via a web-based interface that allows viewing of individual proteins, which may not be optimal for data analysis of a gene set or automatic retrieval of original images. Results: HPAanalyze is an R package for retrieving and performing exploratory data analysis from HPA. It provides functionality for importing data tables and xml files from HPA, exporting and visualizing data, as well as downloading all staining images of interest. The package is free, open source, and available via GitHub. Conclusions: HPAanalyze integrates into the R workflow via the tidyverse philosophy and data structures, and can be used in combination with Bioconductor packages for easy analysis of HPA data.
TL;DR: A new version of ELMER provides a Supervised analysis mode, which uses pre-defined sample groupings and can identify additional Master Regulators, such as KLF5 in basal-like breast cancer.
Abstract: Motivation: DNA methylation can be used to identify functional changes at transcriptional enhancers and other cis-regulatory modules (CRMs) in tumors and other primary disease tissues. Our R/Bioconductor package ELMER (Enhancer Linking by Methylation/Expression Relationships) provides a systematic approach that reconstructs gene regulatory networks (GRNs) by combining methylation and gene expression data derived from the same set of samples.
Results: We present a new ELMER version that provides a new Supervised analysis mode, which uses pre-defined sample groupings and can identify additional Master Regulators, such as KLF5 in basal-like breast cancer.
Availability: ELMER (v2.0) is available as an R/Bioconductor package at http://bioconductor.org/packages/ELMER/ with auxiliary data at http://bioconductor.org/packages/ELMER.data/.
TL;DR: Rqc, a Bioconductor package designed to assist the analyst during assessment of high-throughput sequencing data quality, is developed and new data quality visualization strategies are created by using established analytical procedures to improve the ability of identifying patterns that may affect downstream procedures.
Abstract: As sequencing costs drop with the constant improvements in the field, next-generation sequencing becomes one of the most used technologies in biological research. Sequencing technology allows the detailed characterization of events at the molecular level, including gene expression, genomic sequence and structural variants. Such experiments result in billions of sequenced nucleotides, each of which is associated with a quality score. Several software tools allow the quality assessment of whole experiments. However, users need to switch between software environments to perform all steps of data analysis, adding an extra layer of complexity to the data analysis workflow. We developed Rqc, a Bioconductor package designed to assist the analyst during assessment of high-throughput sequencing data quality. The package uses parallel computing strategies to optimize the processing of large data sets, regardless of the sequencing platform. We created new data quality visualization strategies by using established analytical procedures. These improve the ability to identify patterns that may affect downstream procedures, including undesired sources of technical variability. The software provides a framework for writing customized reports that integrates seamlessly into the R/Bioconductor environment, including publication-ready images. The package also offers an interactive tool to generate quality reports dynamically. Rqc is implemented in R and is freely available through the Bioconductor project (https://bioconductor.org/packages/Rqc/) for Windows, Linux and Mac OS X operating systems.
TL;DR: MWASTools is an R package designed to provide an integrated pipeline to analyse metabonomic data in large‐scale epidemiological studies and key functionalities include: quality control analysis; metabolome‐wide association analysis using various models; visualization of statistical outcomes; metabolite assignment using statistical total correlation spectroscopy (STOCSY); and biological interpretation of metabolome-wide association studies results.
Abstract: Summary: MWASTools is an R package designed to provide an integrated pipeline to analyse metabonomic data in large-scale epidemiological studies. Key functionalities of our package include: quality control analysis; metabolome-wide association analysis using various models (partial correlations, generalized linear models); visualization of statistical outcomes; metabolite assignment using statistical total correlation spectroscopy (STOCSY); and biological interpretation of metabolome-wide association studies results. Availability and implementation: The MWASTools R package is implemented in R (version ≥ 3.4) and is available from Bioconductor: https://bioconductor.org/packages/MWASTools/. Contact: m.dumas@imperial.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: A graphical user interface (GUI) using Shiny is developed for the R/Bioconductor TCGAbiolinks package, which allows users to search, download and prepare cancer genomics data for integrative data analysis.
Abstract: The GDC (Genomic Data Commons) data portal provides users with data from cancer genomics studies. Recently, we developed the R/Bioconductor TCGAbiolinks package, which allows users to search, download and prepare cancer genomics data for integrative data analysis. Using this package requires advanced knowledge of R, which limits the number of users. To overcome this obstacle and make the package accessible to a wider range of users, we developed a graphical user interface (GUI) using Shiny, available through the package TCGAbiolinksGUI. The TCGAbiolinksGUI package is freely available within the Bioconductor project at http://bioconductor.org/packages/TCGAbiolinksGUI/. Links to the GitHub repository, a demo version of the tool, a docker image and PDF/video tutorials are available from the TCGAbiolinksGUI site.