TL;DR: The biomaRt package provides a tight integration of large, public or locally installed BioMart databases with data analysis in Bioconductor creating a powerful environment for biological data mining.
Abstract: Summary: biomaRt is a new Bioconductor package that integrates BioMart data resources with data analysis software in Bioconductor. It can annotate a wide range of gene or gene product identifiers (e.g. Entrez-Gene and Affymetrix probe identifiers) with information such as gene symbol, chromosomal coordinates, Gene Ontology and OMIM annotation. Furthermore biomaRt enables retrieval of genomic sequences and single nucleotide polymorphism information, which can be used in data analysis. Fast and up-to-date data retrieval is possible as the package executes direct SQL queries to the BioMart databases (e.g. Ensembl). The biomaRt package provides a tight integration of large, public or locally installed BioMart databases with data analysis in Bioconductor creating a powerful environment for biological data mining.
Availability: http://www.bioconductor.org. LGPL
Contact: steffen.durinck@esat.kuleuven.ac.be
TL;DR: MADE4 takes advantage of the extensive multivariate statistical and graphical functions in the R package ade4, extending these for application to microarray data, and provides new graphical and visualization tools that aid in interpretation of multivariate analysis of microarray data.
Abstract: Summary: MADE4, microarray ade4, is a software package that facilitates multivariate analysis of microarray gene-expression data. MADE4 accepts a wide variety of gene-expression data formats. MADE4 takes advantage of the extensive multivariate statistical and graphical functions in the R package ade4, extending these for application to microarray data. In addition, MADE4 provides new graphical and visualization tools that aid in interpretation of multivariate analysis of microarray data.
Availability: The R package MADE4 is available from http://bioinf.vcd.ie/software and from Bioconductor http://www.bioconductor.org
Contact: aedin.culhane@ucd.ie
Supplementary information: MADE4 is well documented. There are tutorials, in the form of vignettes, which describe typical analyses. In addition, the MADE4 manual provides descriptions and examples for each function.
TL;DR: Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.
Abstract: Full four-color book.
Some of the editors created the Bioconductor project and Robert Gentleman is one of the two originators of R.
All methods are illustrated with publicly available data, and a major section of the book is devoted to fully worked case studies.
Code underlying all of the computations that are shown is made available on a companion website, and readers can reproduce every number, figure, and table on their own computers.
TL;DR: The multtest package implements widely applicable resampling-based single-step and stepwise multiple testing procedures (MTPs) for controlling a broad class of Type I error rates.
Abstract: The Bioconductor R package multtest implements widely applicable resampling-based single-step and stepwise multiple testing procedures (MTP) for controlling a broad class of Type I error rates. The current version of multtest provides MTPs for tests concerning means, differences in means, and regression parameters in linear and Cox proportional hazards models. Typical testing scenarios are illustrated by applying various MTPs implemented in multtest to the Acute Lymphoblastic Leukemia (ALL) data set of Chiaretti et al. (2004), with the aim of identifying genes whose expression measures are associated with (possibly censored) biological and clinical outcomes.
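To make the idea of a resampling-based single-step procedure concrete, the following is a minimal Python sketch of a permutation-based single-step maxT adjustment for a two-group comparison. It illustrates the general technique only and is not the multtest implementation or its API. The function names, the use of an unstandardized absolute difference in means as the test statistic, and the permutation scheme are all assumptions made for this example.

```python
import random
from statistics import mean


def abs_mean_diff(rows, labels):
    """Absolute difference in group means for each gene (row).

    Assumption: an unstandardized statistic is used to keep the sketch short;
    a t-statistic would be the more usual choice.
    """
    stats = []
    for row in rows:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        stats.append(abs(mean(g1) - mean(g0)))
    return stats


def single_step_maxt(rows, labels, n_perm=1000, seed=0):
    """Single-step maxT adjusted p-values from label permutations."""
    rng = random.Random(seed)
    observed = abs_mean_diff(rows, labels)
    exceed = [0] * len(rows)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        # Largest statistic over all genes under this relabeling.
        max_stat = max(abs_mean_diff(rows, perm))
        for j, t in enumerate(observed):
            if max_stat >= t:
                exceed[j] += 1
    # The +1 terms count the observed labeling as one permutation.
    return [(c + 1) / (n_perm + 1) for c in exceed]


# Hypothetical toy data: two genes, three samples per group.
rows = [[0, 0, 0, 10, 10, 10], [1, 2, 1, 2, 1, 2]]
labels = [0, 0, 0, 1, 1, 1]
adjusted = single_step_maxt(rows, labels, n_perm=200)
```

Because every gene is compared against the same permutation distribution of the maximum statistic, a gene with a larger observed statistic never receives a larger adjusted p-value than a gene with a smaller one.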
TL;DR: A simple procedure is presented which combines experimental measurements with available biological information in a way that genes are simultaneously tested in groups related by common functional properties and constitutes a very sensitive tool for selecting genes with significant differential behaviour in the experimental conditions tested.
Abstract: Motivation: The analysis of genome-scale data from different high throughput techniques can be used to obtain lists of genes ordered according to their different behaviours under distinct experimental conditions corresponding to different phenotypes (e.g. differential gene expression between diseased samples and controls, different response to a drug, etc.). The order in which the genes appear in the list is a consequence of the biological roles that the genes play within the cell, which account, at molecular scale, for the macroscopic differences observed between the phenotypes studied. Typically, two steps are followed for understanding the biological processes that differentiate phenotypes at molecular level: first, genes with significant differential expression are selected on the basis of their experimental values and subsequently, the functional properties of these genes are analysed. Instead, we present a simple procedure which combines experimental measurements with available biological information in a way that genes are simultaneously tested in groups related by common functional properties. The method proposed constitutes a very sensitive tool for selecting genes with significant differential behaviour in the experimental conditions tested.
Results: We propose the use of a method to scan ordered lists of genes. The method allows the understanding of the biological processes operating at molecular level behind the macroscopic experiment from which the list was generated. This procedure can be useful in situations where it is not possible to obtain statistically significant differences based on the experimental measurements (e.g. low prevalence diseases, etc.). Two examples demonstrate its application in two microarray experiments and the type of information that can be extracted.
Availability: The software used for the association of significant Gene Ontology (GO) terms to sets of genes is available at http://www.fatigo.org and http://www.babelomics.org. Software for ranking genes according to phenotypes is available in GEPAS (http://www.gepas.org). The multtest program from the bioconductor package is available at http://www.bioconductor.org/repository/devel/package/html/multtest.html.
Contact: jdopazo@ochoa.fib.es
TL;DR: Interfaces to open source resources for visualization and network algorithms have been developed to support analysis of graphical structures in genomics and computational biology.
Abstract: Summary: In this paper, we review the central concepts and implementations of tools for working with network structures in Bioconductor. Interfaces to open source resources for visualization (AT&T Graphviz) and network algorithms (Boost) have been developed to support analysis of graphical structures in genomics and computational biology.
Availability: Packages graph, Rgraphviz, RBGL of Bioconductor (www.bioconductor.org).
Contact: stvjc@channing.harvard.edu
TL;DR: Different approaches to the identification of changes in gene expression that are associated with particular biological conditions are discussed and how they can be applied using software from the Bioconductor Project is illustrated.
Abstract: A basic, yet challenging task in the analysis of microarray gene expression data is the identification of changes in gene expression that are associated with particular biological conditions. We discuss different approaches to this task and illustrate how they can be applied using software from the Bioconductor Project. A central problem is the high dimensionality of gene expression space, which prohibits a comprehensive statistical analysis without focusing on particular aspects of the joint distribution of the genes' expression levels. Possible strategies are to do univariate gene-by-gene analysis, and to perform data-driven nonspecific filtering of genes before the actual statistical analysis. However, more focused strategies that make use of biologically relevant knowledge are more likely to increase our understanding of the data.
Keywords:
differential gene expression;
microarrays;
multiple testing;
statistical software;
biological metadata
TL;DR: A comparison of the robust neural network method with other published methods demonstrates its potential in reducing both intensity-dependent bias and spatial-dependent bias, which translates to more reliable identification of truly regulated genes.
Abstract: Motivation: Microarray experiments are affected by numerous sources of non-biological variation that contribute systematic bias to the resulting data. In a dual-label (two-color) cDNA or long-oligonucleotide microarray, these systematic biases are often manifested as an imbalance of measured fluorescent intensities corresponding to Sample A versus those corresponding to Sample B. Systematic biases also affect between-slide comparisons. Making effective corrections for these systematic biases is a requisite for detecting the underlying biological variation between samples. Effective data normalization is therefore an essential step in the confident identification of biologically relevant differences in gene expression profiles. Several normalization methods for the correction of systemic bias have been described. While many of these methods have addressed intensity-dependent bias, few have addressed both intensity-dependent and spatiality-dependent bias.
Results: We present a neural network-based normalization method for correcting the intensity- and spatiality-dependent bias in cDNA microarray datasets. In this normalization method, the dependence of the log-intensity ratio (M) on the average log-intensity (A) as well as on the spatial coordinates (X,Y) of spots is approximated with a feed-forward neural network function. Resistance to outliers is provided by assigning a weight to each spot based on how distant its M value is from the median over the spots whose A values are similar, as well as by using pseudospatial coordinates instead of spot row and column indices. A comparison of the robust neural network method with other published methods demonstrates its potential in reducing both intensity-dependent bias and spatial-dependent bias, which translates to more reliable identification of truly regulated genes.
Availability: The normalization method described in this paper is available as the library nnNorm in the BioConductor project (http://www.bioconductor.org). Scripts used to load the freely available data and generate some of the figures in this paper are available in the documentation accompanying this library.
Contact: ltarca@rsvs.ulaval.ca
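The M and A quantities referred to in the nnNorm abstract can be sketched in a few lines of Python. This is only an illustration of the two definitions (log-intensity ratio and average log-intensity), assuming base-2 logarithms and per-spot red and green channel intensities. The function names are hypothetical. The global median centering shown at the end is a much simpler stand-in baseline, not the neural network fit described by the authors.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = average of log2(R) and log2(G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


def median_center(m):
    """Baseline normalization (assumption, not nnNorm): subtract the global median of M."""
    mid = median(m)
    return [x - mid for x in m]


# Hypothetical intensities for three spots.
red = [4.0, 8.0, 16.0]
green = [1.0, 2.0, 2.0]
m, a = ma_values(red, green)
m_centered = median_center(m)
```

An intensity- and space-aware method replaces the single median with a fitted function of A and the spot coordinates, which is then subtracted from M in the same way.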
TL;DR: This chapter discusses strategies for gene-at-a-time analyses, nonspecific and meta-data driven prefiltering techniques, and commonly used test statistics for detecting differential expression, and demonstrates the use of factorial models for probing complex biological systems.
Abstract: In this chapter, we focus on the analysis of differential gene expression studies. Many microarray studies are designed to detect genes associated with different phenotypes, for example, the comparison of cancer tumors and normal cells. In some multifactor experiments, genetic networks are perturbed with various treatments to understand the effects of those treatments and their interactions with each other in the dynamic cellular network. For even the simplest experiments, investigators must consider several issues for appropriate gene selection. We discuss strategies for gene-at-a-time analyses, nonspecific and meta-data driven prefiltering techniques, and commonly used test statistics for detecting differential expression. We show how these strategies and statistical tools are implemented and used in Bioconductor. We also demonstrate the use of factorial models for probing complex biological systems and highlight the importance of carefully coordinating known cellular behavior with statistical modeling to make biologically relevant inference from microarray studies.
TL;DR: In this article, the authors demonstrate Bioconductor tools useful for creating such lists, starting from the raw probe level data (CEL files) and conclude with the creation of annotated reports.
Abstract: One of the most popular applications of microarray technology is the identification of genes that are differentially expressed in two populations. With Affymetrix GeneChip technology, there are several steps between hybridization and the selection of interesting genes. The steps of preprocessing to improve signal to noise ratios, choosing a summary statistic for appropriate ranking of genes, and deciding on a final filter for candidate genes are largely statistical in nature. In this chapter, we demonstrate Bioconductor tools useful for creating such lists. We start from the raw probe level data (CEL files) and conclude with the creation of annotated reports.
TL;DR: The most widely used families of machine learning methods are described, along with various approaches to learner assessment, and key problems of model selection and interpretation are reviewed in examples.
Abstract: In this chapter, supervised machine learning methods are described in the context of microarray applications. The most widely used families of machine learning methods are described, along with various approaches to learner assessment. The Bioconductor interfaces to machine learning tools are described and illustrated. Key problems of model selection and interpretation are reviewed in examples.
TL;DR: This chapter demonstrates Bioconductor tools useful for creating lists of genes that are differentially expressed in two populations and starts from the raw probe level data (CEL files) and concludes with the creation of annotated reports.
Abstract: The predominant use for microarrays is the measurement of genome-wide expression levels, and the most commonly used microarray platform is the Affymetrix GeneChip. Affymetrix GeneChip arrays use short oligonucleotides to probe for genes in an RNA sample. Genes are represented by a set of oligonucleotide probes, each with a length of 25 bases. Because of their short length, multiple probes are used to improve specificity. Affymetrix arrays typically use between 11 and 20 probe pairs, referred to as a probeset, for each gene. One component of these pairs is referred to as a perfect match probe (PM) and is designed to hybridize only with transcripts from the intended gene (specific hybridization). However, hybridization to the PM probes by other mRNA species (non-specific hybridization) is unavoidable. Therefore, the observed intensities need to be adjusted to be accurately quantified. The other component of a probe pair, the mismatch probe (MM), is constructed with the intention of measuring only the nonspecific component of the corresponding PM probe. Affymetrix’s strategy is to make MM probes identical to their PM counterpart except that the 13th base is exchanged with its complement. The identification of genes that are differentially expressed in two populations is a popular application of Affymetrix GeneChip technology. Due to the cost of this technology, experiments using a small number of arrays are common. A situation we often see is the case where three arrays are used for each population. In this lab, we give an example of how to quickly create lists of genes that are interesting in the sense that they appear to be differentially expressed, starting from the raw probe level data (CEL files). In Section 2, we briefly describe the functions necessary to import the data into Bioconductor. In Section 3 we talk about preprocessing. In Section 4, we describe ways to rank genes and decide on a cutoff. Finally, in Section 5 we describe how to make annotated reports and examine the PubMed literature related to the genes in our list.
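The PM/MM construction described in this abstract (a 25-base probe whose 13th base is exchanged with its complement) can be written as a short Python sketch. The function name and the example sequence are made up for illustration; only the rule itself comes from the text.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm, position=13):
    """Return the MM probe for a 25-base PM probe.

    The base at the given 1-based position is replaced by its complement;
    all other bases are left unchanged.
    """
    if len(pm) != 25:
        raise ValueError("expected a 25-base probe sequence")
    i = position - 1  # convert to 0-based index
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]


# Hypothetical PM sequence (not a real probe).
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
```

The two sequences differ at exactly one position, the middle base of the probe.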
TL;DR: This chapter describes software tools for creating, manipulating, and visualizing graphs in the Bioconductor project and gives the rationale for the design decisions and brief outlines of how to make use of these tools.
Abstract: We describe software tools for creating, manipulating, and visualizing graphs in the Bioconductor project. We give the rationale for our design decisions and provide brief outlines of how to make use of these tools. The discussion mirrors that of Chapter 20 where the different mathematical constructs were described. It is worth differentiating between packages that are mainly infrastructure (sets of tools that can be used to create other pieces of software) and packages that are designed to provide an end-user application. The packages graph, RBGL, and Rgraphviz are infrastructure packages. Software developers may use these packages to construct tools aimed at specific applications areas, such as the GOstats package.
TL;DR: Stam, a computational tool for semi-supervised molecular disease entity detection, automatically discovers molecular heterogeneities in phenotypically defined disease entities and suggests alternative molecular sub-entities of clinical phenotypes using both gene expression data and functional gene annotations.
Abstract: Genome wide microarray studies have the potential to unveil novel disease entities. Clinically homogeneous groups of patients can have diverse gene expression profiles. The definition of novel subclasses based on gene expression is a difficult problem not addressed systematically by currently available software tools. We present a computational tool for semi-supervised molecular disease entity detection. It automatically discovers molecular heterogeneities in phenotypically defined disease entities and suggests alternative molecular sub-entities of clinical phenotypes. This is done using both gene expression data and functional gene annotations. We provide stam, a Bioconductor compliant software package for the statistical programming environment R. We demonstrate that our tool detects gene expression patterns, which are characteristic for only a subset of patients from an established disease entity. We call such expression patterns molecular symptoms. Furthermore, stam finds novel sub-group stratifications of patients according to the absence or presence of molecular symptoms. Our software is easy to install and can be applied to a wide range of datasets. It provides the potential to reveal so far indistinguishable patient sub-groups of clinical relevance.
TL;DR: This chapter will discuss the appropriate circumstances under which webbioc should be deployed and the pros and cons of using it versus the typical command line environment of R.
Abstract: webbioc is a CGI-based interface to Bioconductor methods for preprocessing and analyzing Affymetrix data. It wraps up the functionality of a number of Bioconductor packages into a consistent environment that can be deployed for use by small groups or large departments. Without ever seeing a command prompt, it will take the user from raw data to annotated lists of the most significantly differentially expressed genes. It will optionally make use of a back-end computer cluster for batch processing. This chapter will discuss the appropriate circumstances under which webbioc should be deployed and the pros and cons of using it versus the typical command line environment of R. Installation and configuration will be fully covered. Use of the Web-based interface will be visually demonstrated. Finally, we will describe how to expand the interface by adding additional analysis modules.
TL;DR: The requirements, language features, and methodology of design and development guiding the evolution of this project are described, which are expected to foster the propagation of standards of transparency and explicit reproducibility from wet-lab science, to in silico biology, where explicit reproduction of important published results is often very difficult.
Abstract: Bioconductor is an open source initiative for the creation and dissemination of methods in statistical genomics and computational biology based on R. This article describes the requirements, language features, and methodology of design and development guiding the evolution of this project. Commitments to software interoperability, computable task-oriented documentation, and full transparency of algorithm development and use are found to be valuable in reducing barriers to access faced by statistical, computational, or biological researchers attempting interdisciplinary work. These commitments are expected to foster the propagation of standards of transparency and explicit reproducibility from wet-lab science, where they are well accepted, to in silico biology, where explicit reproduction of important published results is often very difficult.
Keywords:
computational biology;
open source software;
object-oriented programming;
documentation;
network algorithms;
software quality assurance;
reproducible research
TL;DR: This section considers some of the different sources of biological information as well as the software tools that can be used to access these data and to integrate them into an analysis.
Abstract: Closing the gap between knowledge of sequence and knowledge of function requires aggressive, integrative use of biological research databases of many different types. For greatest effectiveness, analysis processes and interpretation of analytic results must be guided using relevant knowledge about the systems under investigation. However, this knowledge is often widely scattered and encoded in a variety of formats. In this section, we consider some of the different sources of biological information as well as the software tools that can be used to access these data and to integrate them into an analysis. Bioconductor provides tools for creating, distributing, and accessing annotation resources in ways that have been found effective in workflows for statistical analysis of microarray and other high-throughput assays.
TL;DR: The software package provides four clustering algorithms and GeneOntology terms as prototype annotation data and the functional analysis is based on the hypergeometric distribution whereby the Bonferroni correction or the false discovery rate can be used to correct for multiple testing.
Abstract: Motivation: Several tools that facilitate the interpretation of transcriptional profiles using gene annotation data are available but most of them combine a particular statistical analysis strategy with functional information. goCluster extends this concept by providing a modular framework that facilitates integration of statistical and functional microarray data analysis with data interpretation.
Results: goCluster enables scientists to employ annotation information, clustering algorithms and visualization tools in their array data analysis and interpretation strategy. The package provides four clustering algorithms and GeneOntology terms as prototype annotation data. The functional analysis is based on the hypergeometric distribution whereby the Bonferroni correction or the false discovery rate can be used to correct for multiple testing. The approach implemented in goCluster was successfully applied to interpret the results of complex mammalian and yeast expression data obtained with high density oligonucleotide microarrays (GeneChips).
Availability: goCluster is available via the BioConductor portal at www.bioconductor.org. The software package, detailed documentation, user- and developer guides as well as other background information are also accessible via a web portal at http://www.bioz.unibas.ch/gocluster.
Contact: michael.primig@unibas.ch
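The kind of functional analysis named in the goCluster abstract, a hypergeometric test followed by a Bonferroni correction, can be sketched in Python as follows. This is a generic over-representation test under stated assumptions, not goCluster's code: the function names, the choice of the upper tail P(X >= k), and the example counts are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes on the array, K: genes annotated with the term,
    n: genes in the cluster, k: annotated genes found in the cluster.
    """
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical counts for three annotation terms tested on one cluster of 40 genes
# drawn from an array of 1000 genes: (k, K) pairs.
terms = [(8, 30), (2, 50), (5, 100)]
raw = [hypergeom_upper_tail(k, K, 40, 1000) for k, K in terms]
adjusted = bonferroni(raw)
```

Controlling the false discovery rate instead would replace only the `bonferroni` step; the per-term hypergeometric p-values stay the same.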
TL;DR: MIDAW (microarray data analysis web tool) is a web interface integrating a series of statistical algorithms that can be used for processing and interpretation of microarray data.
Abstract: MIDAW (microarray data analysis web tool) is a web interface integrating a series of statistical algorithms that can be used for processing and interpretation of microarray data. MIDAW consists of two main sections: data normalization and data analysis. In the normalization phase the simultaneous processing of several experiments with background correction, global and local mean and variance normalization are carried out. The data analysis section allows graphical display of expression data for descriptive purposes, estimation of missing values, reduction of data dimension, discriminant analysis and identification of marker genes. The statistical results are organized in dynamic web pages and tables, where the transcript/gene probes contained in a specific microarray platform can be linked (according to user choice) to external databases (GenBank, Entrez Gene, UniGene). Tutorial files help the user throughout the statistical analysis to ensure that the forms are filled out correctly. MIDAW has been developed using Perl and PHP and it uses R/Bioconductor languages and routines. MIDAW is GPL licensed and freely accessible at http://muscle.cribi.unipd.it/midaw/. Perl and PHP source codes are available from the authors upon request.
TL;DR: The authors propose a distance synthesis scheme that integrates differing statistics to rank and select differentially expressed genes; using a set of spike-in datasets, in which differentially expressed genes are known, they demonstrate that the method compares favorably with the best individual statistics while achieving robustness properties that the individual statistics lack.
Abstract: Motivation: A common objective of microarray experiments is the detection of differential gene expression between samples obtained under different conditions. The task of identifying differentially expressed genes consists of two aspects: ranking and selection. Numerous statistics have been proposed to rank genes in order of evidence for differential expression. However, no one statistic is universally optimal and there is seldom any basis or guidance that can direct toward a particular statistic of choice.
Results: Our new approach, which addresses both ranking and selection of differentially expressed genes, integrates differing statistics via a distance synthesis scheme. Using a set of (Affymetrix) spike-in datasets, in which differentially expressed genes are known, we demonstrate that our method compares favorably with the best individual statistics, while achieving robustness properties lacked by the individual statistics. We further evaluate performance on one other microarray study.
Availability: The approach is implemented in an R package called DEDS, which is available for download from the Bioconductor website (http://www.bioconductor.org/).
Contact: mark@biostat.ucsf.edu
TL;DR: A novel algorithm called Structured Analysis of Microarrays (StAM), which accounts for molecular heterogeneity of complex clinical phenotypes and goes beyond established methodology in several aspects: in addition to the expression data, it exploits functional annotations from the Gene Ontology database to build biologically focussed classifiers.
Abstract: Motivation: Today, the characterization of clinical phenotypes by gene-expression patterns is widely used in clinical research. If the investigated phenotype is complex from the molecular point of view, new challenges arise and these have not been addressed systematically. For instance, the same clinical phenotype can be caused by various molecular disorders, such that one observes different characteristic expression patterns in different patients.
Results: In this paper we describe a novel algorithm called Structured Analysis of Microarrays (StAM), which accounts for molecular heterogeneity of complex clinical phenotypes. Our algorithm goes beyond established methodology in several aspects: in addition to the expression data, it exploits functional annotations from the Gene Ontology database to build biologically focussed classifiers. These are used to uncover potential molecular disease subentities and associate them to biological processes without compromising overall prediction accuracy.
Availability: Bioconductor compliant R package
Contact: Claudio.Lottaz@molgen.mpg.de
Supplementary information: Complete analyses are available at http://compdiag.molgen.mpg.de/supplements/lottaz05
TL;DR: This chapter begins by describing how to import probe-level data into the system and how these data can be examined using the facilities of the AffyBatch class, and describes background adjustment, normalization, and summarization methods.
Abstract: High-density oligonucleotide expression arrays are a widely used microarray platform. Affymetrix GeneChip arrays dominate this market. An important distinction between the GeneChip and other technologies is that on GeneChips, multiple short probes are used to measure gene expression levels. This makes preprocessing particularly important when using this platform. This chapter begins by describing how to import probe-level data into the system and how these data can be examined using the facilities of the AffyBatch class. Then we will describe background adjustment, normalization, and summarization methods. Functionality for GeneChip probe-level data is provided by the affy, affyPLM, affycomp, gcrma, and affypdnn packages. All these tools are useful for preprocessing probe-level data stored in an AffyBatch object into expression-level data stored in an exprSet object. Because there are many competing methods for this preprocessing step, it is useful to have a way to assess the differences. In Bioconductor, this can be carried out using the affycomp package, which we discuss briefly.
TL;DR: Simpleaffy is a BioConductor package that provides access to a variety of QC metrics for assessing the quality of RNA samples and of the intermediate stages of sample preparation and hybridization.
Abstract: Summary: Quality Control is a fundamental aspect of successful microarray data analysis. Simpleaffy is a BioConductor package that provides access to a variety of QC metrics for assessing the quality of RNA samples and of the intermediate stages of sample preparation and hybridization. Simpleaffy also offers fast implementations of popular algorithms for generating expression summaries and detection calls.
Availability: Simpleaffy can be downloaded from http://www.bioconductor.org
Contact: cmiller@picr.man.ac.uk
Supplementary information: Additional information can be found on the supplementary website located at http://bioinformatics.picr.man.ac.uk
TL;DR: twilight is a Bioconductor compatible package for analysing the statistical significance of differentially expressed genes, which is based on the concept of the local false discovery rate (FDR), a generalization of the frequently used global FDR.
Abstract: Summary: twilight is a Bioconductor compatible package for analysing the statistical significance of differentially expressed genes. It is based on the concept of the local false discovery rate (FDR), a generalization of the frequently used global FDR. twilight implements the heuristic search algorithm for estimating the local FDR introduced in our earlier work. In addition to the raw significance measures, it produces diagnostic plots, which provide insight into the extent of differential expression across genes.
Availability: http://www.bioconductor.org
Contact: stefanie.scheid@molgen.mpg.de
Supplementary information: Please visit our software webpage on http://compdiag.molgen.mpg.de/software
TL;DR: This work developed an online microarray data analysis platform, WebArray, for bench biologists to utilize these tools to explore data from single/dual color microarray experiments, and provides a user-friendly interface for accessing a wide range of key functions of limma and others.
Abstract: Background
Many cutting-edge microarray analysis tools and algorithms, including commonly used limma and affy packages in Bioconductor, need sophisticated knowledge of mathematics, statistics and computer skills for implementation. Commercially available software can provide a user-friendly interface at considerable cost. To facilitate the use of these tools for microarray data analysis on an open platform we developed an online microarray data analysis platform, WebArray, for bench biologists to utilize these tools to explore data from single/dual color microarray experiments.