TL;DR: This review covers three of the most widely used and comprehensive open source systems for microarray data analysis: the statistical analysis tools written in R through the Bioconductor project, the Java-based TM4 software system available from The Institute for Genomic Research, and BASE, the Web-based system developed at Lund University.
Abstract: DNA microarray assays represent the first widely used application that attempts to build upon the information provided by genome projects in the study of biological questions. One of the greatest challenges in working with microarrays is collecting, managing, and analyzing data. Although several commercial and noncommercial solutions exist, there is a growing body of freely available, open source software that allows users to analyze data using a host of existing techniques and to develop their own and integrate them within the system. Here we review three of the most widely used and comprehensive systems: the statistical analysis tools written in R through the Bioconductor project (http://www.bioconductor.org), the Java-based TM4 software system available from The Institute for Genomic Research (http://www.tigr.org/software), and BASE, the Web-based system developed at Lund University (http://base.thep.lu.se).
TL;DR: The system currently provides parsers to process annotation data from LocusLink, Gene Ontology Consortium, and Human Gene Project and can be extended to new data sources via user defined parsers.
Abstract: SUMMARY AnnBuilder is an R package for assembling genomic annotation data. The system currently provides parsers to process annotation data from LocusLink, Gene Ontology Consortium, and Human Gene Project and can be extended to new data sources via user defined parsers. AnnBuilder differs from other existing systems in that it provides users with unlimited ability to assemble data from user selected sources. The products of AnnBuilder are files in XML format that can be easily used by different systems. AVAILABILITY (http://www.bioconductor.org). Open source.
TL;DR: The model presented addresses the need for annotation sources identified during BioConductor’s development, and provides well-curated genomic background knowledge for expression array analysis and interpretation.
Abstract: This paper presents the implementation of a model for expression array annotation (EAA) using the BioMediator biological data integration system along with BioConductor, an analytic tools platform. The model presented addresses the need for annotation sources identified during BioConductor’s development. Annotation provides us with well-curated genomic background knowledge for expression array analysis and interpretation. Annotation requests are constructed and posted to the query interface of the EAA package (the EAA model implemented as a component of BioConductor). The software enumerates all possible annotation paths for queries. These are then transformed to PQL queries and processed by BioMediator. Annotation entities returned from the EAA package answer the annotation request.
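The path enumeration step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the annotation sources form a directed graph of entity types and that an annotation path is a cycle-free route from the requested start entity to the target entity. The entity names and the `enumerate_paths` helper are hypothetical and are not taken from the EAA package or BioMediator.

```python
def enumerate_paths(graph, start, goal, path=None):
    """Return every cycle-free path from start to goal in a directed graph.

    graph maps an entity type to the entity types it links to.
    """
    path = (path or []) + [start]
    if start == goal:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # skip entities already on this path
            paths.extend(enumerate_paths(graph, nxt, goal, path))
    return paths


# Hypothetical links between annotation entity types (illustration only).
links = {
    "ProbeID": ["GeneID", "Accession"],
    "Accession": ["GeneID"],
    "GeneID": ["GOTerm"],
}

paths = enumerate_paths(links, "ProbeID", "GOTerm")
for p in paths:
    print(" -> ".join(p))
```

In a system like the one described, each enumerated path would then be translated into a query; that translation is outside the scope of this sketch.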
TL;DR: ROOT, an object-oriented framework developed at CERN for distributed data warehousing and data mining of particle data in the petabyte range, is introduced, and the ease with which R could be extended to access ROOT is emphasized.
Abstract: Novel high-throughput technologies such as DNA microarray analyses are allowing biologists to generate sets of data in the terabyte realm. Many of these data will be deposited in the public domain, necessitating a common standard. Currently available database systems are not appropriate for these intentions. In this paper, I will introduce ROOT (http://root.cern.ch), an object-oriented framework that has been developed at CERN for distributed data warehousing and data mining of particle data in the petabyte range. Data are stored as sets of objects in machine-independent files, and specialized methods are used to get direct access to separate attributes of selected data objects. ROOT has been designed in such a way that it can query its databases in parallel on SMP/MPP machines, on clusters of PCs, or using common GRID services. In order to demonstrate the applicability of ROOT to microarray data, I will present a functional prototype system, called XPS - eXpression Profiling System, which can be considered to be an alternative to the Bioconductor project. The current implementation handles the storage of Affymetrix GeneChip schemes and data, and the pre-processing, normalization and filtering of GeneChip data. Based on this system, I will propose a novel standard for the distributed storage of microarray data. Finally, I will emphasize the similarities between R and ROOT, and show how R could be easily extended to access ROOT from within R.
TL;DR: Extracting valuable information from the large amount of data that microarray experiments generate requires suitable and powerful statistical and computational methods.
Abstract: Microarray technology is becoming an essential tool in functional genomics. The possibility of monitoring the expression level of thousands of genes simultaneously, as the response to a particular biological condition, gives biologists the chance to widen the aims of their experiments and opens a door to the understanding of cellular transcription processes. In order to extract valuable information from the large amount of data that microarray experiments generate, suitable and powerful statistical and computational methods are required. An example of the effort of statisticians and computer scientists is the release of the first Bioconductor software and the increasing number of functions for microarray data analysis implemented
TL;DR: The role played by RDBMS in Bioconductor is less pronounced than had been anticipated, but this will change as requirements for query optimization, data structure standardization, and greater volumes of data and metadata emerge.
Abstract: Bioconductor (http://www.bioconductor.org/) is an open source collection of resources aimed at transparently advancing the theory and practice of bioinformatics, with a focus on expression arrays and the R statistical computing environment. I will sketch the key data structures and data flow processes addressed in Bioconductor thus far. I will review the role played by RDBMS in the development and curation of packaged annotation networks and in the analysis of Serial Analysis of Gene Expression (SAGE) libraries. Non-relational database technologies such as BerkeleyDB and HDF5 have also played a role in tools for archiving and navigating expression array data. At present the role of RDBMS in Bioconductor is less pronounced than had been anticipated. This will change as requirements for query optimization, data structure standardization, and greater volumes of data and metadata emerge.
TL;DR: This chapter describes a collection of four R packages for exploratory analysis and normalization of two-color cDNA microarray fluorescence intensity data, developed as part of the Bioconductor project, to produce an open-source and open-development statistical computing framework for the analysis of genomic data.
Abstract: This chapter describes a collection of four R packages for exploratory analysis and normalization of two-color cDNA microarray fluorescence intensity data. R’s object-oriented class/method mechanism is exploited to allow efficient and systematic representation and manipulation of large microarray datasets of multiple types. The marrayClasses package contains class definitions and associated methods for pre- and postnormalization intensity data for batches of arrays. The marrayInput package provides functions and tcltk widgets to automate data input and the creation of microarray-specific R objects for storing these data. Functions for diagnostic plots of microarray spot statistics, such as boxplots, scatterplots, and spatial color images, are provided in marrayPlots. Finally, the marrayNorm package implements robust adaptive location and scale normalization procedures, which correct for different types of dye biases (e.g., intensity, spatial, plate biases) and allow the use of control sequences spotted onto the array and possibly spiked into the mRNA samples. The four new packages were developed as part of the Bioconductor project, which aims more generally to produce an open-source and open-development statistical computing framework for the analysis of genomic data.
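As a rough illustration of location normalization for two-color data, the sketch below computes the usual log-ratio M and average log-intensity A for each spot and subtracts the array-wide median of M. This is only the simplest constant-location case and is an assumption made for illustration; the marrayNorm procedures described above are intensity- and position-dependent and robust, and nothing here reproduces that package's interface. The function names are hypothetical.

```python
import math
from statistics import median


def ma_values(red, green):
    """Return (M, A): per-spot log2 ratios and average log2 intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def median_normalize(m):
    """Global location normalization: centre the log ratios on their median."""
    centre = median(m)
    return [x - centre for x in m]


# Toy intensities for three spots (made-up values for illustration only).
red = [200.0, 400.0, 1600.0]
green = [100.0, 200.0, 400.0]

m, a = ma_values(red, green)
m_norm = median_normalize(m)
```

After this step the median log ratio is zero, which corresponds to assuming that most spots are not differentially expressed and that the dye bias is constant across the array.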
TL;DR: An extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe-level data is described, with examples demonstrating that access to, and methods for, probe-level data improve quality control assessments, normalization, and expression measures.
Abstract: We describe an extensible, interactive environment for data analysis and exploration of Affymetrix oligonucleotide array probe-level data. The software utilities provided with the Affymetrix analysis suite summarize the probe set intensities and make available only one expression measure for each gene. We have developed this package because much can be learned from studying the individual probe intensities or, as we call them, the probe-level data. We provide some examples demonstrating that having access to and methods for probe-level data results in improvements to quality control assessments, normalization, and expression measures. The software is implemented as an add-on package, conveniently named affy, to the freely available and widely used statistical language/software R (Ihaka and Gentleman, 1996). The development of this software as an add-on to R allows us to take advantage of the basic mathematical and statistical functions and powerful graphics capabilities that are provided with R. Our package is distributed as open source code for Linux, Unix, and Microsoft Windows. It is released under the GNU General Public License. It is part of the Bioconductor project and can be obtained from http://www.bioconductor.org.
TL;DR: This paper derives and validates an estimator for the parameters of a transformation for the joint calibration (normalization) and variance stabilization of microarray intensity data and finds that the error decreases with the square root of the number of probes per array and that the estimation is robust against the presence of differentially expressed genes.
Abstract: We derive and validate an estimator for the parameters of a transformation for the joint calibration (normalization) and variance stabilization of microarray intensity data. With this, the variances of the transformed intensities become approximately independent of their expected values. The transformation is similar to the logarithm in the high intensity range, but has a smaller slope for intensities close to zero. Applications have shown better sensitivity and specificity for the detection of differentially expressed genes. In this paper, we describe the theoretical aspects of the method. We incorporate calibration and variance-mean dependence into a statistical model and use a robust variant of the maximum-likelihood method to estimate the transformation parameters. Using simulations, we investigate the size of the estimation error and its dependence on sample size and the presence of outliers. We find that the error decreases with the square root of the number of probes per array and that the estimation is robust against the presence of differentially expressed genes. Software is publicly available as an R package through the Bioconductor project (http://www.bioconductor.org).
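The shape of the transformation described here can be illustrated with the inverse hyperbolic sine, which behaves like a logarithm at high intensities and has a bounded slope near zero. The sketch below assumes the form h(y) = arsinh(a + b*y); the parameter names `a` and `b`, their default values, and the function names are assumptions for illustration, and the sketch does not estimate the parameters as the paper's robust maximum-likelihood method does.

```python
import math


def glog(y, a=0.0, b=1.0):
    """Assumed transformation h(y) = arsinh(a + b*y), offset a and scale b."""
    return math.asinh(a + b * y)


def glog_slope(y, a=0.0, b=1.0):
    """Derivative of h with respect to y: b / sqrt(1 + (a + b*y)^2)."""
    return b / math.sqrt(1.0 + (a + b * y) ** 2)


# High intensity: arsinh(y) approaches log(2*y), so it tracks the logarithm.
high = 1e6
gap_to_log = glog(high) - math.log(2.0 * high)

# Near zero: the slope stays bounded, while the slope of log(y), 1/y, blows up.
low = 0.01
slope_glog = glog_slope(low)
slope_log = 1.0 / low
```

With these defaults the slope at zero is exactly `b`, so small or slightly negative background-corrected intensities are handled without the divergence that a plain logarithm would show.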