TL;DR: A software tool is developed that allows access to the wealth of information within GEO directly from BioConductor, eliminating many of the formatting and parsing problems that have made such analyses labor-intensive in the past.
Abstract: Microarray technology has become a standard molecular biology tool. Experimental data have been generated on a huge number of organisms, tissue types, treatment conditions and disease states. The Gene Expression Omnibus (Barrett et al., 2005), developed by the National Center for Biotechnology Information (NCBI) at the National Institutes of Health, is a repository of nearly 140,000 gene expression experiments. The BioConductor project (Gentleman et al., 2004) is an open-source and open-development software project built in the R statistical programming environment (R Development Core Team, 2005) for the analysis and comprehension of genomic data. The tools contained in the BioConductor project represent many state-of-the-art methods for the analysis of microarray and genomics data. We have developed a software tool that allows access to the wealth of information within GEO directly from BioConductor, eliminating many of the formatting and parsing problems that have made such analyses labor-intensive in the past. The software, called GEOquery, effectively establishes a bridge between GEO and BioConductor. Easy access to GEO data from BioConductor will likely lead to new analyses of GEO data using novel and rigorous statistical and bioinformatic tools. Facilitating analyses and meta-analyses of microarray data will increase the efficiency with which biologically important conclusions can be drawn from published genomic data. Availability: GEOquery is available as part of the BioConductor project.
TL;DR: The capabilities of GOstats are discussed. GOstats is a Bioconductor package written in R that allows users to test GO terms for over- or under-representation using either a classical hypergeometric test or a conditional hypergeometric test that uses the relationships among GO terms to decorrelate the results.
Abstract: Motivation: Functional analyses based on the association of Gene Ontology (GO) terms to genes in a selected gene list are useful bioinformatic tools and the GOstats package has been widely used to perform such computations. In this paper we report significant improvements and extensions such as support for conditional testing.
Results: We discuss the capabilities of GOstats, a Bioconductor package written in R, that allows users to test GO terms for over or under-representation using either a classical hypergeometric test or a conditional hypergeometric that uses the relationships among GO terms to decorrelate the results.
Availability: GOstats is available as an R package from the Bioconductor project: http://bioconductor.org
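The classical hypergeometric test mentioned in the GOstats abstract can be illustrated with a minimal Python sketch. This is not GOstats code: the function name and its arguments are hypothetical, and only the unconditional over-representation test is shown (the conditional test that uses the GO graph structure is not reproduced here).

```python
from math import comb


def hypergeom_upper_tail(universe_size, annotated, selected, overlap):
    """Over-representation P-value P(X >= overlap).

    universe_size: number of genes in the gene universe
    annotated:     genes in the universe annotated with the GO term
    selected:      size of the selected gene list
    overlap:       selected genes that carry the GO term
    """
    max_overlap = min(annotated, selected)
    total = comb(universe_size, selected)
    # Sum the hypergeometric probabilities from the observed overlap upwards.
    tail = sum(
        comb(annotated, k) * comb(universe_size - annotated, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
    print(hypergeom_upper_tail(10, 4, 3, 2))
```

A small P-value suggests that the term appears in the selected list more often than expected by chance; an under-representation test would sum the lower tail instead.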
TL;DR: A review, published in the Journal of the American Statistical Association, of the book Bioinformatics and Computational Biology Solutions Using R and Bioconductor.
Abstract: (2007). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Journal of the American Statistical Association: Vol. 102, No. 477, pp. 388-389.
TL;DR: pcaMethods is a Bioconductor compliant library for computing principal component analysis (PCA) on incomplete data sets; the results can be analyzed directly or used to estimate missing values to enable the use of missing value sensitive statistical methods.
Abstract: Summary: pcaMethods is a Bioconductor compliant library for computing principal component analysis (PCA) on incomplete data sets. The results can be analyzed directly or used to estimate missing values to enable the use of missing value sensitive statistical methods. The package was mainly developed with microarray and metabolite data sets in mind, but can be applied to any other incomplete data set as well.
Availability: http://www.bioconductor.org
Contact: selbig@mpimp-golm.mpg.de
Supplementary information: Please visit our webpage at http://bioinformatics.mpimp-golm.mpg.de/
TL;DR: A hybrid approach to obtain the P-value of the test statistic in linear time is presented, and simulations show that it provides a substantial gain in speed with only a negligible loss in accuracy and that an early stopping rule further increases speed.
Abstract: Motivation: Array CGH technologies enable the simultaneous measurement of DNA copy number for thousands of sites on a genome. We developed the circular binary segmentation (CBS) algorithm to divide the genome into regions of equal copy number. The algorithm tests for change-points using a maximal t-statistic with a permutation reference distribution to obtain the corresponding P-value. The number of computations required for the maximal test statistic is O(N^2), where N is the number of markers. This makes the full permutation approach computationally prohibitive for the newer arrays that contain tens of thousands of markers and highlights the need for a faster algorithm.
Results: We present a hybrid approach to obtain the P-value of the test statistic in linear time. We also introduce a rule for stopping early when there is strong evidence for the presence of a change. We show through simulations that the hybrid approach provides a substantial gain in speed with only a negligible loss in accuracy and that the stopping rule further increases speed. We also present the analyses of array CGH data from breast cancer cell lines to show the impact of the new approaches on the analysis of real data.
Availability: An R version of the CBS algorithm has been implemented in the "DNAcopy" package of the Bioconductor project. The proposed hybrid method for the P-value is available in version 1.2.1 or higher and the stopping rule for declaring a change early is available in version 1.5.1 or higher.
Contact: venkatre@mskcc.org
Supplementary information: Supplementary data are available at Bioinformatics online.
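To make the O(N^2) cost described in the CBS abstract concrete, here is a minimal Python sketch of a full permutation test on a maximal arc-versus-complement statistic. It is a simplification under stated assumptions, not the DNAcopy implementation and not the hybrid method: the statistic is left unscaled by a variance estimate (a common scale factor would not change the permutation P-value), and all function names are hypothetical.

```python
import random
from itertools import accumulate


def max_arc_statistic(values):
    """Largest standardized difference between an arc i..j and its complement.

    Every pair (i, j) is examined, so the cost is O(N^2) per call.
    """
    n = len(values)
    prefix = [0.0] + list(accumulate(values))
    total = prefix[n]
    best = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            length = j - i
            if length == n:
                continue  # the arc covers every marker, so there is no complement
            arc_mean = (prefix[j] - prefix[i]) / length
            rest_mean = (total - prefix[j] + prefix[i]) / (n - length)
            scale = (1 / length + 1 / (n - length)) ** 0.5
            best = max(best, abs(arc_mean - rest_mean) / scale)
    return best


def permutation_p_value(values, n_perm=200, seed=0):
    """Full permutation reference distribution: n_perm * O(N^2) work."""
    rng = random.Random(seed)
    observed = max_arc_statistic(values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_arc_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy profile: a gained segment of 5 markers inside 25 markers.
    profile = [0.0] * 10 + [5.0] * 5 + [0.0] * 10
    print(max_arc_statistic(profile), permutation_p_value(profile))
```

With tens of thousands of markers, the nested loop repeated for every permutation is what becomes prohibitive, which is the cost the hybrid approach and the early stopping rule are meant to reduce.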
TL;DR: The Bioconductor project as discussed by the authors is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics, which aims to foster collaborative development and widespread use of innovative software, reduce barriers to entry into interdisciplinary scientific research, and promote the achievement of remote reproducibility of research results.
Abstract: The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
TL;DR: A preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease is described.
Abstract: Summary: In most microarray technologies, a number of critical steps are required to convert raw intensity measurements into the data relied upon by data analysts, biologists, and clinicians. These data manipulations, referred to as preprocessing, can influence the quality of the ultimate measurements. In the last few years, the high-throughput measurement of gene expression has been the most popular application of microarray technology. For this application, various groups have demonstrated that the use of modern statistical methodology can substantially improve accuracy and precision of the gene expression measurements, relative to ad hoc procedures introduced by designers and manufacturers of the technology. Currently, other applications of microarrays are becoming more and more popular. In this paper, we describe a preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease. In particular, we describe a methodology useful for preprocessing Affymetrix single-nucleotide polymorphism chips and obtaining genotype calls with the preprocessed data. We demonstrate how our procedure improves on existing approaches using data from 3 relatively large studies, including one in which large numbers of independent calls are available. The proposed methods are implemented in the package oligo available from Bioconductor.
TL;DR: X:Map is a genome annotation database that provides information needed to associate each reporter on the exon array with the features of the genome it is targeting, and to relate these to gene and genome structure.
Abstract: Affymetrix exon arrays aim to target every known and predicted exon in the human, mouse or rat genomes, and have reporters that extend beyond protein coding regions to other areas of the transcribed genome. This combination of increased coverage and precision is important because a substantial proportion of protein coding genes are predicted to be alternatively spliced, and because many non-coding genes are known also to be of biological significance. In order to fully exploit these arrays, it is necessary to associate each reporter on the array with the features of the genome it is targeting, and to relate these to gene and genome structure. X:Map is a genome annotation database that provides this information. Data can be browsed using a novel Google-maps based interface, and analysed and further visualized through an associated BioConductor package. The database can be found at http://xmap.picr.man.ac.uk.
TL;DR: A general probabilistic framework for combining high-throughput genomic data from several related microarray experiments using mixture models, which considers two methods for estimation of an index termed the probability of expression (POE).
Abstract: Background: With the explosion in data generated using microarray technology by different investigators working on similar experiments, it is of interest to combine results across multiple studies. Results: In this article, we describe a general probabilistic framework for combining high-throughput genomic data from several related microarray experiments using mixture models. A key feature of the model is the use of latent variables that represent quantities that can be combined across diverse platforms. We consider two methods for estimation of an index termed the probability of expression (POE). The first, reported in previous work by the authors, involves Markov Chain Monte Carlo (MCMC) techniques. The second method is a faster algorithm based on the expectation-maximization (EM) algorithm. The methods are illustrated with application to a meta-analysis of datasets for metastatic cancer. Conclusion: The statistical methods described in the paper are available as an R package, metaArray 1.8.1, which is available from Bioconductor at http://www.bioconductor.org/.
TL;DR: In this paper, specific procedures for conducting quality assessment of Affymetrix GeneChip(R) soybean genome data and for performing analyses to determine differential gene expression using the open-source R programming environment in conjunction with the open source Bioconductor software are described.
Abstract: This article describes specific procedures for conducting quality assessment of Affymetrix GeneChip(R) soybean genome data and for performing analyses to determine differential gene expression using the open-source R programming environment in conjunction with the open-source Bioconductor software. We describe procedures for extracting those Affymetrix probe set IDs related specifically to the soybean genome on the Affymetrix soybean chip and demonstrate the use of exploratory plots including images of raw probe-level data, boxplots, density plots and M versus A plots. RNA degradation and recommended procedures from Affymetrix for quality control are discussed. An appropriate probe-level model provides an excellent quality assessment tool. To demonstrate this, we discuss and display chip pseudo-images of weights, residuals and signed residuals and additional probe-level modeling plots that may be used to identify aberrant chips. The Robust Multichip Averaging (RMA) procedure was used for background correction, normalization and summarization of the AffyBatch probe-level data to obtain expression level data and to discover differentially expressed genes. Examples of boxplots and MA plots are presented for the expression level data. Volcano plots and heatmaps are used to demonstrate the use of (log) fold changes in conjunction with ordinary and moderated t-statistics for determining interesting genes. We show, with real data, how implementation of functions in R and Bioconductor successfully identified differentially expressed genes that may play a role in soybean resistance to a fungal pathogen, Phakopsora pachyrhizi. Complete source code for performing all quality assessment and statistical procedures may be downloaded from our web source: http://css.ncifcrf.gov/services/download/MicroarraySoybean.zip.
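The M versus A plots mentioned in the soybean abstract rest on a simple transformation, sketched below in Python. This is an illustration only and not the authors' R code: the function name is hypothetical, and the inputs are assumed to be positive intensities for the same probe sets on two arrays (or on one array and a reference).

```python
from math import log2


def ma_values(intensities_a, intensities_b):
    """Return (M, A) pairs for matched positive intensities.

    M is the log2 ratio (log fold change) and A is the average log2 intensity.
    """
    pairs = []
    for a, b in zip(intensities_a, intensities_b):
        m = log2(a) - log2(b)
        avg = 0.5 * (log2(a) + log2(b))
        pairs.append((m, avg))
    return pairs


if __name__ == "__main__":
    # Toy intensities for two probe sets on two arrays.
    for m, avg in ma_values([8.0, 4.0], [2.0, 4.0]):
        print(f"M = {m:.2f}, A = {avg:.2f}")
```

Plotting M against A makes intensity-dependent bias visible: after a successful normalization the cloud of points is expected to centre on M = 0 across the range of A.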
TL;DR: This work uses a 'top-down' approach to perform domain aggregation by first combining gene expressions before testing for differentially expressed patterns, in contrast to the more standard 'bottom-up' approach.
Abstract: Motivation: New biological systems technologies give scientists the ability to measure thousands of bio-molecules including genes, proteins, lipids and metabolites. We use domain knowledge, e.g. the Gene Ontology, to guide analysis of such data. By focusing on domain-aggregated results at, say the molecular function level, increased interpretability is available to biological scientists beyond what is possible if results are presented at the gene level.
Results: We use a ‘top–down’ approach to perform domain aggregation by first combining gene expressions before testing for differentially expressed patterns. This is in contrast to the more standard ‘bottom–up’ approach, where genes are first tested individually then aggregated by domain knowledge. The benefits are greater sensitivity for detecting signals. Our method, domain-enhanced analysis (DEA) is assessed and compared to other methods using simulation studies and analysis of two publicly available leukemia data sets.
Availability: Our DEA method uses functions available in R ( http://www.r-project.org/) and SAS (http://www.sas.com/). The two experimental data sets used in our analysis are available in R as Bioconductor packages, ‘ALL’ and ‘golubEsets’ (http://www.bioconductor.org/).
Contact: jliu6@stat.ncsu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
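The 'top-down' idea in the DEA abstract (combine gene expressions within a domain first, then test) can be sketched in Python as follows. The abstract does not say how expressions are combined or which test is used, so the per-sample mean and the Welch t-statistic here are assumptions, and all names are hypothetical.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch two-sample t-statistic (unequal variances)."""
    se = (variance(group_a) / len(group_a) + variance(group_b) / len(group_b)) ** 0.5
    return (mean(group_a) - mean(group_b)) / se


def top_down_scores(expression, domains, samples_a, samples_b):
    """Aggregate genes per domain within each sample, then test once per domain.

    expression: gene -> {sample: value}
    domains:    domain name -> list of genes (e.g. a GO molecular function)
    """
    scores = {}
    for domain, genes in domains.items():
        agg_a = [mean(expression[g][s] for g in genes) for s in samples_a]
        agg_b = [mean(expression[g][s] for g in genes) for s in samples_b]
        scores[domain] = welch_t(agg_a, agg_b)
    return scores


if __name__ == "__main__":
    expression = {
        "g1": {"s1": 1, "s2": 3, "s3": 5, "s4": 7},
        "g2": {"s1": 3, "s2": 5, "s3": 7, "s4": 9},
    }
    domains = {"toy_domain": ["g1", "g2"]}
    print(top_down_scores(expression, domains, ["s3", "s4"], ["s1", "s2"]))
```

A 'bottom-up' analysis would instead call the test once per gene and aggregate the per-gene results afterwards.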
TL;DR: The availability of a package for reading and analyzing data from GE Healthcare Gene Expression Bioarrays within the R environment is reported, which is implemented in the R language and available for download free of charge.
Abstract: Motivation: Microarray-based expression profiles have become a standard methodology in any high-throughput analysis. Several commercial platforms are available, each with its strengths and weaknesses. The R platform for statistical analysis and graphics is a powerful environment for the analysis of microarray data, because it has many integrated statistical methods available as well as the specialized microarray analysis project Bioconductor. Many packages have been added in the last few years increasing the range of possible analysis. Here, we report the availability of a package for reading and analyzing data from GE Healthcare Gene Expression Bioarrays within the R environment. Availability: The software is implemented in the R language, is open source and available for download free of charge through the Bioconductor (http://www.bioconductor.org) project. Contact: diez@kuicr.kyoto-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
TL;DR: The R package SNPchip contains classes and methods useful for storing, visualizing and analyzing high density SNP data, including the ability to build statistical models for SNP-level data that operate on instances of the class, and to communicate with other R packages that add additional functionality.
Abstract: Summary: High-density single nucleotide polymorphism microarrays (SNP chips) provide information on a subject's genome, such as copy number and genotype (heterozygosity/homozygosity) at a SNP. While fluorescence in situ hybridization and karyotyping reveal many abnormalities, SNP chips provide a higher resolution map of the human genome that can be used to detect, e.g., aneuploidies, microdeletions, microduplications and loss of heterozygosity (LOH). As a variety of diseases are linked to such chromosomal abnormalities, SNP chips promise new insights for these diseases by aiding in the discovery of such regions, and may suggest targets for intervention. The R package SNPchip contains classes and methods useful for storing, visualizing and analyzing high density SNP data. Originally developed from the SNPscan web-tool, SNPchip utilizes S4 classes and extends other open source R tools available at Bioconductor. This has numerous advantages, including the ability to build statistical models for SNP-level data that operate on instances of the class, and to communicate with other R packages that add additional functionality.
Availability: The package is available from the Bioconductor web page at www.bioconductor.org
Contact: ingo@jhu.edu
Supplementary information: The supplementary material as described in this article (case studies, installation guidelines and R code) is available from http://biostat.jhsph.edu/~iruczins/publications/sm/
TL;DR: Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is a high-throughput assay for DNA-protein-binding or post-translational chromatin/histone modifications; the enriched regions it identifies need to be bioinformatically annotated and compared to related datasets by statistical methods.
TL;DR: The central concepts and implementation of data structures and methods for studying genetics of gene expression with the GGtools package of Bioconductor are reviewed.
Abstract: Summary: This paper reviews the central concepts and implementation of data structures and methods for studying genetics of gene expression with the GGtools package of Bioconductor. Illustration with a HapMap+expression dataset is provided.
Availability: Package GGtools is part of Bioconductor 1.9 (http://bioconductor.org). Open source with Artistic License.
Contact: stvjc@channing.harvard.edu
TL;DR: A new Bioconductor package 'CALIB' for normalization of two-color microarray data is described, based on the measurements of external controls and estimates an absolute target level for each gene and condition pair, as opposed to working with log-ratios as a relative measure of expression.
Abstract: In this article we describe a new Bioconductor package ‘CALIB’ for normalization of two-color microarray data. This approach is based on the measurements of external controls and estimates an absolute target level for each gene and condition pair, as opposed to working with log-ratios as a relative measure of expression. Moreover, this method makes no assumptions regarding the distribution of gene expression divergence. Availability: http://bioconductor.org/packages/2.0/bioc Open Source Contact: Kathleen.marchal@biw.kuleuven.be
TL;DR: This paper proposes a mixture model solution specifically designed for single-point estimation, that provides various advantages over the existing methodology and uses a 314 sample database, constructed with public datasets, to motivate and fit models for the conditional distribution of the observed intensities given allele specific copy numbers.
Abstract: Genomic changes such as copy number alterations are thought to be one of the major underlying causes of human phenotypic variation among normal and disease subjects [23,11,25,26,5,4,7,18]. These include chromosomal regions with so-called copy number alterations: instead of the expected two copies, a section of the chromosome for a particular individual may have zero copies (homozygous deletion), one copy (hemizygous deletions), or more than two copies (amplifications). The canonical example is Down syndrome which is caused by an extra copy of chromosome 21. Identification of such abnormalities in smaller regions has been of great interest, because it is believed to be an underlying cause of cancer.
More than one decade ago comparative genomic hybridization (CGH) technology was developed to detect copy number changes in a high-throughput fashion. However, this technology only provides a 10 Mb resolution, which limits the ability to detect copy number alterations spanning small regions. It is widely believed that a copy number alteration as small as one base can have significant downstream effects, thus microarray manufacturers have developed technologies that provide much higher resolution. Unfortunately, strong probe effects and variation introduced by sample preparation procedures have made single-point copy number estimates too imprecise to be useful. CGH arrays use a two-color hybridization, usually comparing a sample of interest to a reference sample, which to some degree removes the probe effect. However, the resolution is not nearly high enough to provide single-point copy number estimates.
Various groups have proposed statistical procedures that pool data from neighboring locations to successfully improve precision. However, these procedures need to average across relatively large regions to work effectively, thus greatly reducing the resolution. Recently, regression-type models that account for probe effect have been proposed and appear to improve accuracy as well as precision. In this paper, we propose a mixture model solution specifically designed for single-point estimation that provides various advantages over the existing methodology. We use a 314 sample database, constructed with public datasets, to motivate and fit models for the conditional distribution of the observed intensities given allele-specific copy numbers. With the estimated models in place we can compute posterior probabilities that provide a useful prediction rule as well as a confidence measure for each call. Software to implement this procedure will be available in the Bioconductor oligo package (http://www.bioconductor.org).
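The step from fitted conditional distributions to posterior probabilities is an application of Bayes' rule, sketched below in Python. This is an illustration under loud assumptions: the normal densities, priors and parameter values are hypothetical stand-ins, not the models the authors fit to their 314 sample database, and the function names are invented for the example.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def posterior_copy_number(intensity, components):
    """Posterior probability of each copy number given one observed intensity.

    components: copy number -> (prior, mean, sd) of an assumed normal model.
    """
    weighted = {
        cn: prior * normal_pdf(intensity, mu, sd)
        for cn, (prior, mu, sd) in components.items()
    }
    total = sum(weighted.values())
    return {cn: w / total for cn, w in weighted.items()}


if __name__ == "__main__":
    # Hypothetical two-component model on a log-intensity scale.
    model = {1: (0.5, -1.0, 1.0), 2: (0.5, 1.0, 1.0)}
    post = posterior_copy_number(1.0, model)
    call = max(post, key=post.get)
    print(call, post[call])
```

The copy number with the largest posterior serves as the call (the prediction rule), and the posterior value itself serves as the confidence measure for that call.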
TL;DR: The approach consists of data capture and modeling processes rooted in R/Bioconductor, sample annotation and sequence constituent ontology management based in R, secure data archiving in PostgreSQL, and browser-based workspace creation and management rooted in Zope.
Abstract: This paper describes a framework for collecting, annotating, and archiving high-throughput assays from multiple experiments conducted on one or more series of samples. Specific applications include support for large-scale surveys of related transcriptional profiling studies, for investigations of the genetics of gene expression and for joint analysis of copy number variation and mRNA abundance. Our approach consists of data capture and modeling processes rooted in R/Bioconductor, sample annotation and sequence constituent ontology management based in R, secure data archiving in PostgreSQL, and browser-based workspace creation and management rooted in Zope. This effort has generated a completely transparent, extensible, and customizable interface to large archives of high-throughput assays. Sources and prototype interfaces are accessible at www.sgdi.org/software.
TL;DR: A simple method for performing and visualizing sample size calculations for microarray experiments as implemented in the ssize R package, which is available from the Bioconductor project (http://www.bioconductor.org) web site.
Abstract: RNA Expression Microarray technology is widely applied in biomedical and pharmaceutical research. The huge number of RNA concentrations estimated for each sample makes it difficult to apply traditional sample size calculation techniques and has left most practitioners to rely on rule-of-thumb techniques. In this paper, we briefly describe and then demonstrate a simple method for performing and visualizing sample size calculations for microarray experiments as implemented in the ssize R package, which is available from the Bioconductor project (http://www.bioconductor.org) web site.
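A per-gene sample size calculation of the general kind described in the ssize abstract can be sketched in Python. This is not the ssize package's method: it uses the textbook normal approximation for a two-sample comparison, and the function names, default significance level and default power are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(sd, delta, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided two-sample test.

    sd:    assumed standard deviation of one gene's (log) expression
    delta: smallest difference in means that should be detectable
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def fraction_of_genes_covered(gene_sds, n, delta, alpha=0.05, power=0.8):
    """Fraction of genes whose required sample size is at most n per group."""
    needed = [samples_per_group(sd, delta, alpha, power) for sd in gene_sds]
    return sum(k <= n for k in needed) / len(needed)


if __name__ == "__main__":
    sds = [0.5, 1.0, 2.0]  # hypothetical per-gene standard deviations
    print([samples_per_group(sd, delta=1.0) for sd in sds])
    print(fraction_of_genes_covered(sds, n=16, delta=1.0))
```

Plotting the covered fraction against n gives a cumulative curve from which a sample size can be read off for a target proportion of genes; with thousands of genes, alpha would normally be lowered to account for multiple testing.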
TL;DR: An effort to bring the R package repositories and the Debian Linux distribution together provides a unique statistical environment: essentially all CRAN, BioConductor and Omegahat packages can be installed automatically onto Debian (or Ubuntu) from pre-built binary packages with a single command.
Abstract: Within the world of the R system, language and environment, the CRAN and BioConductor archives have achieved remarkable success in attracting a consistent inflow of new packages of high quality contributions and extensions to the R system. At the same time, the Debian distribution (and its derivatives such as Ubuntu) has continued to make it easier for users to obtain a consistent and complete software installation. In Debian’s case, this has resulted in an unprecedented ten installable architectures. For Ubuntu, a focus on easier installation and added polish means that the ’barriers to entry’ for new users have been lowered, which has resulted in increased market- and mind share for Debian and Ubuntu. This paper presents an effort to bring the R package repositories and the Debian Linux distribution together. This provides a unique statistical environment: essentially all CRAN, BioConductor and Omegahat packages can be installed automatically onto Debian (or Ubuntu) from pre-built binary packages with a single command. Our initial reference builds cover well over 1700 packages taken from the CRAN, BioConductor and Omegahat repositories.
TL;DR: The package Ringo deals with the analysis of two-color oligonucleotide microarrays used in ChIP-chip projects; it employs functions from other packages of the Bioconductor project and provides additional ChIP-chip-specific and NimbleGen-specific functionalities.
Abstract: The package Ringo deals with the analysis of two-color oligonucleotide microarrays used in ChIP-chip projects. The package was started to facilitate the analysis of two-color microarrays from the company NimbleGen1, but the package has a modular design, such that the platform-specific functionality is encapsulated and analogous two-color tiling array platforms can also be processed. The package employs functions from other packages of the Bioconductor project (Gentleman et al., 2004) and provides additional ChIP-chip-specific and NimbleGen-specific functionalities.
TL;DR: This review aims to make current techniques of statistical design, normalisation and linear analysis of cDNA microarray experiments accessible to a wider community.
Abstract: This review paper is aimed at biological researchers who are interested in or have begun to use cDNA microarrays for their investigations. Large microarray studies typically involve a multi-disciplinary team with various groups performing different aspects of the same experiment. This approach means that microarrays are less accessible to new researchers than more traditional biological techniques. This review aims to make current techniques of statistical design, normalisation and linear analysis of cDNA microarray experiments accessible to a wider community. These methods will be illustrated with examples that use freely-available packages implemented in Bioconductor and R.
TL;DR: GeneAnnot-based custom CDFs solve the problem of a reliable reconstruction of expression levels and eliminate the existence of more than one probeset per gene, which often leads to discordant expression signals for the same transcript when gene differential expression is the focus of the analysis.
Abstract: Background: Improvements in genome sequence annotation revealed discrepancies in the original probeset/gene assignment in Affymetrix microarray and the existence of differences between annotations and effective alignments of probes and transcription products. In the current generation of Affymetrix human GeneChips, most probesets include probes matching transcripts from more than one gene and probes which do not match any transcribed sequence. Results: We developed a novel set of custom Chip Definition Files (CDF) and the corresponding Bioconductor libraries for Affymetrix human GeneChips, based on the information contained in the GeneAnnot database. GeneAnnot-based CDFs are composed of unique custom-probesets, including only probes matching a single gene. Conclusion: GeneAnnot-based custom CDFs solve the problem of a reliable reconstruction of expression levels and eliminate the existence of more than one probeset per gene, which often leads to discordant expression signals for the same transcript when gene differential expression is the focus of the analysis. GeneAnnot CDFs are freely distributed and fully compliant with Affymetrix standards and all available software for gene expression analysis. The CDF libraries are available from http://www.xlab.unimo.it/GA_CDF, along with supplementary information (CDF libraries, installation guidelines and R code, CDF statistics, and analysis results).
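The rule that a custom probeset includes only probes matching a single gene can be sketched in Python. The data structure and function name are hypothetical; the real CDFs are built from alignments recorded in the GeneAnnot database, which this toy example does not reproduce.

```python
from collections import defaultdict


def build_custom_probesets(probe_to_genes):
    """Group probes into one probeset per gene, keeping single-gene probes only.

    probe_to_genes: probe ID -> list of gene IDs the probe sequence matches.
    Probes matching no gene or more than one gene are discarded.
    """
    probesets = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        unique_genes = set(genes)
        if len(unique_genes) == 1:
            probesets[next(iter(unique_genes))].append(probe)
    return dict(probesets)


if __name__ == "__main__":
    matches = {
        "p1": ["G1"],
        "p2": ["G1", "G2"],  # ambiguous: matches two genes
        "p3": [],            # matches no transcribed sequence
        "p4": ["G2"],
        "p5": ["G1"],
    }
    print(build_custom_probesets(matches))
```

Because every retained probe maps to exactly one gene and all probes for a gene are pooled, each gene ends up with a single probeset.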
TL;DR: oneChannelGUI is an add-on Bioconductor package providing a new set of functions extending the capability of the affylmGUI package, providing a graphical interface (GUI) for Bioconductor libraries to be used for quality control, normalization, filtering, statistical validation and data mining for single channel microarrays.
Abstract: Summary: oneChannelGUI is an add-on Bioconductor package providing a new set of functions extending the capability of the affylmGUI package. This library provides a graphical interface (GUI) for Bioconductor libraries to be used for quality control, normalization, filtering, statistical validation and data mining for single channel microarrays. Affymetrix 3' expression (IVT) arrays as well as the new whole transcript expression arrays, i.e. gene/exon 1.0 ST, are currently implemented. oneChannelGUI is available for most platforms on which R runs, i.e. Windows and Unix-like machines. Availability: http://www.bioconductor.org/packages/2.0/bioc/html/oneChannelGUI.html
TL;DR: A brief introduction is given to graph theoretical concepts and their areas of application in molecular biology, and a simple application to the integration of a protein-protein interaction and a co-expression network is presented.
Abstract: Graph theoretical concepts are useful for the description and analysis of interactions and relationships in biological systems. We give a brief introduction into some of the concepts and their areas of application in molecular biology. We discuss software that is available through the Bioconductor project and present a simple example application to the integration of a protein-protein interaction and a co-expression network.
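One simple way to integrate two undirected networks is to keep the edges they share. The Python sketch below illustrates that idea only; the abstract does not state how the authors combine the protein-protein interaction and co-expression networks, so the intersection rule and the function name are assumptions.

```python
def shared_edges(ppi_edges, coexpression_edges):
    """Edges present in both undirected networks.

    Each edge is a pair of node names; direction is ignored by storing
    every edge as a frozenset.
    """
    ppi = {frozenset(edge) for edge in ppi_edges}
    coexpression = {frozenset(edge) for edge in coexpression_edges}
    return ppi & coexpression


if __name__ == "__main__":
    ppi = [("A", "B"), ("B", "C")]
    coexp = [("B", "A"), ("C", "D")]
    for edge in shared_edges(ppi, coexp):
        print(sorted(edge))
```

Gene pairs that both interact physically and are co-expressed are supported by two independent lines of evidence, which is one reason an intersection is a common starting point.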
TL;DR: Affymetrix exon arrays contain probesets intended to target every known and predicted exon in the entire genome, posing significant challenges for high-throughput genome-wide data analysis.
Abstract: Affymetrix exon arrays contain probesets intended to target every known and predicted exon in the entire genome, posing significant challenges for high-throughput genome-wide data analysis. X:MAP http://xmap.picr.man.ac.uk, an annotation database, and exonmap http://www.bioconductor.org/packages/2.0/bioc/html/exonmap.html, a BioConductor/R package, are designed to support fine-grained analysis of exon array data. The system supports the application of standard statistical techniques, prior to the use of genome scale annotation to provide gene-, transcript- and exon-level summaries and visualization tools.
TL;DR: Ringo, a free, open-source R package, is presented that facilitates the analysis of ChIP-chip experiments by providing functionality for data import, quality assessment, normalization and visualization of the data, and the detection of ChIP-enriched genomic regions.
Abstract: Background
Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is a high-throughput assay for DNA-protein-binding or post-translational chromatin/histone modifications. However, the raw microarray intensity readings themselves are not immediately useful to researchers, but require a number of bioinformatic analysis steps. Identified enriched regions need to be bioinformatically annotated and compared to related datasets by statistical methods.
TL;DR: The availability of the RefPlus package is reported; it contains functions to perform the Extrapolation Strategy and Extrapolation Averaging algorithms, which address a limitation of RMA.
Abstract: Summary: RMA has become a widely used methodology to preprocess Affymetrix gene expression microarrays. A limitation of RMA is that the calculated probeset intensities change when a set of microarrays is re-pre-processed after the inclusion of additional microarrays into the analysis set. Here we report the availability of the RefPlus package containing functions to perform the Extrapolation Strategy and Extrapolation Averaging algorithms which address these issues. Availability: The software is implemented in the R language and can be downloaded from the Bioconductor project website (http://
TL;DR: BGX is a new Bioconductor R package that implements an integrated Bayesian approach to the analysis of 3' GeneChip data that performs well relative to other widely used methods at estimating expression levels and fold changes.
Abstract: Affymetrix 3' GeneChip microarrays are widely used to profile the expression of thousands of genes simultaneously. They differ from many other microarray types in that GeneChips are hybridised using a single labelled extract and because they contain multiple 'match' and 'mismatch' sequences for each transcript. Most algorithms extract the signal from GeneChip experiments in a sequence of separate steps, including background correction and normalisation, which inhibits the simultaneous use of all available information. They principally provide a point estimate of gene expression and, in contrast to BGX, do not fully integrate the uncertainty arising from potentially heterogeneous responses of the probes. BGX is a new Bioconductor R package that implements an integrated Bayesian approach to the analysis of 3' GeneChip data. The software takes into account additive and multiplicative error, non-specific hybridisation and replicate summarisation in the spirit of the model outlined in [1]. It also provides a posterior distribution for the expression of each gene. Moreover, BGX can take into account probe affinity effects from probe sequence information where available. The package employs a novel adaptive Markov chain Monte Carlo (MCMC) algorithm that considerably raises the efficiency with which the posterior distributions are sampled. Finally, BGX incorporates various ways to analyse the results, such as ranking genes by expression level as well as statistically based methods for estimating the number of up- and down-regulated genes between two conditions. BGX performs well relative to other widely used methods at estimating expression levels and fold changes. It has the advantage that it provides a statistically sound measure of uncertainty for its estimates. BGX includes various analysis functions to visualise and exploit the rich output that is produced by the Bayesian model.
TL;DR: A statistical method in R, called presence-absence calls with negative probesets (PANP), is presented; it uses sets of Affymetrix-reported probes with no known hybridization partners on two chip sets: HG-U133A and HG-U133 Plus 2.0.
Abstract: The method currently most used for probeset detection calls on Affymetrix GeneChip® Human Genome Arrays is provided as part of the MAS5 software. The MAS method uses Wilcoxon statistics for determining presence-absence (MAS-P/A) calls. However, MAS-P/A is only usable with MAS5 processing, which requires the use of both perfect match (PM) and mismatch (MM) probe data in order to call the resulting probeset present or absent. A considerable amount of recent research has convincingly shown that using MM data in gene expression analysis may be problematic. The RMA method, which uses PM data only, is one method that has been developed in response to this. However, there is no publicly available method that works with PM-only expression data to establish presence or absence of genes from the probesets in microarray data. It seems desirable to decouple the method used to generate gene expression values from the method used to make gene detection calls. We have therefore developed a statistical method in R, called presence-absence calls with negative probesets (PANP), which uses sets of Affymetrix-reported probes with no known hybridization partners on two chip sets: HG-U133A and HG-U133 Plus 2.0. PANP allows the use of any Affymetrix microarray data preprocessing method to generate expression values, including PM-only methods as well as PM and MM methods. We present our results on PANP and its performance using the set of 28 HG-U133A chips from a published Affymetrix Latin squares spike-in dataset as well as an internal TaqMan-validated human tissue dataset on the HG-U133 Plus 2.0 chipsets. We find that using these datasets, PANP out-performs the MAS-P/A method in several metrics of accuracy and precision using a variety of preprocessing methods: RMA, GCRMA, and even MAS5 itself. PANP out-performs MAS-P/A in probeset detection across a full range of concentrations, especially with low concentration transcripts.
An R software package has been prepared for PANP and is available in R as part of the Bioconductor package release at http://www.bioconductor.org.
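The general idea of making detection calls from negative probesets can be sketched as follows. This is a minimal Python illustration and not the PANP package's code: it assumes that a call is made by comparing each probeset's expression value with the empirical distribution of the negative-probeset values, and the function names, the 0.05 cutoff and all numbers are hypothetical.

```python
from bisect import bisect_left

def empirical_p_value(value, sorted_negatives):
    """Fraction of negative-probeset values that are >= the given value."""
    n = len(sorted_negatives)
    n_below = bisect_left(sorted_negatives, value)  # negatives strictly below value
    return (n - n_below) / n

def detection_calls(expression, negative_values, cutoff=0.05):
    """Return 'P' (present) or 'A' (absent) per probeset; the cutoff is an assumption."""
    negatives = sorted(negative_values)
    calls = {}
    for probeset, value in expression.items():
        p = empirical_p_value(value, negatives)
        calls[probeset] = "P" if p <= cutoff else "A"
    return calls

# Invented log-scale expression values, for illustration only.
negative_values = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
expression = {"probeset_1": 9.0, "probeset_2": 3.2, "probeset_3": 6.5}

calls = detection_calls(expression, negative_values)
print(calls)  # {'probeset_1': 'P', 'probeset_2': 'A', 'probeset_3': 'A'}
```

Because the calls depend only on expression values, the same procedure can be applied after any preprocessing method, which is the decoupling the abstract argues for.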