Scispace (Formerly Typeset)
Showing papers on "Normalization (statistics)" published in 2013
Journal Article•10.1093/BIOINFORMATICS/BTS680•
A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data

Andrew E. Teschendorff1, Francesco Marabita1, Matthias Lechner1, Thomas E. Bartlett1, Jesper Tegnér1, David Gomez-Cabrero1, Stephan Beck1 •
University College London1
01 Jan 2013-Bioinformatics
TL;DR: A novel model-based intra-array normalization strategy for 450 k data, called BMIQ (Beta MIxture Quantile dilation), to adjust the beta-values of type2 design probes into a statistical distribution characteristic of type1 probes is proposed.
Abstract: Motivation: The Illumina Infinium 450 k DNA Methylation Beadchip is a prime candidate technology for Epigenome-Wide Association Studies (EWAS). However, a difficulty associated with these beadarrays is that probes come in two different designs, characterized by widely different DNA methylation distributions and dynamic range, which may bias downstream analyses. A key statistical issue is therefore how best to adjust for the two different probe designs. Results: Here we propose a novel model-based intra-array normalization strategy for 450 k data, called BMIQ (Beta MIxture Quantile dilation), to adjust the beta-values of type2 design probes into a statistical distribution characteristic of type1 probes. The strategy involves application of a three-state beta-mixture model to assign probes to methylation states, subsequent transformation of probabilities into quantiles and finally a methylation-dependent dilation transformation to preserve the monotonicity and continuity of the data. We validate our method on cell-line data, fresh frozen and paraffin-embedded tumour tissue samples and demonstrate that BMIQ compares favourably with two competing methods. Specifically, we show that BMIQ improves the robustness of the normalization procedure, reduces the technical variation and bias of type2 probe values and successfully eliminates the type1 enrichment bias caused by the lower dynamic range of type2 probes. BMIQ will be useful as a preprocessing step for any study using the Illumina Infinium 450 k platform. Availability: BMIQ is freely available from http://code.google.com/p/bmiq/. Contact: a.teschendorff@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

1,685 citations

Journal Article•10.1093/BIB/BBS046•
A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

Marie-Agnès Dillies, Andrea Rau, Julie Aubert, Christelle Hennequet-Antier1, Marine Jeanmougin, Nicolas Servant, Céline Keime, Guillemette Marot, David Castel, Jordi Estellé, Gregory Guernec, Bernd Jagla1, Luc Jouneau2, Denis Laloë, Caroline Le Gall, Brigitte Schaeffer2, Stéphane Le Crom, Mickaël Guedj2, Florence Jaffrézic •
Pasteur Institute1, Institut national de la recherche agronomique2
01 Nov 2013-Briefings in Bioinformatics
TL;DR: This work focuses on a comprehensive comparison of seven recently proposed normalization methods for the differential analysis of RNA-seq data, with an emphasis on the use of varied real and simulated datasets involving different species and experimental designs to represent data characteristics commonly observed in practice.
Abstract: During the last 3 years, a number of approaches for the normalization of RNA sequencing data have emerged in the literature, differing both in the type of bias adjustment and in the statistical strategy adopted. However, as data continue to accumulate, there has been no clear consensus on the appropriate normalization method to be used or the impact of a chosen method on the downstream analysis. In this work, we focus on a comprehensive comparison of seven recently proposed normalization methods for the differential analysis of RNA-seq data, with an emphasis on the use of varied real and simulated datasets involving different species and experimental designs to represent data characteristics commonly observed in practice. Based on this comparison study, we propose practical recommendations on the appropriate normalization method to be used and its impact on the differential analysis of RNA-seq data.

1,380 citations

Journal Article•10.1186/1471-2164-14-293•
A data-driven approach to preprocessing Illumina 450K methylation array data

Ruth Pidsley1, Chloe C. Y. Wong1, Manuela Volta1, Katie Lunnon1, Jonathan Mill1,2, Leonard C. Schalkwyk1 •
King's College London1, University of Exeter2
01 May 2013-BMC Genomics
TL;DR: It is demonstrated that quantile normalization methods produce marked improvement, even in highly consistent data, by all three metrics, and that careful selection of preprocessing steps can minimize variance and thus improve statistical power, especially for the detection of the small absolute DNA methylation changes likely associated with complex disease phenotypes.
Abstract: As the most stable and experimentally accessible epigenetic mark, DNA methylation is of great interest to the research community. The landscape of DNA methylation across tissues, through development and in disease pathogenesis is not yet well characterized. Thus there is a need for rapid and cost effective methods for assessing genome-wide levels of DNA methylation. The Illumina Infinium HumanMethylation450 (450K) BeadChip is a very useful addition to the available methods for DNA methylation analysis but its complex design, incorporating two different assay methods, requires careful consideration. Accordingly, several normalization schemes have been published. We have taken advantage of known DNA methylation patterns associated with genomic imprinting and X-chromosome inactivation (XCI), in addition to the performance of SNP genotyping assays present on the array, to derive three independent metrics which we use to test alternative schemes of correction and normalization. These metrics also have potential utility as quality scores for datasets. The standard index of DNA methylation at any specific CpG site is β = M/(M + U + 100) where M and U are methylated and unmethylated signal intensities, respectively. Betas (βs) calculated from raw signal intensities (the default GenomeStudio behavior) perform well, but using 11 methylomic datasets we demonstrate that quantile normalization methods produce marked improvement, even in highly consistent data, by all three metrics. The commonly used procedure of normalizing betas is inferior to the separate normalization of M and U, and it is also advantageous to normalize Type I and Type II assays separately. More elaborate manipulation of quantiles proves to be counterproductive. Careful selection of preprocessing steps can minimize variance and thus improve statistical power, especially for the detection of the small absolute DNA methylation changes likely associated with complex disease phenotypes. 
For the convenience of the research community we have created a user-friendly R software package called wateRmelon, downloadable from bioConductor, compatible with the existing methylumi, minfi and IMA packages, that allows others to utilize the same normalization methods and data quality tests on 450K data.
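
The β index quoted in the abstract above can be illustrated with a short sketch. The function name `beta_value`, the `offset` parameter name, and the example intensities are assumptions made for illustration; only the formula β = M/(M + U + 100) comes from the abstract, and this is not the wateRmelon implementation.

```python
def beta_value(m, u, offset=100.0):
    """Methylation index beta = M / (M + U + offset) for one CpG site.

    m, u: methylated and unmethylated signal intensities (non-negative).
    offset: the constant 100 from the formula in the abstract.
    """
    if m < 0 or u < 0:
        raise ValueError("signal intensities must be non-negative")
    return m / (m + u + offset)


# Hypothetical intensities: mostly methylated, unmethylated, half methylated.
examples = [(900.0, 0.0), (0.0, 900.0), (450.0, 450.0)]
betas = [beta_value(m, u) for m, u in examples]
```

The constant offset keeps β defined when both intensities are zero, at the cost of pulling values slightly away from 1 for fully methylated sites.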

1,124 citations

Journal Article•10.1002/CYTO.A.22271•
Normalization of mass cytometry data with bead standards

Rachel Finck1, Erin F. Simonds1, Astraea Jager1, Smita Krishnaswamy2, Karen Sachs1, Wendy J. Fantl1, Dana Pe'er2, Garry P. Nolan1, Sean C. Bendall1 •
Stanford University1, Columbia University2
01 May 2013-Cytometry Part A
TL;DR: The protocol described here includes simultaneous measurements of beads and cells on the mass cytometer, subsequent extraction of the bead‐based signature, and the application of an algorithm enabling correction of both short‐ and long‐term signal fluctuations.
Abstract: Mass cytometry uses atomic mass spectrometry combined with isotopically pure reporter elements to currently measure as many as 40 parameters per single cell. As with any quantitative technology, there is a fundamental need for quality assurance and normalization protocols. In the case of mass cytometry, the signal variation over time due to changes in instrument performance combined with intervals between scheduled maintenance must be accounted for and then normalized. Here, samples were mixed with polystyrene beads embedded with metal lanthanides, allowing monitoring of mass cytometry instrument performance over multiple days of data acquisition. The protocol described here includes simultaneous measurements of beads and cells on the mass cytometer, subsequent extraction of the bead-based signature, and the application of an algorithm enabling correction of both short- and long-term signal fluctuations. The variation in the intensity of the beads that remains after normalization may also be used to determine data quality. Application of the algorithm to a one-month longitudinal analysis of a human peripheral blood sample reduced the range of median signal fluctuation from 4.9-fold to 1.3-fold.

774 citations

Journal Article•10.1186/1471-2105-14-219•
TCC: an R package for comparing tag count data with robust normalization strategies

Jianqiang Sun1, Tomoaki Nishiyama2, Kentaro Shimizu1, Koji Kadota1•
University of Tokyo1, Kanazawa University2
09 Jul 2013-BMC Bioinformatics
TL;DR: DEGES in TCC is essential for accurate normalization of tag count data, especially when up- and down-regulated DEGs in one of the samples are extremely biased in their number.
Abstract: Differential expression analysis based on “next-generation” sequencing technologies is a fundamental means of studying RNA expression. We recently developed a multi-step normalization method (called TbT) for two-group RNA-seq data with replicates and demonstrated that the statistical methods available in four R packages (edgeR, DESeq, baySeq, and NBPSeq) together with TbT can produce a well-ranked gene list in which true differentially expressed genes (DEGs) are top-ranked and non-DEGs are bottom ranked. However, the advantages of the current TbT method come at the cost of a huge computation time. Moreover, the R packages did not have normalization methods based on such a multi-step strategy. TCC (an acronym for Tag Count Comparison) is an R package that provides a series of functions for differential expression analysis of tag count data. The package incorporates multi-step normalization methods, whose strategy is to remove potential DEGs before performing the data normalization. The normalization function based on this DEG elimination strategy (DEGES) includes (i) the original TbT method based on DEGES for two-group data with or without replicates, (ii) much faster methods for two-group data with or without replicates, and (iii) methods for multi-group comparison. TCC provides a simple unified interface to perform such analyses with combinations of functions provided by edgeR, DESeq, and baySeq. Additionally, a function for generating simulation data under various conditions and alternative DEGES procedures consisting of functions in the existing packages are provided. Bioinformatics scientists can use TCC to evaluate their methods, and biologists familiar with other R packages can easily learn what is done in TCC. DEGES in TCC is essential for accurate normalization of tag count data, especially when up- and down-regulated DEGs in one of the samples are extremely biased in their number. 
TCC is useful for analyzing tag count data in various scenarios ranging from unbiased to extremely biased differential expression. TCC is available at http://www.iu.a.u-tokyo.ac.jp/~kadota/TCC/ and will appear in Bioconductor ( http://bioconductor.org/ ) from ver. 2.13.

545 citations

Journal Article•10.1093/BIOINFORMATICS/BTT474•
DNorm: disease name normalization with pairwise learning to rank.

Robert Leaman1, Rezarta Islamaj Dogan1, Zhiyong Lu1•
Arizona State University1
15 Nov 2013-Bioinformatics
TL;DR: This article introduces the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM; the method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data.
Abstract: Motivation: Despite the central role of diseases in biomedical research, there have been much fewer attempts to automatically determine which diseases are mentioned in a text—the task of disease name normalization (DNorm)—compared with other normalization tasks in biomedical text mining research. Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval. Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator

544 citations

Journal Article•10.1371/JOURNAL.PONE.0080635•
Maximum allowed solvent accessibilites of residues in proteins.

Matthew Z. Tien1, A. Meyer2,3, Dariya K. Sydykova3, Stephanie J. Spielman3, Claus O. Wilke3 •
University of Chicago1, Texas Tech University Health Sciences Center2, University of Texas at Austin3
21 Nov 2013-PLOS ONE
TL;DR: It is concluded that previously published ASA normalization values were too small, primarily because the conformations that maximize ASA had not been correctly identified, and a new normalization scale is derived that does provide a tight upper bound on observed ASA values.
Abstract: The relative solvent accessibility (RSA) of a residue in a protein measures the extent of burial or exposure of that residue in the 3D structure. RSA is frequently used to describe a protein's biophysical or evolutionary properties. To calculate RSA, a residue's solvent accessibility (ASA) needs to be normalized by a suitable reference value for the given amino acid; several normalization scales have previously been proposed. However, these scales do not provide tight upper bounds on ASA values frequently observed in empirical crystal structures. Instead, they underestimate the largest allowed ASA values, by up to 20%. As a result, many empirical crystal structures contain residues that seem to have RSA values in excess of one. Here, we derive a new normalization scale that does provide a tight upper bound on observed ASA values. We pursue two complementary strategies, one based on extensive analysis of empirical structures and one based on systematic enumeration of biophysically allowed tripeptides. Both approaches yield congruent results that consistently exceed published values. We conclude that previously published ASA normalization values were too small, primarily because the conformations that maximize ASA had not been correctly identified. As an application of our results, we show that empirically derived hydrophobicity scales are sensitive to accurate RSA calculation, and we derive new hydrophobicity scales that show increased correlation with experimentally measured scales.
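
The normalization step described in this abstract, dividing a residue's ASA by a reference maximum for its amino acid, can be sketched as follows. The function name and all reference values here are hypothetical placeholders chosen for illustration; they are not the scale derived in the paper or any published scale.

```python
# Hypothetical reference maxima in square angstroms, for illustration only.
# These numbers are NOT taken from the paper or from any published scale.
EXAMPLE_MAX_ASA = {"ALA": 130.0, "GLY": 105.0}


def relative_solvent_accessibility(asa, residue, max_asa):
    """Return RSA = ASA / (reference maximum ASA for this amino acid)."""
    if asa < 0:
        raise ValueError("ASA must be non-negative")
    return asa / max_asa[residue]


# A reference value that is too small yields RSA > 1 for an exposed residue,
# which is the symptom the abstract describes.
too_small_scale = {"ALA": 100.0}
rsa_bad = relative_solvent_accessibility(110.0, "ALA", too_small_scale)
rsa_ok = relative_solvent_accessibility(110.0, "ALA", EXAMPLE_MAX_ASA)
```

With a reference value that is a tight upper bound on observed ASA, the same residue maps into the interval from 0 to 1.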

464 citations

Journal Article•10.1016/J.AB.2012.10.010•
Stain-Free technology as a normalization tool in Western blot analysis

Anne Gürtler, Nancy Kunz1, Maria Gomolka, Sabine Hornhardt, Anna A. Friedl, Kevin Mcdonald1, Jonathan E. Kohn1, Anton Posch1 •
Bio-Rad Laboratories1
15 Feb 2013-Analytical Biochemistry
TL;DR: Stain-Free technology appears to be more reliable, more robust, and more sensitive to small effects of protein regulation when compared with HKP normalization with GAPDH.

369 citations

Journal Article•10.1073/PNAS.1217854110•
Normalization is a general neural mechanism for context-dependent decision making

Kenway Louie1, Mel W. Khaw1, Paul W. Glimcher2•
New York University1, Center for Neural Science2
09 Apr 2013-Proceedings of the National Academy of Sciences of the United States of America
TL;DR: It is shown that choice models using normalization generate significant choice phenomena driven by either the value or number of alternative options, suggesting that the neural mechanism of value coding critically influences stochastic choice behavior and providing a generalizable quantitative framework for examining context effects in decision making.
Abstract: Understanding the neural code is critical to linking brain and behavior. In sensory systems, divisive normalization seems to be a canonical neural computation, observed in areas ranging from retina to cortex and mediating processes including contrast adaptation, surround suppression, visual attention, and multisensory integration. Recent electrophysiological studies have extended these insights beyond the sensory domain, demonstrating an analogous algorithm for the value signals that guide decision making, but the effects of normalization on choice behavior are unknown. Here, we show that choice models using normalization generate significant (and classically irrational) choice phenomena driven by either the value or number of alternative options. In value-guided choice experiments, both monkey and human choosers show novel context-dependent behavior consistent with normalization. These findings suggest that the neural mechanism of value coding critically influences stochastic choice behavior and provide a generalizable quantitative framework for examining context effects in decision making.
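
A minimal sketch of divisive normalization applied to option values may help make the context effect concrete. It assumes the simplified textbook form R_i = V_i / (σ + Σ_j V_j); the function name, the semisaturation constant `sigma`, and the example values are assumptions, and the model fitted in the paper may include additional terms.

```python
def divisive_normalization(values, sigma=1.0):
    """Normalize each value by sigma plus the summed value of all options.

    Simplified form: R_i = V_i / (sigma + sum_j V_j).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    denominator = sigma + sum(values)
    return [v / denominator for v in values]


# Hypothetical option values: adding a third option shrinks the coded
# difference between the first two, even though their values are unchanged.
two_options = divisive_normalization([10.0, 8.0])
three_options = divisive_normalization([10.0, 8.0, 6.0])
gap_two = two_options[0] - two_options[1]
gap_three = three_options[0] - three_options[1]
```

Under this form, the coded gap between two fixed options depends on both the value and the number of alternatives, which is the kind of context dependence the abstract refers to.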

337 citations

Journal Article•10.1038/NRN3424•
Erratum: Normalization as a canonical neural computation

Matteo Carandini, David J. Heeger
01 Feb 2013-Nature Reviews Neuroscience
TL;DR: On page 52 of this article, in the legend for figure 1, the text “lower concentrations are shown by lighter colours” should have read “lower concentrations are shown by darker colours”.
Abstract: Nature Reviews Neuroscience 13, 51–62 (2012) On page 52 of this article, in the legend for figure 1, the text “lower concentrations are shown by lighter colours” should have read “lower concentrations are shown by darker colours”. This has been corrected in the online version.

290 citations

Journal Article•10.47893/IJCCT.2013.1201•
Min Max Normalization Based Data Perturbation Method for Privacy Protection

Yogendra Kumar Jain1, Santosh Kumar Bhandare1•
Samrat Ashok Technological Institute1
1 Oct 2013
Journal Article•10.1109/TPAMI.2012.233•
Coupled Gaussian processes for pose-invariant facial expression recognition

Ognjen Rudovic1, Maja Pantic1, Ioannis Patras2•
Imperial College London1, Queen Mary University of London2
01 Jun 2013-IEEE Transactions on Pattern Analysis and Machine Intelligence
TL;DR: The proposed Coupled Scaled Gaussian Process Regression model for head-pose normalization outperforms state-of-the-art regression-based approaches to head-pose normalization, 2D and 3D Point Distribution Models (PDMs), and Active Appearance Models (AAMs), especially in cases of unknown poses and imbalanced training data.
Abstract: We propose a method for head-pose invariant facial expression recognition that is based on a set of characteristic facial points. To achieve head-pose invariance, we propose the Coupled Scaled Gaussian Process Regression (CSGPR) model for head-pose normalization. In this model, we first learn independently the mappings between the facial points in each pair of (discrete) nonfrontal poses and the frontal pose, and then perform their coupling in order to capture dependences between them. During inference, the outputs of the coupled functions from different poses are combined using a gating function, devised based on the head-pose estimation for the query points. The proposed model outperforms state-of-the-art regression-based approaches to head-pose normalization, 2D and 3D Point Distribution Models (PDMs), and Active Appearance Models (AAMs), especially in cases of unknown poses and imbalanced training data. To the best of our knowledge, the proposed method is the first one that is able to deal with expressive faces in the range from $(-45^\circ)$ to $(+45^\circ)$ pan rotation and $(-30^\circ)$ to $(+30^\circ)$ tilt rotation, and with continuous changes in head pose, despite the fact that training was conducted on a small set of discrete poses. We evaluate the proposed method on synthetic and real images depicting acted and spontaneously displayed facial expressions.
Journal Article•10.1007/S00216-012-6517-2•
Normalization procedures and reference material selection in stable HCNOS isotope analyses: an overview

Grzegorz Skrzypek1•
University of Western Australia1
01 Mar 2013-Analytical and Bioanalytical Chemistry
TL;DR: The uncertainties of stable isotope results depend not only on the technical aspects of measurements, but also on how raw data are normalized to one of the international isotope scales, so unification of the data processing protocols employed is highly desirable.
Abstract: The uncertainties of stable isotope results depend not only on the technical aspects of measurements, but also on how raw data are normalized to one of the international isotope scales. The inconsistency in the normalization methods used and in the selection of standards may lead to substantial differences in the results obtained. Therefore, unification of the data processing protocols employed is highly desirable. The best performing methods are two-point or multipoint normalization methods based on linear regression. Linear regression is most robust when based on standards that cover the entire range of δ values typically observed in nature, regardless of the δ values of the samples analysed. The uncertainty can be reduced by 50% if measurements of two different standards are performed four times, or measurements of four standards are performed twice, with each batch of samples. Chemical matrix matching between standards and samples seems to be critical for δ¹⁸O of nitrate or δ²H of hair samples (thermal conversion/elemental analyser), for example; however, it is not necessarily always critical for all types of samples and techniques (e.g. not for most δ¹⁵N and δ¹³C elemental analyser analyses). To ensure that all published data can be recalculated, if δ values of standards or the isotope scales are to be updated, the details of the normalization technique and the δ values of the standards used should always be clearly reported.
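
The two-point and multipoint normalization described in this abstract can be sketched as an ordinary least-squares line that maps measured δ values of standards onto their accepted values. The function names and all numbers below are hypothetical illustrations; they are not real reference materials or values from the paper.

```python
def fit_normalization(measured, accepted):
    """Least-squares line accepted = slope * measured + intercept.

    With two standards this is two-point normalization; with more
    standards it is multipoint normalization.
    """
    n = len(measured)
    if n < 2 or n != len(accepted):
        raise ValueError("need at least two standards with paired values")
    mean_x = sum(measured) / n
    mean_y = sum(accepted) / n
    sxx = sum((x - mean_x) ** 2 for x in measured)
    if sxx == 0:
        raise ValueError("standards must have distinct measured values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(measured, accepted))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def normalize(raw_delta, slope, intercept):
    """Map a raw sample delta value onto the reference scale."""
    return slope * raw_delta + intercept


# Hypothetical standards (per mil): raw instrument values and accepted values.
slope, intercept = fit_normalization([-29.0, 1.0], [-30.0, 2.0])
sample_normalized = normalize(-14.0, slope, intercept)
```

Choosing standards that bracket the expected sample range keeps the fit from extrapolating, in line with the recommendation in the abstract.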
Journal Article•10.1007/S11192-012-0913-4•
Source normalized indicators of citation impact: an overview of different approaches and an empirical comparison

Ludo Waltman1, Nees Jan van Eck1•
Leiden University1
01 Sep 2013-Scientometrics
TL;DR: This paper provides an overview of a number of source normalization approaches, empirically compares them with a traditional normalization approach based on a field classification system, and discusses the selection of the journals to be included in a normalization for field differences.
Abstract: Different scientific fields have different citation practices. Citation-based bibliometric indicators need to normalize for such differences between fields in order to allow for meaningful between-field comparisons of citation impact. Traditionally, normalization for field differences has usually been done based on a field classification system. In this approach, each publication belongs to one or more fields and the citation impact of a publication is calculated relative to the other publications in the same field. Recently, the idea of source normalization was introduced, which offers an alternative approach to normalize for field differences. In this approach, normalization is done by looking at the referencing behavior of citing publications or citing journals. In this paper, we provide an overview of a number of source normalization approaches and we empirically compare these approaches with a traditional normalization approach based on a field classification system. We also pay attention to the issue of the selection of the journals to be included in a normalization for field differences. Our analysis indicates a number of problems of the traditional classification-system-based normalization approach, suggesting that source normalization approaches may yield more accurate results.
Journal Article•10.1007/S11192-012-0940-1•
On bibliographic networks

Vladimir Batagelj1, Monika Cerinšek•
University of Ljubljana1
20 Jan 2013-arXiv: Social and Information Networks
TL;DR: The authors show that bibliographic data can be transformed into a collection of compatible networks, from which derived networks can be obtained by network multiplication with an appropriate normalization, and they discuss when the multiplication of sparse networks preserves sparseness.
Abstract: In the paper we show that the bibliographic data can be transformed into a collection of compatible networks. Using network multiplication different interesting derived networks can be obtained. In defining them an appropriate normalization should be considered. The proposed approach can be applied also to other collections of compatible networks. We also discuss the question when the multiplication of sparse networks preserves sparseness. The proposed approaches are illustrated with analyses of collection of networks on the topic "social network" obtained from the Web of Science.
Journal Article•10.1016/J.JNEUMETH.2013.03.019•
Preprocessing effects of 22 linear univariate features on the performance of seizure prediction methods.

Jalil Rasekhi1, Mohammad Reza Karami Mollaei1, Mojtaba Bandarabadi2, Cesar Teixeira2, António Dourado2 •
Babol Noshirvani University of Technology1, University of Coimbra2
15 Jul 2013-Journal of Neuroscience Methods
TL;DR: Combining multiple linear univariate features in one feature space and classifying the feature space using machine learning methods could predict epileptic seizures in patients suffering from refractory epilepsy.
Journal Article•
Multivariate functional principal component analysis: A normalization approach

Jeng-Min Chiou, Ya-Fang Yang, Yu-Ting Chen
01 May 2013-Annals of Statistics
TL;DR: In this article, a modified version of the classical Karhunen-Loeve expansion for a vector-valued random process, called a normalized multivariate functional principal component (mFPCn) approach, was proposed as a general stochastic representation for multivariate random functions.
Abstract: This study proposes a modified version of the classical Karhunen-Loeve expansion for a vector-valued random process, called a normalized multivariate functional principal component (mFPCn) approach, as a general stochastic representation for multivariate random functions. The mFPCn approach takes the varying extent of variations between the components of multivariate random functions into account and takes advantage of component dependency through the pairwise cross-covariance functions. The multivariate approach leads to a single set of multivariate functional principal component scores, which serves well as the proxy of multivariate functional data. We derive the consistency properties for the estimates of the mFPCn model components, and the asymptotic distributions for statistical inferences. We illustrate the finite sample performance of the mFPCn approach through the analysis of a traffic flow data set, including an application to clustering multivariate functional data derived from the mFPCn approach and a simulation study. The mFPCn approach serves as a basic and useful statistical tool for multivariate functional data analysis.
Journal Article•10.1093/BIOINFORMATICS/BTT511•
SAMstrt: statistical test for differential expression in single-cell transcriptome with spike-in normalization

Shintaro Katayama1, Virpi Töhönen1, Sten Linnarsson1, Juha Kere1•
Science for Life Laboratory1
15 Nov 2013-Bioinformatics
TL;DR: SAMstrt is an extension code for SAMseq, a statistical method for differential expression, that enables spike-in normalization and statistical testing based on the estimated absolute number of transcripts per cell for single-cell RNA-seq methods.
Abstract: Motivation: Recent transcriptome studies have revealed that total transcript numbers vary by cell type and condition; therefore, the statistical assumptions for single-cell transcriptome studies must be revisited. SAMstrt is an extension code for SAMseq, which is a statistical method for differential expression, to enable spike-in normalization and statistical testing based on the estimated absolute number of transcripts per cell for single-cell RNA-seq methods. Availability and Implementation: SAMstrt is implemented on R and available in github (https://github.com/shka/R-SAMstrt). Contact: shintaro.katayama@ki.se Supplementary Information: Supplementary data are available at Bioinformatics online.
Journal Article•10.1109/TASL.2013.2255278•
On Acoustic Emotion Recognition: Compensating for Covariate Shift

Ali Hassan1, Robert I. Damper1, Mahesan Niranjan1•
University of Southampton1
01 Jul 2013-IEEE Transactions on Audio, Speech, and Language Processing
TL;DR: In this paper, the authors employ three algorithms from the domain of transfer learning that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift.
Abstract: Pattern recognition tasks often face the situation that training data are not fully representative of test data. This problem is well-recognized in speech recognition, where methods like cepstral mean normalization (CMN), vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) are used to compensate for channel and speaker differences. Speech emotion recognition (SER) is an important emerging field in human-computer interaction and faces the same data shift problems, a fact which has been generally overlooked in this domain. In this paper, we show that compensating for channel and speaker differences can give significant improvements in SER by modelling these differences as a covariate shift. We employ three algorithms from the domain of transfer learning that apply importance weights (IWs) within a support vector machine classifier to reduce the effects of covariate shift. We test these methods on the FAU Aibo Emotion Corpus, which was used in the Interspeech 2009 Emotion Challenge. It consists of two separate parts recorded independently at different schools; hence the two parts exhibit covariate shift. Results show that the IW methods outperform combined CMN and VTLN and significantly improve on the baseline performance of the Challenge. The best of the three methods also improves significantly on the winning contribution to the Challenge.
Journal Article•10.4161/CIB.25849•
Comparison of normalization methods for differential gene expression analysis in RNA-Seq experiments: A matter of relative size of studied transcriptomes.

Elie Maza1, Pierre Frasse2, Pavel Senin, Mondher Bouzayen, Mohamed Zouine3 •
University of Toulouse1, Entertainments National Service Association2, Institut national de la recherche agronomique3
30 Jul 2013-Communicative & Integrative Biology
TL;DR: This study compares the most widespread normalization procedures and proposes a novel one aiming at removing an inherent bias of studied transcriptomes related to their relative size, named “Median Ratio Normalization” (MRN).
Abstract: In recent years, RNA-Seq technologies became a powerful tool for transcriptome studies. However, computational methods dedicated to the analysis of high-throughput sequencing data are yet to be standardized. In particular, it is known that the choice of a normalization procedure leads to a great variability in results of differential gene expression analysis. The present study compares the most widespread normalization procedures and proposes a novel one aiming at removing an inherent bias of studied transcriptomes related to their relative size. Comparisons of the normalization procedures are performed on real and simulated data sets. Real RNA-Seq data sets analyses, performed with all the different normalization methods, show that only 50% of significantly differentially expressed genes are common. This result highlights the influence of the normalization step on the differential expression analysis. Real and simulated data sets analyses give similar results showing 3 different groups of procedures having the same behavior. The group including the novel method named "Median Ratio Normalization" (MRN) gives the lower number of false discoveries. Within this group the MRN method is less sensitive to the modification of parameters related to the relative size of transcriptomes such as the number of down- and upregulated genes and the gene expression levels. The newly proposed MRN method efficiently deals with intrinsic bias resulting from relative size of studied transcriptomes. Validation with real and simulated data sets confirmed that MRN is more consistent and robust than existing methods.
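The abstract does not spell out the steps of MRN, so the following is not the authors' procedure. It is a generic median-of-ratios scaling sketch, shown only to illustrate why a median-based factor is less sensitive than total counts to a few strongly regulated genes. The function name and the toy counts are assumptions.

```python
import math
import statistics


def median_ratio_size_factors(counts):
    """Generic median-of-ratios scaling factors (illustrative, not MRN itself).

    counts: list of samples, each a list of per-gene counts in the same
    gene order. Returns one scaling factor per sample.
    """
    num_genes = len(counts[0])
    # Keep genes with non-zero counts in every sample so logs are defined.
    usable = [g for g in range(num_genes)
              if all(sample[g] > 0 for sample in counts)]
    # Per-gene reference: log of the geometric mean across samples.
    log_ref = {g: sum(math.log(sample[g]) for sample in counts) / len(counts)
               for g in usable}
    factors = []
    for sample in counts:
        log_ratios = [math.log(sample[g]) - log_ref[g] for g in usable]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


# Sample 2 is sequenced twice as deeply; its last gene is also upregulated.
counts = [[10, 20, 30, 40], [20, 40, 60, 800]]
factors = median_ratio_size_factors(counts)
print(factors[1] / factors[0])
```

In this toy case the ratio of the two factors recovers the twofold depth difference, whereas the ratio of total counts (920 / 100) would be inflated by the single upregulated gene.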
Journal Article•10.1103/PHYSREVC.88.064002•
Coarse-grained potential analysis of neutron-proton and proton-proton scattering below the pion production threshold

R. Navarro Pérez, J. E. Amaro, E. Ruiz Arriola
06 Dec 2013-Physical Review C
TL;DR: The authors present a successful fit to neutron-proton and proton-proton scattering data below the pion production threshold using the delta-shell representation; the analysis includes data from 1950 to 2013.
Abstract: Using the delta-shell representation we present a successful fit to neutron-proton and proton-proton scattering data below pion production threshold. A detailed overview of the theory necessary to calculate observables with this potential is presented. A new data selection process is used to obtain the largest mutually consistent data base. The analysis includes data within the years 1950 to 2013. Using 46 parameters we obtain chi^2/Ndata = 1.04 with Ndata = 6713 including normalization data. Phase shifts with error bars are provided.
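For orientation, the quoted figures imply the following rough totals. This is a back-of-the-envelope check, not a number reported in the abstract: the value 1.04 is already rounded, and taking the degrees of freedom as $N_\text{data} - N_\text{par}$ is an assumption.

```latex
\chi^2 \approx 1.04 \times N_\text{data} = 1.04 \times 6713 \approx 6982,
\qquad
\nu = N_\text{data} - N_\text{par} = 6713 - 46 = 6667,
\qquad
\frac{\chi^2}{\nu} \approx \frac{6982}{6667} \approx 1.05 .
```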
Proceedings Article•
A Log-Linear Model for Unsupervised Text Normalization

Yi Yang1, Jacob Eisenstein1•
Georgia Institute of Technology1
1 Oct 2013
TL;DR: This work uses the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
Abstract: We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. The weights of these features are trained in a maximum-likelihood framework, employing a novel sequential Monte Carlo training algorithm to overcome the large label space, which would be impractical for traditional dynamic programming solutions. This model is implemented in a normalization system called UNLOL, which achieves the best known results on two normalization datasets, outperforming more complex systems. We use the output of UNLOL to automatically normalize a large corpus of social media text, revealing a set of coherent orthographic styles that underlie online language variation.
Journal Article•10.1021/AC401559V•
Measurement of DNA concentration as a normalization strategy for metabolomic data from adherent cell lines

Leslie P. Silva1, Philip L. Lorenzi1, Preeti Purwaha1, Valeda Yong1, David H. Hawke1, John N. Weinstein1 •
University of Texas MD Anderson Cancer Center1
02 Oct 2013-Analytical Chemistry
TL;DR: It is concluded that DNA concentration is a widely applicable method for normalizing metabolomic data from adherent cell lines.
Abstract: Metabolomics is a rapidly advancing field, and much of our understanding of the subject has come from research on cell lines. However, the results and interpretation of such studies depend on appropriate normalization of the data; ineffective or poorly chosen normalization methods can lead to frankly erroneous conclusions. That is a recurrent challenge because robust, reliable methods for normalization of data from cells have not been established. In this study, we have compared several methods for normalization of metabolomic data from cell extracts. Total protein concentration, cell count, and DNA concentration exhibited strong linear correlations with seeded cell number, but DNA concentration was found to be the most generally useful method for the following reasons: (1) DNA concentration showed the greatest consistency across a range of cell numbers; (2) DNA concentration was the closest to proportional with cell number; (3) DNA samples could be collected from the same dish as the metabolites; and (4)...
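A minimal sketch of the idea, assuming normalization simply means dividing each metabolite signal by the DNA concentration measured from the same dish. The function name, metabolite names, and all numeric values are hypothetical, and the abstract does not state the exact formula used.

```python
def normalize_by_dna(peak_areas, dna_concentration):
    """Divide each metabolite signal by the dish's DNA concentration.

    peak_areas: dict mapping metabolite name to raw signal (hypothetical).
    dna_concentration: DNA concentration from the same dish, in any
    consistent unit; must be positive.
    """
    if dna_concentration <= 0:
        raise ValueError("DNA concentration must be positive")
    return {name: area / dna_concentration
            for name, area in peak_areas.items()}


# Two dishes with the same per-cell metabolite levels but different cell numbers.
dish_a = normalize_by_dna({"lactate": 1000.0, "glutamate": 400.0}, 2.0)
dish_b = normalize_by_dna({"lactate": 2000.0, "glutamate": 800.0}, 4.0)
print(dish_a == dish_b)  # True
```

If DNA concentration is proportional to cell number, as the abstract reports, the two dishes become directly comparable after this division.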
Proceedings Article•10.1109/ASRU.2013.6707731•
Score normalization and system combination for improved keyword spotting

Damianos Karakos1, Richard Schwartz1, Stavros Tsakalidis1, Le Zhang1, Shivesh Ranjan1, Tim Ng1, Roger Hsiao1, Guruprasad Saikumar1, Ivan Bulyko1, Long Nguyen1, John Makhoul1, Frantisek Grezl2, Mirko Hannemann2, Martin Karafiat2, Igor Szöke2, Karel Vesely2, Lori Lamel, Viet Bac Le3 •
Raytheon1, Brno University of Technology2, Vocapia Research3
1 Dec 2013
TL;DR: Two techniques are shown to yield improved Keyword Spotting (KWS) performance when using the ATWV/MTWV performance measures, which resulted in the highest performance for the official surprise language evaluation for the IARPA-funded Babel project in April 2013.
Abstract: We present two techniques that are shown to yield improved Keyword Spotting (KWS) performance when using the ATWV/MTWV performance measures: (i) score normalization, where the scores of different keywords become commensurate with each other and they more closely correspond to the probability of being correct than raw posteriors; and (ii) system combination, where the detections of multiple systems are merged together, and their scores are interpolated with weights which are optimized using MTWV as the maximization criterion. Both score normalization and system combination approaches show that significant gains in ATWV/MTWV can be obtained, sometimes on the order of 8-10 points (absolute), in five different languages. A variant of these methods resulted in the highest performance for the official surprise language evaluation for the IARPA-funded Babel project in April 2013.
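The abstract does not say which score normalization is used. One simple scheme that makes scores of different keywords commensurate is to divide each detection score by the sum of scores for the same keyword (often called sum-to-one normalization). The sketch below assumes that scheme purely for illustration; the function name and scores are hypothetical.

```python
from collections import defaultdict


def sum_to_one_normalize(detections):
    """Rescale scores so that the scores of each keyword sum to one.

    detections: list of (keyword, score) pairs with positive raw scores.
    Returns a new list of (keyword, normalized_score) pairs in the same order.
    """
    totals = defaultdict(float)
    for keyword, score in detections:
        totals[keyword] += score
    return [(keyword, score / totals[keyword])
            for keyword, score in detections]


# A frequent keyword with high posteriors and a rare one with low posteriors.
detections = [("alpha", 0.2), ("alpha", 0.6), ("beta", 0.01), ("beta", 0.03)]
normalized_scores = sum_to_one_normalize(detections)
print(normalized_scores)
```

After rescaling, the best hit of the rare keyword scores as high as the best hit of the frequent one, so a single global decision threshold treats both keywords more evenly.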
Journal Article•10.1007/S11192-012-0898-Z•
Opinion paper: thoughts and facts on bibliometric indicators

Wolfgang Glänzel1, Henk F. Moed2•
Katholieke Universiteit Leuven1, Elsevier2
01 Jul 2013-Scientometrics
TL;DR: This paper aims at contributing to the on-going discussion about building and applying bibliometric indicators by shedding light on their properties and requirements concerning six different aspects: deterministic versus probabilistic approach, application-related properties, the time dependence, normalization issues, size dependence and network indicators.
Abstract: This paper aims at contributing to the on-going discussion about building and applying bibliometric indicators. It sheds light on their properties and requirements concerning six different aspects: deterministic versus probabilistic approach, application-related properties, the time dependence, normalization issues, size dependence and network indicators.
Journal Article•10.1214/14-AOS1285•
Role of Normalization in Spectral Clustering for Stochastic Blockmodels

Purnamrita Sarkar, Peter J. Bickel
05 Oct 2013-arXiv: Machine Learning
TL;DR: This paper theoretically shows that normalization shrinks the spread of points in a class by a constant fraction under a broad parameter regime and obtains sharp deviation bounds of empirical principal eigenvalues of graphs generated from a stochastic blockmodel.
Abstract: Spectral clustering is a technique that clusters elements using the top few eigenvectors of their (possibly normalized) similarity matrix. The quality of spectral clustering is closely tied to the convergence properties of these principal eigenvectors. This rate of convergence has been shown to be identical for both the normalized and unnormalized variants in recent random matrix theory literature. However, normalization for spectral clustering is commonly believed to be beneficial [Stat. Comput. 17 (2007) 395-416]. Indeed, our experiments show that normalization improves prediction accuracy. In this paper, for the popular stochastic blockmodel, we theoretically show that normalization shrinks the spread of points in a class by a constant fraction under a broad parameter regime. As a byproduct of our work, we also obtain sharp deviation bounds of empirical principal eigenvalues of graphs generated from a stochastic blockmodel.
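To make the "normalized" variant concrete, a common form of normalization rescales the similarity (adjacency) matrix A by node degrees, giving D^{-1/2} A D^{-1/2}, before the top eigenvectors are computed. The sketch below builds only that matrix, in plain Python, and assumes this particular form; the eigenvector and clustering steps are omitted, and the function name and example graph are hypothetical.

```python
import math


def normalized_adjacency(adjacency):
    """Return D^{-1/2} A D^{-1/2} for a symmetric, non-negative adjacency matrix.

    adjacency: square list of lists. Entries that are zero stay zero, so
    isolated nodes (degree zero) never trigger a division by zero.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [[adjacency[i][j] / math.sqrt(degrees[i] * degrees[j])
             if adjacency[i][j] else 0.0
             for j in range(n)]
            for i in range(n)]


# Path graph on three nodes: 0 - 1 - 2, with degrees 1, 2, 1.
path_graph = [[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]]
normalized_matrix = normalized_adjacency(path_graph)
print(normalized_matrix[0][1])  # 1 / sqrt(1 * 2)
```

The unnormalized variant would use the matrix `path_graph` directly; the normalized one down-weights edges attached to high-degree nodes.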
Patent•
Loudness normalization based on user feedback

Frank M. Baumgarte1•
Apple Inc.1
25 Nov 2013
TL;DR: A system for performing loudness normalization based on user feedback is described, in which a plurality of audio playback devices communicate the volume settings used during playback of pieces of sound program content to a content delivery system.
Abstract: A system for performing loudness normalization based on user feedback is described herein. The system includes a content delivery system and a plurality of audio playback devices. The audio playback devices may communicate volume settings used during playback of pieces of sound program content to the content delivery system. Based on these collected data points, a statistical analysis may be performed to generate loudness adjustment values for pieces of sound program content. The loudness adjustment values may be communicated to the audio playback devices through metadata in associated pieces of sound program content or as separate communications from the content delivery system. An offline version is also described that supports individual loudness normalization adjustments based on a single user's preferences. Under either system, loudness normalization may be achieved based on real world volume settings for individual pieces of sound program content played using various playback configurations.
Journal Article•10.1063/1.4818323•
Absolute cross-section normalization of magnetic neutron scattering data

Guangyong Xu1, Zhijun Xu, John M. Tranquada•
Brookhaven National Laboratory1
19 Aug 2013-Review of Scientific Instruments
TL;DR: This paper discusses various methods to obtain the resolution volume for neutron scattering experiments, in order to perform absolute normalization on inelastic magnetic neutron scattering data, and compares the advantages of the different normalization processes.
Abstract: We discuss various methods to obtain the resolution volume for neutron scattering experiments, in order to perform absolute normalization on inelastic magnetic neutron scattering data. Examples from previous experiments are given. We also try to provide clear definitions of a number of physical quantities which are commonly used to describe neutron magnetic scattering results, including the dynamic spin correlation function and the imaginary part of the dynamic susceptibility. Formulas that can be used for general purposes are provided and the advantages of the different normalization processes are discussed.
Journal Article•10.1021/AC401400B•
Combination of Injection Volume Calibration by Creatinine and MS Signals’ Normalization to Overcome Urine Variability in LC-MS-Based Metabolomics Studies

Yanhua Chen1, Guoqing Shen1, Ruiping Zhang1, Jiuming He1, Yi Zhang1, Jing Xu1, Wei Yang1, Xiaoguang Chen1, Yongmei Song1, Zeper Abliz1 •
Peking Union Medical College1
02 Aug 2013-Analytical Chemistry
TL;DR: The results showed that calibrating injection volumes based on creatinine values could effectively eliminate intragroup differences caused by variations in the concentrations of urinary metabolites, giving better parallelism and clustering, and that peak area normalization could further eliminate intraclass differences.
Abstract: It is essential to choose one preprocessing method for liquid chromatography–mass spectrometry (LC-MS)-based metabolomics studies of urine samples in order to overcome their variability. However, the commonly used normalization methods do not substantially reduce the high variabilities arising from differences in urine concentration, especially for signal saturation (abundant metabolites exceed the dynamic range of the instrumentation) or missing values. Herein, a simple preacquisition strategy based on differential injection volumes calibrated by creatinine (to reduce the concentration differences between the samples), combined with normalization to “total useful MS signals” or “all MS signals”, is proposed to overcome urine variabilities. This strategy was first systematically compared with other popular normalization methods by application to serially diluted urine samples. Then, the method has been verified using rat urine samples of pre- and postinoculation of Walker 256 carcinoma cells. The results ...
Journal Article•10.1145/2414425.2414428•
Named entity recognition for tweets

Xiaohua Liu1, Furu Wei2, Shaodian Zhang3, Ming Zhou2•
Harbin Institute of Technology1, Microsoft2, Shanghai Jiao Tong University3
01 Feb 2013-ACM Transactions on Intelligent Systems and Technology
TL;DR: A novel method for Named Entity Recognition (NER) on tweets that consists of three core elements: normalization of tweets, a combination of a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model, and a semisupervised learning framework.
Abstract: Two main challenges of Named Entity Recognition (NER) for tweets are the insufficient information in a tweet and the lack of training data. We propose a novel method consisting of three core elements: (1) normalization of tweets; (2) combination of a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model; and (3) semisupervised learning framework. The tweet normalization preprocessing corrects common ill-formed words using a global linear model. The KNN-based classifier conducts prelabeling to collect global coarse evidence across tweets while the CRF model conducts sequential labeling to capture fine-grained information encoded in a tweet. The semisupervised learning plus the gazetteers alleviate the lack of training data. Extensive experiments show the advantages of our method over the baselines as well as the effectiveness of normalization, KNN, and semisupervised learning.