Detecting Novel Associations in Large Data Sets
David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti
TL;DR: The paper introduces the maximal information coefficient (MIC), a measure of dependence for two-variable relationships. MIC captures a wide range of associations, both functional and not, and for functional relationships gives a score roughly equal to the coefficient of determination of the data relative to the regression function.
Abstract: Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
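The abstract names MIC without writing it down. As a hedged summary, the definition is usually stated as below. The notation is an assumption of this note rather than something given in the excerpt: D is the set of n data points, I*(D, x, y) is the largest mutual information achievable by any grid with x columns and y rows, and B(n) is an upper bound on grid size, with B(n) = n^{0.6} being the default we recall the authors suggesting.

```latex
\[
  M(D)_{x,y} \;=\; \frac{I^{*}(D, x, y)}{\log \min\{x, y\}},
  \qquad
  \mathrm{MIC}(D) \;=\; \max_{xy \,<\, B(n)} M(D)_{x,y},
  \qquad
  B(n) = n^{0.6}.
\]
```

Dividing by \(\log \min\{x, y\}\) keeps every entry of the matrix M(D) in [0, 1], since the mutual information of an x-by-y grid cannot exceed that value. This is what makes scores from grids of different resolutions comparable.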
Figures

Fig. 1. Computing MIC (A) For each pair (x,y), the MIC algorithm finds the x-by-y grid with the highest induced mutual information. (B) The algorithm normalizes the mutual information scores and compiles a matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) The normalized scores form the characteristic matrix, which can be visualized as a surface; MIC corresponds to the highest point on this surface. In this example, there are many grids that achieve the highest score. The star in (B) marks a sample grid achieving this score, and the star in (C) marks that grid’s corresponding location on the surface. 
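The three steps in the Fig. 1 caption (search grids, normalize, take the maximum) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' algorithm: for each resolution it tries only the single equal-frequency grid instead of searching for the best grid, so it can only underestimate MIC. The function names (`equal_frequency_bins`, `grid_mutual_information`, `approx_mic`) and the `exponent=0.6` grid-size bound are assumptions made for this sketch.

```python
import numpy as np


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    # Double argsort turns values into ranks 0..n-1 (ties broken by position).
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(values)


def grid_mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information, in bits, of the distribution a grid induces."""
    counts = np.zeros((nx, ny))
    np.add.at(counts, (x_bins, y_bins), 1)  # count points per grid cell
    joint = counts / counts.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over x bins, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal over y bins, shape (1, ny)
    filled = joint > 0                      # empty cells contribute nothing
    ratio = joint[filled] / (px @ py)[filled]
    return float(np.sum(joint[filled] * np.log2(ratio)))


def approx_mic(x, y, exponent=0.6):
    """Crude lower bound on MIC using equal-frequency grids only.

    ASSUMPTION: the real method optimizes grid-line placement at each
    resolution; here one fixed grid per resolution stands in for that search.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    max_cells = len(x) ** exponent          # assumed bound on nx * ny
    best = 0.0
    nx = 2
    while nx * 2 < max_cells:
        x_bins = equal_frequency_bins(x, nx)
        ny = 2
        while nx * ny < max_cells:
            y_bins = equal_frequency_bins(y, ny)
            mi = grid_mutual_information(x_bins, y_bins, nx, ny)
            # Normalized score: one entry of the characteristic matrix.
            best = max(best, mi / np.log2(min(nx, ny)))
            ny += 1
        nx += 1
    return best                             # highest point on the surface


# Usage: a parabola has near-zero linear correlation but strong dependence.
x = np.linspace(-1.0, 1.0, 1000)
y = x ** 2
print("Pearson r:", np.corrcoef(x, y)[0, 1])
print("approximate MIC:", approx_mic(x, y))
```

Because only one grid is tried per resolution, noisy or oddly shaped relationships will score lower here than under a full grid search. For very small samples, no grid satisfies the size bound and the function returns 0.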
Fig. 3. Visualizations of the characteristic matrices of common relationships. (A to F) Surfaces representing the characteristic matrices of several common relationship types. For each surface, the x axis represents number of vertical axis bins (rows), the y axis represents number of horizontal axis bins (columns), and the z axis represents the normalized score of the best-performing grid with those dimensions. The inset plots show the relationships used to generate each surface. For surfaces of additional relationships, see fig. S7. 
Fig. 4. Application of MINE to global indicators from the WHO. (A) MIC versus r for all pairwise relationships in the WHO data set. (B) Mutual information (Kraskov et al. estimator) versus r for the same relationships. High mutual information scores tend to be assigned only to relationships with high r, whereas MIC gives high scores also to relationships that are nonlinear. (C to H) Example relationships from (A). (C) Both r and MIC yield low scores for unassociated variables. (D) Ordinary linear relationships score high under both tests. (E to G) Relationships detected by MIC but not by r, because the relationships are nonlinear (E and G) or because more than one relationship is present (F). In (F), the linear trendline comprises a set of