Detecting Novel Associations in Large Data Sets

doi:10.1126/SCIENCE.1205438

Open Access•Journal Article•10.1126/SCIENCE.1205438

David N. Reshef, +16 more

- 16 Dec 2011

- Science

- Vol. 334, Iss: 6062, pp 1518-1524

3.1K

TL;DR: This paper presents a measure of dependence for two-variable relationships, the maximal information coefficient (MIC). MIC captures a wide range of associations, both functional and not, and for functional relationships it provides a score that roughly equals the coefficient of determination (R^2) of the data relative to the regression function.

Figures

Fig. 1. Computing MIC (A) For each pair (x,y), the MIC algorithm finds the x-by-y grid with the highest induced mutual information. (B) The algorithm normalizes the mutual information scores and compiles a matrix that stores, for each resolution, the best grid at that resolution and its normalized score. (C) The normalized scores form the characteristic matrix, which can be visualized as a surface; MIC corresponds to the highest point on this surface. In this example, there are many grids that achieve the highest score. The star in (B) marks a sample grid achieving this score, and the star in (C) marks that grid’s corresponding location on the surface.
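The three steps in Fig. 1 (score grids by mutual information, normalize, take the maximum of the characteristic matrix) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: instead of searching for the best grid at each resolution, it assumes simple equal-frequency bins on both axes, so the result is only a lower-bound approximation of MIC. The function names (`equal_frequency_bins`, `mutual_information`, `characteristic_matrix`, `mic_sketch`) and the grid-size budget `n ** 0.6` used here are illustrative assumptions.

```python
import numpy as np


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(values)


def mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information (in nats) of the distribution induced by an nx-by-ny grid."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1.0)  # count points per grid cell
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)    # marginal over x bins, shape (nx, 1)
    py = joint.sum(axis=0, keepdims=True)    # marginal over y bins, shape (1, ny)
    nonzero = joint > 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))


def characteristic_matrix(x, y, exponent=0.6):
    """Normalized score for every grid size (nx, ny) with nx * ny <= n ** exponent.

    Assumption: each grid uses equal-frequency bins rather than the
    optimized partitions described in the caption.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** exponent
    scores = {}
    nx = 2
    while nx * 2 <= budget:
        ny = 2
        while nx * ny <= budget:
            mi = mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny), nx, ny
            )
            # Dividing by log(min(nx, ny)) maps the score into [0, 1].
            scores[(nx, ny)] = mi / np.log(min(nx, ny))
            ny += 1
        nx += 1
    if not scores:
        raise ValueError("too few points for any grid of at least 2-by-2 cells")
    return scores


def mic_sketch(x, y):
    """Highest point of the (approximate) characteristic matrix."""
    return max(characteristic_matrix(x, y).values())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random(500)
    print("noisy parabola:", mic_sketch(x, (x - 0.5) ** 2 + 0.01 * rng.normal(size=500)))
    print("independent noise:", mic_sketch(x, rng.random(500)))
```

The dictionary returned by `characteristic_matrix` plays the role of the matrix in panel (B); plotting its values over `(nx, ny)` would give a surface like the one in panel (C). Because the bins are fixed rather than optimized, scores from this sketch can be lower than those of a full MIC implementation.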

Fig. 3. Visualizations of the characteristic matrices of common relationships. (A to F) Surfaces representing the characteristic matrices of several common relationship types. For each surface, the x axis represents number of vertical axis bins (rows), the y axis represents number of horizontal axis bins (columns), and the z axis represents the normalized score of the best-performing grid with those dimensions. The inset plots show the relationships used to generate each surface. For surfaces of additional relationships, see fig. S7.

Fig. 4. Application of MINE to global indicators from the WHO. (A) MIC versus r for all pairwise relationships in the WHO data set. (B) Mutual information (Kraskov et al. estimator) versus r for the same relationships. High mutual information scores tend to be assigned only to relationships with high r, whereas MIC gives high scores also to relationships that are nonlinear. (C to H) Example relationships from (A). (C) Both r and MIC yield low scores for unassociated variables. (D) Ordinary linear relationships score high under both tests. (E to G) Relationships detected by MIC but not by r, because the relationships are nonlinear (E and G) or because more than one relationship is present (F). In (F), the linear trendline comprises a set of

Citations

•Book

Applied Predictive Modeling

Max Kuhn, +1 more

- 17 May 2013

TL;DR: This research presents a novel and scalable approach called “Smartfitting” that automates the labor-intensive, time-consuming, and expensive process of designing and implementing statistical models for regression.

5.9K

Journal Article•10.1038/NRMICRO2832

Microbial interactions: from networks to models

Karoline Faust, +1 more

- 16 Jul 2012

- Nature Reviews Microbiology

TL;DR: This Review describes how metagenomics and 16S pyrosequencing techniques are opening the way towards global ecosystem network prediction and the development of ecosystem-wide dynamic models.

3.3K

•Journal Article•10.1038/NRG3182

The human microbiome: at the interface of health and disease

Ilseung Cho, +1 more

- 13 Mar 2012

- Nature Reviews Genetics

TL;DR: The large-scale dynamics of the microbiome can be described by many of the tools and observations used in the study of population ecology, and deciphering the metagenome and its aggregate genetic information can also be used to understand the functional properties of the microbial community.

2.9K

•Journal Article•10.1038/S41579-018-0024-1

Keystone taxa as drivers of microbiome structure and functioning

Samiran Banerjee, +3 more

- 01 Sep 2018

- Nature Reviews Microbiology

TL;DR: A definition of keystone taxa in microbial ecology is proposed, and over 200 microbial keystone taxa that have been identified in soil, plant and marine ecosystems, as well as in the human microbiome, are summarized.

2.1K

References

Journal Article•10.1111/J.2517-6161.1995.TB02031.X

Controlling the false discovery rate: a practical and powerful approach to multiple testing

Yoav Benjamini, +1 more

- 01 Jan 1995

- Journal of the royal statistical society...

TL;DR: This paper presents a different approach to problems of multiple significance testing, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate). This error rate is equivalent to the family-wise error rate (FWER) when all hypotheses are true but is smaller otherwise.

104.5K

•Book

Elements of information theory

Thomas M. Cover, +1 more

- 01 Jan 1991

TL;DR: The authors examine the role of entropy, inequalities, and randomness in the design and construction of codes.

52.2K

•Book

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Trevor Hastie, +2 more

- 28 Jul 2013

TL;DR: In this book, the authors describe the important ideas of statistical learning (data mining, inference, and prediction) in a common conceptual framework. The emphasis is on concepts rather than mathematics, with a liberal use of color graphics.

21.3K

The elements of statistical learning. 2001

Trevor Hastie, +2 more

- 01 Jan 2001

17.2K

Journal Article•10.1198/JASA.2004.S339

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

David Ruppert

- 01 Jun 2004

- Journal of the American Statistical Asso...

TL;DR: The Elements of Statistical Learning: Data Mining, Inference, and Prediction is a popular book on data mining and machine learning, with a focus on inference and prediction.

15.4K

Related Papers (5)

Estimating mutual information.

Elements of information theory

Random Forests

Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy

A mathematical theory of communication