Assessing Data Mining Results on Matrices with Randomization

doi:10.1109/ICDM.2010.20

Proceedings Article•10.1109/ICDM.2010.20

Markus Ojala

- 13 Dec 2010

- pp 959-964

19

TL;DR: This paper proposes a new approach for randomizing matrices containing features measured on different scales and provides an easily usable implementation that needs no problematic manual tuning, because theoretically justified parameter values are given.

Abstract: Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered interesting if an equally good result is unlikely to be obtained on random data that shares some basic properties with the original data. Recently, the randomization approach has been applied to assess data mining results on binary matrices and limited types of real-valued matrices. In these works, the row and column value distributions are approximately preserved in randomization. However, the previous approaches suffer from various technical and practical shortcomings. In this paper, we give solutions to these problems and introduce a new practical algorithm for randomizing various types of matrices while preserving the row and column value distributions more accurately. We propose a new approach for randomizing matrices containing features measured on different scales. Compared to previous work, our approach can be applied to assess data mining results on different types of real-life matrices containing dissimilar features, nominal values, non-Gaussian value distributions, missing values and sparse structure. We provide an easily usable implementation that does not need problematic manual tuning, as theoretically justified parameter values are given. We perform extensive experiments on various real-life datasets showing that our approach produces reasonable results on practically all types of matrices while being easy and fast to use.
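
Illustration: the randomization test described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the algorithm proposed in the paper. It swaps in plain column-wise permutation as the randomization step, which preserves each column's value distribution exactly but does not preserve row distributions. The names `permute_columns`, `cross_product` and `empirical_p_value` are hypothetical and chosen only for this example.

```python
import random


def permute_columns(matrix, rng):
    """Return a copy of matrix with every column shuffled independently.

    Assumption for this sketch: only column value distributions are kept.
    The paper's method also preserves row distributions approximately.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


def cross_product(matrix):
    """Example statistic: sum of products of the first two columns.

    Column sums are unchanged by permutation, so a larger value means a
    stronger positive association between the two columns.
    """
    return sum(row[0] * row[1] for row in matrix)


def empirical_p_value(matrix, statistic, n_samples=999, seed=0):
    """Fraction of randomized matrices scoring at least as well as the original.

    The +1 terms count the original matrix as one of the samples, so the
    returned value is never zero.
    """
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least_as_good = sum(
        1
        for _ in range(n_samples)
        if statistic(permute_columns(matrix, rng)) >= observed
    )
    return (at_least_as_good + 1) / (n_samples + 1)


if __name__ == "__main__":
    # Two features on different scales that rise together.
    data = [[i, 10 * i] for i in range(1, 9)]
    print(empirical_p_value(data, cross_product))
```

A small p-value here says only that the observed association is unlikely under independent column shuffling. A null model that also constrains the rows, as the paper aims for, can lead to a different conclusion.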

Citations

Journal Article•10.1002/WIDM.1063

Knowledge discovery interestingness measures based on unexpectedness

Kleanthis-Nikolaos Kontonasios, +2 more

- 01 Sep 2012

- Wiley Interdisciplinary Reviews-Data Min...

TL;DR: Different methods for assessing the unexpectedness of patterns with a special focus on frequent itemsets, tiles, association rules, and classification rules are surveyed, namely, syntactical and probabilistic approaches.

32

Journal Article•10.1016/J.IPM.2012.02.001

Live and learn from mistakes: A lightweight system for document classification

Yevgen Borodin, +4 more

- 01 Jan 2013

- Information Processing and Management

TL;DR: The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naive Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.

27

Journal Article•10.1016/J.INS.2016.01.094

Computing exact permutation p-values for association rules

Jun Wu, +5 more

- 10 Jun 2016

- Information Sciences

TL;DR: This paper proposes an algorithm called Exact Permutation p-values for Association Rules (EPAR) to calculate the exact p-values of all tested rules and demonstrates that EPAR can successfully alleviate the disadvantages and outperform the direct permutation-based method over several performance measures.

20

Journal Article

Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data

Kleanthis-Nikolaos Kontonasios, +2 more

- 01 Jan 2013

- Lecture Notes in Computer Science

TL;DR: Empirical evaluation shows that iterative scoring effectively reduces redundancy in ranking candidate tiles, showing the applicability of the model for a range of data mining fields aimed at discovering structure in real-valued data.

12

Book Chapter•10.1007/978-3-642-40991-2_17

Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data

Kleanthis-Nikolaos Kontonasios, +2 more

- 23 Sep 2013

TL;DR: In this article, the maximum entropy principle is used to model real-valued data, where statistics on arbitrary sets of cells as background knowledge are used to assess the likelihood of values and verify the significance of possibly overlapping structures discovered in the data.

12

References

Journal Article•10.1111/J.2517-6161.1995.TB02031.X

Controlling the false discovery rate: a practical and powerful approach to multiple testing

Yoav Benjamini, +1 more

- 01 Jan 1995

- Journal of the royal statistical society...

TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate), which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.

104.5K

UCI Machine Learning Repository

A. Asuncion

- 01 Jan 2007

24.3K

Journal Article•10.2307/2280095

The Kolmogorov-Smirnov Test for Goodness of Fit

Frank J. Massey

- 01 Mar 1951

- Journal of the American Statistical Asso...

TL;DR: In this paper, the maximum difference between an empirical and a hypothetical cumulative distribution is calculated, and confidence limits for a cumulative distribution are described, showing that the test is superior to the chi-square test.

5.9K

Book

Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses

Phillip I. Good

- 22 Dec 2012

TL;DR: This book provides a step-by-step manual on the application of permutation tests in biology, medicine, science, and engineering and shows how the problems of missing and censored data, nonresponders, after-the-fact covariates, and outliers may be handled.

1.8K

Journal Article•10.1023/A:1011419012209

Eigentaste: A Constant Time Collaborative Filtering Algorithm

Ken Goldberg, +3 more

- 01 Jul 2001

- Information Retrieval

TL;DR: This work compares Eigentaste to alternative algorithms using data from Jester, an online joke recommending system, and uses the Normalized Mean Absolute Error (NMAE) measure to compare performance of different algorithms.

1.8K