Proceedings Article, DOI: 10.1109/ICDM.2010.20
Assessing Data Mining Results on Matrices with Randomization
Markus Ojala
- 13 Dec 2010
- pp 959-964
TL;DR: This paper proposes a new approach for randomizing matrices containing features measured in different scales and provides an easily usable implementation that does not need problematic manual tuning as theoretically justified parameter values are given.
Abstract: Randomization is a general technique for evaluating the significance of data analysis results. In randomization-based significance testing, a result is considered interesting if an equally good result is unlikely to be obtained on random data sharing some basic properties with the original data. Recently, the randomization approach has been applied to assess data mining results on binary matrices and limited types of real-valued matrices. In these works, the row and column value distributions are approximately preserved in randomization. However, the previous approaches suffer from various technical and practical shortcomings. In this paper, we give solutions to these problems and introduce a new practical algorithm for randomizing various types of matrices while preserving the row and column value distributions more accurately. We propose a new approach for randomizing matrices containing features measured in different scales. Compared to previous work, our approach can be applied to assess data mining results on different types of real-life matrices containing dissimilar features, nominal values, non-Gaussian value distributions, missing values, and sparse structure. We provide an easily usable implementation that does not need problematic manual tuning, as theoretically justified parameter values are given. We perform extensive experiments on various real-life datasets showing that our approach produces reasonable results on practically all types of matrices while being easy and fast to use.
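The general idea of randomization-based significance testing described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the randomizer below simply shuffles each column independently, which preserves column value distributions exactly but ignores row distributions, and the function names (`permute_within_columns`, `empirical_p_value`, `max_abs_correlation`) and the choice of test statistic are assumptions made only for illustration.

```python
import numpy as np


def permute_within_columns(X, rng):
    """Hypothetical stand-in randomizer (NOT the paper's method).

    Shuffles every column independently, so each column keeps exactly
    the same values while associations between columns are destroyed.
    Row value distributions are not preserved by this simple scheme.
    """
    return np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])


def max_abs_correlation(X):
    """Example statistic: largest absolute correlation between two distinct columns."""
    C = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(C, 0.0)  # ignore each column's correlation with itself
    return float(np.abs(C).max())


def empirical_p_value(X, statistic, n_samples=199, seed=0):
    """Fraction of randomized matrices whose statistic is at least as large
    as the one observed on the original matrix.

    The +1 in numerator and denominator counts the original data as one of
    the samples, so the returned p-value is never exactly zero.
    """
    rng = np.random.default_rng(seed)
    observed = statistic(X)
    random_stats = np.array(
        [statistic(permute_within_columns(X, rng)) for _ in range(n_samples)]
    )
    return (np.sum(random_stats >= observed) + 1) / (n_samples + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 3))
    X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # plant a strong dependency
    print(empirical_p_value(X, max_abs_correlation))
```

A small p-value here says only that the observed structure is unlikely under this particular null model. A randomizer that also preserves row distributions, as the abstract describes, would define a stricter null model and could give a different answer.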
Citations
Knowledge discovery interestingness measures based on unexpectedness
TL;DR: Different methods for assessing the unexpectedness of patterns with a special focus on frequent itemsets, tiles, association rules, and classification rules are surveyed, namely, syntactical and probabilistic approaches.
32
Live and learn from mistakes: A lightweight system for document classification
TL;DR: The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naive Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
27
Computing exact permutation p-values for association rules
TL;DR: This paper proposes an algorithm called Exact Permutation p-values for Association Rules (EPAR) to calculate the exact p-values of all tested rules and demonstrates that EPAR can successfully alleviate the disadvantages and outperform the direct permutation-based method over several performance measures.
20
•Journal Article
Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data
TL;DR: Empirical evaluation shows that iterative scoring effectively reduces redundancy in ranking candidate tiles--showing the applicability of the model for a range of data mining fields aimed at discovering structure in real-valued data.
12
Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data
Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, Tijl De Bie +2 more
- 23 Sep 2013
TL;DR: In this article, the maximum entropy principle is used to model real-valued data, where statistics on arbitrary sets of cells as background knowledge are used to assess the likelihood of values and verify the significance of possibly overlapping structures discovered in the data.
References
Controlling the false discovery rate: a practical and powerful approach to multiple testing
Yoav Benjamini, Yosef Hochberg
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses (the false discovery rate), which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
The Kolmogorov-Smirnov Test for Goodness of Fit
TL;DR: In this paper, the maximum difference between an empirical and a hypothetical cumulative distribution is calculated, and confidence limits for a cumulative distribution are described, showing that the test is superior to the chi-square test.
5.9K
•Book
Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses
Phillip I. Good
- 22 Dec 2012
TL;DR: This book provides a step-by-step manual on the application of permutation tests in biology, medicine, science, and engineering and shows how the problems of missing and censored data, nonresponders, after-the-fact covariates, and outliers may be handled.
1.8K
Eigentaste: A Constant Time Collaborative Filtering Algorithm
TL;DR: This work compares Eigentaste to alternative algorithms using data from Jester, an online joke recommending system, and uses the Normalized Mean Absolute Error (NMAE) measure to compare performance of different algorithms.