Book Chapter, DOI: 10.1007/978-3-540-24840-8_37
Distributed Data Mining vs. Sampling Techniques: A Comparison
TL;DR: An overview of the most common sampling techniques and a new technique of distributed data-mining based on rule set models, where the aggregation technique is based on a confidence coefficient associated with each rule and on very small samples from each database.
Abstract: To address the problem of mining a huge volume of geographically distributed databases, we propose two approaches. The first is to download only a sample of each database. The second is to mine each distributed database remotely, download the resulting models to a central site, and then aggregate these models. In this paper, we present an overview of the most common sampling techniques. We then present a new technique of distributed data mining based on rule set models, where the aggregation technique is based on a confidence coefficient associated with each rule and on very small samples from each database. Finally, we present a comparison between the best sampling techniques that we found in the literature and our approach of model aggregation.
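The abstract does not spell out how the rule sets are aggregated, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that each site contributes a list of classification rules, that a rule's confidence coefficient is estimated as its accuracy on the records it covers in the pooled small samples, and that rules below a threshold are discarded. All names (`Rule`, `draw_sample`, `estimate_confidence`, `aggregate`, `predict`) and the threshold are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A record is (attributes, class label). This layout is an assumption.
Record = Tuple[Dict[str, float], str]


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # does the rule fire?
    label: str                                     # class predicted when it fires
    confidence: float = 0.0                        # filled in at the central site


def draw_sample(database: List[Record], k: int, seed: int) -> List[Record]:
    """Simple random sample without replacement of at most k records."""
    rng = random.Random(seed)
    return rng.sample(database, min(k, len(database)))


def estimate_confidence(rule: Rule, sample: List[Record]) -> float:
    """Fraction of covered records whose label matches the rule (assumed definition)."""
    covered = [label for attrs, label in sample if rule.condition(attrs)]
    if not covered:
        return 0.0  # a rule that never fires on the sample gets no support
    return sum(1 for label in covered if label == rule.label) / len(covered)


def aggregate(site_rules: List[List[Rule]],
              samples: List[List[Record]],
              threshold: float) -> List[Rule]:
    """Pool the small samples, score every rule, keep the confident ones."""
    pooled = [record for sample in samples for record in sample]
    kept = []
    for rules in site_rules:
        for rule in rules:
            rule.confidence = estimate_confidence(rule, pooled)
            if rule.confidence >= threshold:
                kept.append(rule)
    # Most confident rules are consulted first.
    return sorted(kept, key=lambda r: r.confidence, reverse=True)


def predict(rules: List[Rule], attrs: Dict[str, float], default: str) -> str:
    """Return the label of the first (most confident) rule that fires."""
    for rule in rules:
        if rule.condition(attrs):
            return rule.label
    return default
```

In this sketch only the rules and a few records per site travel to the central site, which is the trade-off the abstract contrasts with downloading a larger sample of every database and mining it centrally.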
Citations
Research on application of data mining methods to diagnosing gastric cancer
Arnis Kirshners,Serge Parshutin,Marcis Leja +2 more
- 13 Jul 2012
TL;DR: This research reveals several possibilities of application of data mining methods to diagnosing gastric cancer, which is the fourth leading cancer type in incidence after the breast, lung and colorectal cancers.
6
Distributed data mining system based on multi-agent communication mechanism
Sung Gook Kim,Kyeong Deok Woo,Jerzy Bala,Sung Wook Baik +3 more
- 23 Jun 2010
TL;DR: This paper presents an overview of a distributed data mining system developed according to two approaches: 1) distributed data modeling and 2) distributed decision making.
3
Le forage distribué des données : une approche basée sur l'agrégation et le raffinement de modèles [Distributed data mining: an approach based on model aggregation and refinement]
Mohamed Aoun-Allah
- 01 Jan 2006
TL;DR: This research proposes a distributed data mining approach, aggregating and refining models from geographically dispersed sites to create a metaclassifier, improving efficiency and providing a unified view of the data set.
3
Adaptive scheduling for adaptive sampling in pos taggers construction
TL;DR: Proposes adaptive scheduling for adaptive sampling as a novel way of machine learning in the construction of part-of-speech taggers, analyzing the shape of the learning curve geometrically in conjunction with a functional model to increase or decrease it at any time.
3