Improving classification accuracy using automatically extracted training data

doi:10.1145/1557019.1557143

Proceedings Article

Ariel Fuxman, +5 more

- 28 Jun 2009

- pp 1145-1154

25

TL;DR: This work investigates how to identify data sources that can yield training data at low cost and studies whether the quantity of automatically extracted training data can compensate for its lower quality. It shows that automatically extracted training data can deliver large accuracy gains at much lower cost.

Abstract: Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used.

We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.
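
The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the log format (pairs of query and clicked domain), the seed site sets, the majority-vote rule, and the function name `extract_labeled_queries` are all assumptions made for illustration. The abstract names Amazon and Wikipedia only as examples of commercial and non-commercial sites.

```python
from collections import Counter, defaultdict

# Hypothetical seed sites (assumption): stand-ins for "predominantly
# commercial" and "predominantly non-commercial" sources.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def extract_labeled_queries(click_log):
    """Assign a noisy label to each query from the seed sites it led to.

    click_log: iterable of (query, clicked_domain) pairs (assumed format).
    Returns a dict mapping query -> "commercial" or "non-commercial".
    Queries with no clicks on a seed site, or with tied votes, are skipped.
    """
    votes = defaultdict(Counter)
    for query, domain in click_log:
        if domain in COMMERCIAL_SITES:
            votes[query]["commercial"] += 1
        elif domain in NONCOMMERCIAL_SITES:
            votes[query]["non-commercial"] += 1

    labels = {}
    for query, counts in votes.items():
        commercial = counts["commercial"]
        noncommercial = counts["non-commercial"]
        if commercial != noncommercial:
            # Majority vote is an illustrative choice, not taken from the paper.
            labels[query] = (
                "commercial" if commercial > noncommercial else "non-commercial"
            )
    return labels


# Made-up log entries, for demonstration only.
example_log = [
    ("digital camera", "amazon.com"),
    ("digital camera", "amazon.com"),
    ("digital camera", "wikipedia.org"),
    ("french revolution", "wikipedia.org"),
    ("weather", "example.com"),
    ("laptop", "amazon.com"),
    ("laptop", "wikipedia.org"),
]
training_labels = extract_labeled_queries(example_log)
```

Labels produced this way are noisy by construction, which is the trade-off the abstract studies: whether a large volume of such cheap labels can compensate for their lower quality relative to manual labeling.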

Citations

Proceedings Article•10.1145/1718487.1718496

Precomputing search features for fast and accurate query classification

Venkatesh Ganti, +2 more

- 04 Feb 2010

TL;DR: This paper proposes a new class of features that realizes the benefit of search-based features without high latency, and leverages co-occurrence between the query keywords and tags applied to documents in search results, resulting in a significant boost to web query classification accuracy.

33

Proceedings Article•10.1145/1995966.1996001

A transfer approach to detecting disease reporting events in blog social media

Avaré Stewart, +2 more

- 06 Jun 2011

TL;DR: A new method for the supervised detection of disease reporting events is proposed that automatically classifies outbreak reports to train a supervised learner; the knowledge acquired from the learning process is then transferred to the task of classifying blogs.

13

Proceedings Article•10.1145/1835804.1835899

Learning to combine discriminative classifiers: confidence based

Chi-Hoon Lee

- 25 Jul 2010

TL;DR: This paper proposes a novel Ensemble classifier, EnLR, that computes weights of responses from discriminative classifiers and combines their weighted responses to produce a single response for a test instance.

11

Journal Article•10.1109/TCAD.2013.2288688

Pre-Silicon Bug Forecast

Qi Guo, +6 more

- 01 Mar 2014

- IEEE Transactions on Computer-Aided Desi...

TL;DR: Experimental results show that with appropriate learning techniques and characteristics, about 90% of modules could be correctly predicted as buggy or clean, and the number of bugs in each module could also be accurately predicted.

11

Proceedings Article•10.1145/1935826.1935921

Shopping for products you don't know you need

Srikanth Jagabathula, +2 more

- 09 Feb 2011

TL;DR: A random model by which relationships between search queries are obtained and general conditions under which they can be reconstructed are described, which shows that the query communities found by the algorithm are more interesting and unexpected than a baseline of clustering the query-click graph.

7

References

Book

Elements of information theory

Thomas M. Cover, +1 more

- 01 Jan 1991

TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.

52.2K

Semi-Supervised Learning Literature Survey

Xiaojin Zhu

- 01 Jan 2005

TL;DR: This chapter provides background information on agile principles and an overview of three agile methodologies and an underlying assumption in plan-driven processes is that the requirements are relatively stable.

5.1K

Proceedings Article•10.1145/775047.775067

Optimizing search engines using clickthrough data

Thorsten Joachims

- 23 Jul 2002

TL;DR: The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.

4.9K

Proceedings Article•10.3115/981658.981684

Unsupervised word sense disambiguation rivaling supervised methods

David Yarowsky

- 26 Jun 1995

TL;DR: An unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations.

2.9K

Proceedings Article•10.3115/1073012.1073017

Scaling to Very Very Large Corpora for Natural Language Disambiguation

Michele Banko, +1 more

- 06 Jul 2001

TL;DR: This paper examines methods for effectively exploiting very large corpora when labeled data comes at a cost, and evaluates the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation.

818