Proceedings Article, DOI: 10.1145/1557019.1557143
Improving classification accuracy using automatically extracted training data
Ariel Fuxman,Anitha Kannan,Andrew Goldberg,Rakesh Agrawal,Panayiotis Tsaparas,John C. Shafer +5 more
- 28 Jun 2009
- pp 1145-1154
TL;DR: This work investigates how to identify data sources that can yield training data at low cost, studies whether the quantity of automatically extracted training data can compensate for its lower quality, and shows that automatically extracted training data can deliver large accuracy gains at much lower cost.
Abstract: Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used. We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.
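The labeling idea in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical log format of (query, site) pairs and hypothetical site lists; the abstract names Amazon and Wikipedia only as examples. The word-based naive Bayes classifier is a stand-in chosen for brevity, not the model the paper reports using.

```python
import math
from collections import Counter

# Hypothetical site lists (assumption): the abstract gives these only as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def auto_label(log_records):
    """Turn (query, site) records into (query, label) pairs.

    Label 1 = commercial, 0 = non-commercial. Records from sites in
    neither list are skipped because they give no cheap label.
    """
    labeled = []
    for query, site in log_records:
        if site in COMMERCIAL_SITES:
            labeled.append((query, 1))
        elif site in NONCOMMERCIAL_SITES:
            labeled.append((query, 0))
    return labeled


def train_naive_bayes(labeled):
    """Count query words per class (stand-in model, not the paper's)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for query, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(query.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, query):
    """Return the more likely label under add-one-smoothed naive Bayes."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        if class_counts[label] == 0:
            continue  # class never seen in training
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in query.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy log records, invented purely for illustration.
logs = [
    ("buy digital camera", "amazon.com"),
    ("cheap laptop deals", "amazon.com"),
    ("history of rome", "wikipedia.org"),
    ("photosynthesis definition", "wikipedia.org"),
    ("weather today", "example.com"),  # unlisted site, skipped
]
training_data = auto_label(logs)
model = train_naive_bayes(training_data)
```

With these toy records, `predict(model, "buy cheap camera")` leans commercial and `predict(model, "history of photosynthesis")` leans non-commercial. The labels are noisy by construction (not every query issued on a commercial site has commercial intent), which is the quality-versus-quantity trade-off the abstract sets out to study.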
Citations
Precomputing search features for fast and accurate query classification
Venkatesh Ganti,Arnd Christian König,Xiao Li +2 more
- 04 Feb 2010
TL;DR: This paper proposes a new class of features that realizes the benefit of search-based features without high latency, and leverages co-occurrence between the query keywords and tags applied to documents in search results, resulting in a significant boost to web query classification accuracy.
A transfer approach to detecting disease reporting events in blog social media
Avaré Stewart,Matthew Smith,Wolfgang Nejdl +2 more
- 06 Jun 2011
TL;DR: A new method for the supervised detection of disease reporting events is proposed that automatically classifies outbreak reports to train a supervised learner; the knowledge acquired from the learning process is then transferred to the task of classifying blogs.
Learning to combine discriminative classifiers: confidence based
Chi-Hoon Lee
- 25 Jul 2010
TL;DR: This paper proposes a novel Ensemble classifier, EnLR, that computes weights of responses from discriminative classifiers and combines their weighted responses to produce a single response for a test instance.
Pre-Silicon Bug Forecast
TL;DR: Experimental results show that with appropriate learning techniques and characteristics, about 90% of modules could be correctly predicted as buggy or clean, and the number of bugs in each module could also be accurately predicted.
Shopping for products you don't know you need
Srikanth Jagabathula,Nina Mishra,Sreenivas Gollapudi +2 more
- 09 Feb 2011
TL;DR: This paper describes a random model by which relationships between search queries are obtained and general conditions under which they can be reconstructed, and shows that the query communities found by the algorithm are more interesting and unexpected than a baseline of clustering the query-click graph.
References
•Book
Elements of information theory
Thomas M. Cover,Joy A. Thomas +1 more
- 01 Jan 1991
TL;DR: The author examines the role of entropy, inequality, and randomness in the design of codes and the construction of codes in the rapidly changing environment.
Semi-Supervised Learning Literature Survey
Xiaojin Zhu
- 01 Jan 2005
TL;DR: This survey reviews the literature on semi-supervised learning, in which both labeled and unlabeled data are used for training.
Optimizing search engines using clickthrough data
Thorsten Joachims
- 23 Jul 2002
TL;DR: The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.
Unsupervised word sense disambiguation rivaling supervised methods
David Yarowsky
- 26 Jun 1995
TL;DR: An unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations.
Scaling to Very Very Large Corpora for Natural Language Disambiguation
Michele Banko,Eric D. Brill +1 more
- 06 Jul 2001
TL;DR: This paper examines methods for effectively exploiting very large corpora when labeled data comes at a cost, and evaluates the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation.