Proceedings Article, DOI: 10.1145/1401890.1401899
Structured entity identification and document categorization: two tasks with one joint model
Indrajit Bhattacharya,Shantanu Godbole,Sachindra Joshi +2 more
- 24 Aug 2008
- pp 25-33
25
TL;DR: A probabilistic generative model for joint entity identification and document categorization is proposed and it is shown how the parameters of the model can be estimated using an EM algorithm in an unsupervised fashion.
Abstract: Traditionally, research in identifying structured entities in documents has proceeded independently of document categorization research. In this paper, we observe that these two tasks have much to gain from each other. Apart from direct references to entities in a database, such as names of person entities, documents often also contain words that are correlated with discriminative entity attributes, such as the age group and income level of persons. This happens naturally in many enterprise domains such as CRM and banking. Entity identification, which is typically vulnerable to noise and incompleteness in direct references to entities in documents, can then benefit from document categorization with respect to such attributes. In return, entity identification enables documents to be categorized according to different label sets arising from entity attributes without requiring any supervision. In this paper, we propose a probabilistic generative model for joint entity identification and document categorization. We show how the parameters of the model can be estimated using an EM algorithm in an unsupervised fashion. Using extensive experiments over real and semi-synthetic data, we demonstrate that the two tasks can benefit immensely from each other when performed jointly using the proposed model.
Citations
Collective Graph Identification
TL;DR: This article introduces the problem of graph identification, i.e., discovering the latent graph structure underlying an observed network, and presents a simple, yet novel, approach to address all three subproblems simultaneously, which consists of a collection of Coupled Collective Classifiers that are applied iteratively to propagate inferred information among the subproblems.
43
Retrieval of Relevant Opinion Sentences for New Products
Dae Hoon Park,Hyun Duk Kim,ChengXiang Zhai,Lifan Guo +3 more
- 09 Aug 2015
TL;DR: This work studies the novel problem of retrieving relevant opinion sentences from the reviews of other products using specifications of a new or unpopular product as query and proposes a popular summarization method and its modified version to solve the problem.
25
Towards combining web classification and web information extraction: a case study
Ping Luo,Fen Lin,Yuhong Xiong,Yong Zhao,Zhongzhi Shi +4 more
- 28 Jun 2009
TL;DR: This paper proposes to combine Web Classification and Web Information Extraction by using a model of Conditional Random Fields (CRFs), which can be used to simultaneously recognize the target Web pages and extract the corresponding metadata.
22
Web metadata extraction and semantic indexing for learning objects extraction
TL;DR: A multi-strategy approach for semantically guided extraction, indexing and search of educational metadata is described; it combines machine learning, concept analysis, and corpus-based natural language processing techniques.
20
Web metadata extraction and semantic indexing for learning objects extraction
John Atkinson,Andrea Gonzalez,Mauricio Munoz,Hernán Astudillo +3 more
- 17 Jun 2013
TL;DR: A new approach to automatic metadata extraction and semantic indexing for educational purposes is proposed to identify learning objects that may assist educators to prepare pedagogical materials from the Web.
20