Open Access • Proceedings Article
Declarative information extraction using datalog with embedded extraction predicates
Warren Shen, AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan
- 23 Sep 2007
- pp 1033-1044
TL;DR: This paper argues that developing information extraction programs using Datalog with embedded procedural extraction predicates is a good way to proceed, and shows how optimizing such programs raises challenges specific to text data that cannot be accommodated in the current relational optimization framework.
Abstract: In this paper we argue that developing information extraction (IE) programs using Datalog with embedded procedural extraction predicates is a good way to proceed. First, compared to current ad-hoc composition using, e.g., Perl or C++, Datalog provides a cleaner and more powerful way to compose small extraction modules into larger programs. Thus, writing IE programs this way retains and enhances the important advantages of current approaches: programs are easy to understand, debug, and modify. Second, once we write IE programs in this framework, we can apply query optimization techniques to them. This gives programs that, when run over a variety of data sets, are more efficient than any monolithic program because they are optimized based on the statistics of the data on which they are invoked. We show how optimizing such programs raises challenges specific to text data that cannot be accommodated in the current relational optimization framework, then provide initial solutions. Extensive experiments over real-world data demonstrate that optimization is indeed vital for IE programs and that we can effectively optimize IE programs written in this proposed framework.
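The two claims in the abstract (declarative composition of small extraction modules, and optimization by choosing how to evaluate them) can be illustrated with a minimal sketch. It does not reproduce the paper's language or optimizer. The rule `talk(p, t) :- docs(d), line(d, l), person(l, p), time(l, t)`, the regex-based predicates `extract_person` and `extract_time`, and the sample `DOCS` are all hypothetical, and the reordering shown is a generic "run the cheap, selective predicate first" heuristic rather than the authors' technique.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical procedural extraction predicates. Each one takes a piece of
# text and yields the spans it extracts, like a black-box module that a
# Datalog rule body could call.
def extract_lines(doc: str) -> Iterator[str]:
    for line in doc.splitlines():
        if line.strip():
            yield line.strip()

def extract_person(line: str) -> Iterator[str]:
    # Toy pattern, assumed for illustration: "Dr. Name" or "Prof. Name".
    for m in re.finditer(r"\b(?:Dr|Prof)\. [A-Z][a-z]+", line):
        yield m.group(0)

def extract_time(line: str) -> Iterator[str]:
    # Toy pattern, assumed for illustration: clock times such as 10:30.
    for m in re.finditer(r"\b\d{1,2}:\d{2}\b", line):
        yield m.group(0)

Result = Tuple[Set[Tuple[str, str]], int]

# Plan 1: evaluate the rule body left to right.
# The second return value counts calls to extract_person.
def talks_naive(docs: Iterable[str]) -> Result:
    out, person_calls = set(), 0
    for d in docs:
        for line in extract_lines(d):
            person_calls += 1
            people = list(extract_person(line))
            for t in extract_time(line):
                for p in people:
                    out.add((p, t))
    return out, person_calls

# Plan 2: same rule, but evaluate time(l, t) first and skip lines with no
# time, so extract_person runs on fewer lines.
def talks_reordered(docs: Iterable[str]) -> Result:
    out, person_calls = set(), 0
    for d in docs:
        for line in extract_lines(d):
            times = list(extract_time(line))
            if not times:
                continue
            person_calls += 1
            for p in extract_person(line):
                for t in times:
                    out.add((p, t))
    return out, person_calls

# Hypothetical sample input.
DOCS = [
    "Seminar schedule\nDr. Smith speaks at 10:30 in room 2\n"
    "Lunch at 12:00\nProf. Jones is away",
    "Dr. Lee at 14:15",
]

if __name__ == "__main__":
    print(talks_naive(DOCS))
    print(talks_reordered(DOCS))
```

Both plans return the same set of `(person, time)` pairs, but the reordered plan invokes the person extractor on fewer lines. Which order is cheaper depends on the data, which is why the abstract ties optimization to the statistics of the data set a program is run on.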
Citations
Information Extraction
Sunita Sarawagi
- 01 Mar 2008
TL;DR: A taxonomy of the field is created along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced, and techniques for optimizing the various steps in an information extraction pipeline are surveyed.
680
•Proceedings Article
Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
Laura Chiticariu, Yunyao Li, Frederick Reiss
- 01 Oct 2013
TL;DR: A case is made for the importance of rule-based IE to industry practitioners, and a research agenda is laid out for advancing the state of the art in rule-based IE systems, which has the potential to bridge the gap between academic research and industry practice.
296
•Proceedings Article
SystemT: An Algebraic Approach to Declarative Information Extraction
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan
- 11 Jul 2010
TL;DR: A rule-based IE system whose basic design removes the expressivity and performance limitations of current systems based on cascading grammars, SystemT uses a declarative rule language, AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules.
From information to knowledge: harvesting entities and relationships from web sources
Gerhard Weikum, Martin Theobald
- 06 Jun 2010
TL;DR: This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting, to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations as well as temporal contexts, with high precision and high recall.
•Posted Content
Incremental Knowledge Base Construction Using DeepDive
TL;DR: This work describes DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and presents techniques to make the KBC process more efficient, and proposes two methods for incremental inference, based, respectively, on sampling and variational techniques.
175
References
Efficient string matching: an aid to bibliographic search
TL;DR: A simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text that has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
UIMA: an architectural approach to unstructured information processing in the corporate research environment
David A. Ferrucci, Adam Lally
TL;DR: A general introduction to UIMA is given, focusing on the design points of its analysis engine architecture, and how UIMA is helping to accelerate research and technology transfer is discussed.
1K
A brief survey of web data extraction tools
Alberto H. F. Laender, Berthier Ribeiro-Neto, Altigran Soares da Silva, Juliana Silveira Teixeira
- 01 Jun 2002
TL;DR: A taxonomy for characterizing Web data extraction tools is proposed, major Web data extraction tools described in the literature are briefly surveyed, and a qualitative analysis of them is provided.
854
The Volcano optimizer generator: extensibility and efficient search
Goetz Graefe, William J. McKenna
- 19 Apr 1993
TL;DR: The Volcano project, which provides efficient, extensible tools for query and request processing, particularly for object-oriented and scientific database systems, is reviewed, and it is shown that the search engine of the Volcano optimizer generator is more extensible and powerful.
•Book
Foundations of Databases: The Logical Level
Serge Abiteboul, Richard Hull, Victor Vianu
- 01 Jan 1995
TL;DR: Foundations of Databases presents in-depth coverage of this theory, surveys several emerging topics, and presents a unifying and contemporary perspective on the field.
445