Open Access, Proceedings Article
Attribute Relation Extraction from Template-inconsistent Semi-structured Text by Leveraging Site-level Knowledge
Yang Liu, Fang Liu, Siwei Lai, Kang Liu, Guangyou Zhou, Jun Zhao
- 01 Oct 2013
- pp 1097-1101
TL;DR: A novel method that leverages site-level knowledge for attribute-value extraction from semi-structured text whose templates are inconsistent. It uses a graph-based random walk model to acquire the site-level knowledge and significantly improves extraction performance over a baseline that does not use such knowledge.
Abstract: A variety of methods have been proposed for attribute-value extraction from semi-structured text with consistent templates (strict semi-text). However, when the templates in semi-structured text are inconsistent (weak semi-text), these methods perform poorly. To overcome the template-inconsistency problem, in this paper we propose a novel method that leverages site-level knowledge for attribute-value extraction. First, we use a graph-based random walk model to acquire site-level knowledge. Then we use this knowledge to identify weak semi-text in each page and extract attribute-value pairs. The experiments show that, compared with a baseline method that does not use site-level knowledge, our method improves extraction performance significantly.
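The abstract does not specify how the graph is built or how the walk is run, so the following is only a minimal sketch of one common graph-based random walk variant, random walk with restart, and not the authors' model. It assumes a hypothetical graph whose nodes are candidate attribute strings seen across one site and whose edge weights are made-up co-occurrence counts. The function name, the input format, the restart probability, and the example data are all illustrative assumptions.

```python
from collections import defaultdict


def random_walk_with_restart(edges, seeds, restart=0.15, iters=100, tol=1e-10):
    """Score nodes of an undirected weighted graph by random walk with restart.

    edges: iterable of (node_a, node_b, weight) with positive weights
           (hypothetical input format, not taken from the paper).
    seeds: nodes the walker jumps back to with probability `restart`.
    """
    # Build a symmetric weighted adjacency map.
    adj = defaultdict(dict)
    for a, b, w in edges:
        adj[a][b] = adj[a].get(b, 0.0) + w
        adj[b][a] = adj[b].get(a, 0.0) + w
    nodes = sorted(adj)

    seed_set = {s for s in seeds if s in adj}
    if not seed_set:
        raise ValueError("no seed node appears in the graph")

    # Restart distribution: uniform over the seed nodes.
    e = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    r = dict(e)
    out_weight = {n: sum(adj[n].values()) for n in nodes}

    for _ in range(iters):
        # Each step: restart mass plus mass propagated along weighted edges.
        nxt = {n: restart * e[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * r[n] / out_weight[n]
            for m, w in adj[n].items():
                nxt[m] += share * w
        delta = sum(abs(nxt[n] - r[n]) for n in nodes)
        r = nxt
        if delta < tol:
            break
    return r


# Made-up co-occurrence counts between candidate attribute strings on one site.
example_edges = [
    ("price", "brand", 3.0),
    ("brand", "weight", 2.0),
    ("weight", "contact us", 1.0),
]
scores = random_walk_with_restart(example_edges, seeds=["price"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this sketch, strings that co-occur strongly with the seed attribute receive higher scores than loosely connected page chrome such as "contact us", which is one plausible way a site-wide ranking of attribute names could be obtained before pairs are extracted from individual pages.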