Proceedings Article, DOI: 10.1145/1935826.1935904
Web-scale table census and classification
Eric Crestan, Patrick Pantel
- 09 Feb 2011
- pp 545-554
TL;DR: A large-scale experimental analysis over a crawl of the Web shows that the classification accuracy of TabEx, a supervised framework for web-scale HTML table classification, significantly exceeds that of several baselines; a detailed feature analysis outlines the most salient features for each table type.
Abstract: We report on a census of the types of HTML tables on the Web according to a fine-grained classification taxonomy describing the semantics that they express. For each relational table type, we describe open challenges for extracting from them semantic triples, i.e., knowledge. We also present TabEx, a supervised framework for web-scale HTML table classification and apply it to the task of classifying HTML tables into our taxonomy. We show empirical evidence, through a large-scale experimental analysis over a crawl of the Web, that classification accuracy significantly outperforms several baselines. We present a detailed feature analysis and outline the most salient features for each table type.
Citations
Knowledge Graphs
Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, José Emilio Labra Gayo, Sabrina Kirrane, Sebastian Neumaier, Axel Polleres, Roberto Navigli, Axel-Cyrille Ngonga Ngomo, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan F. Sequeda, Steffen Staab, Antoine Zimmermann
TL;DR: The historical events that led to the interweaving of data and knowledge are tracked to help improve knowledge and understanding of the world around us.
1.3K
Information extraction meets the semantic web: a survey
222
Web Table Extraction, Retrieval, and Augmentation: A Survey
Shuo Zhang, Krisztian Balog
TL;DR: The objective of this survey is to synthesize and present two decades of research on web tables into six main categories of information access tasks: table extraction, table interpretation, table search, question answering, knowledge base augmentation, and table augmentation.
103
Ad Hoc Table Retrieval using Semantic Similarity
Shuo Zhang, Krisztian Balog
TL;DR: In this article, the authors address the problem of ad hoc table retrieval by answering a keyword query with a ranked list of tables, and propose a method for performing semantic matching between queries and tables.
92
TUTA: Tree-based Transformers for Generally Structured Table Pre-training
Zhiruo Wang, Haoyu Dong, Ran Jia, Jia Li, Zhiyi Fu, Shi Han, Dongmei Zhang
- 14 Aug 2021
TL;DR: TUTA proposes tree-based attention and position embeddings to better capture the spatial and hierarchical information of tables, and devises three progressive pre-training objectives to enable representations at the token, cell, and table levels.
79
References
Greedy function approximation: A gradient boosting machine.
TL;DR: A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
Towards domain-independent information extraction from web tables
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak
- 08 May 2007
TL;DR: This paper shifts attention from the tree-based representation of webpages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen and believes that this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web".
A machine learning based approach for table detection on the web
Yalin Wang, Jianying Hu
- 07 May 2002
TL;DR: A machine learning based approach is presented that classifies each given table entity as either genuine or non-genuine; the authors also design a novel web document table ground truthing protocol and use it to build a large table ground truth database.
239
Identifying synonyms among distributionally similar words
Dekang Lin, Shaojun Zhao, Lijuan Qin, Ming Zhou
- 09 Aug 2003
TL;DR: This work presents two methods for identifying synonyms among distributionally similar words and presents two approaches to compute similarities between words based on their distributions in contexts.
Recent advances in predictive (machine) learning
TL;DR: This paper provides an introduction to two recent methods, support vector machines and boosted decision trees, tracing their respective ancestral roots to standard kernel methods and ordinary decision trees.