Siphoning Hidden-Web Data through Keyword-Based Interfaces

Open Access • Journal Article

Luciano Barbosa, +1 more

- 27 May 2010

- Journal of Information and Data Manageme...

- Vol. 1, Iss: 1, pp 133-144

162

Abstract: In this paper, we study the problem of automating the retrieval of data hidden behind simple search interfaces that accept keyword-based queries. Our goal is to automatically retrieve all available results (or as many as possible). We propose a new approach to siphon hidden data that automatically generates a small set of representative keywords and builds queries that lead to high coverage. We evaluate our algorithms over several real Web sites. Preliminary results indicate that our approach is effective: coverage of over 90% is obtained for most of the sites considered.
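
The idea of choosing keywords so that successive queries uncover more of a hidden collection can be illustrated with a small sketch. It is not the paper's algorithm: the toy database, the `search` interface, the seed keyword, and the greedy "most frequent unused term" rule are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical hidden collection; a real siphoner would only see it via search().
TOY_DB = {
    1: "hidden web crawler",
    2: "web search engine",
    3: "database keyword search",
    4: "relational database schema",
    5: "protein sequence alignment",
}


def search(database, keyword):
    """Assumed keyword interface: ids of documents containing the keyword."""
    return {doc_id for doc_id, text in database.items()
            if keyword in text.lower().split()}


def siphon(database, seed, max_queries=10):
    """Greedy sketch: query with the most frequent not-yet-used term
    observed in the documents retrieved so far."""
    retrieved, used, term_counts = set(), [], Counter()
    keyword = seed
    for _ in range(max_queries):
        used.append(keyword)
        new_ids = search(database, keyword) - retrieved
        retrieved |= new_ids
        # Count each term once per newly retrieved document.
        for doc_id in new_ids:
            term_counts.update(set(database[doc_id].lower().split()))
        candidates = [t for t in term_counts if t not in used]
        if not candidates:
            break
        # Highest document frequency first; ties broken alphabetically.
        keyword = min(candidates, key=lambda t: (-term_counts[t], t))
    return retrieved, used


def coverage(retrieved, database):
    """Fraction of the collection retrieved (computable only in this toy
    setting, where the full database is known)."""
    return len(retrieved) / len(database)


if __name__ == "__main__":
    docs, queries = siphon(TOY_DB, "web")
    print(sorted(docs), queries, coverage(docs, TOY_DB))
```

In this toy collection, document 5 shares no terms with the others, so no keyword discovered from retrieved results can reach it. This is one reason a siphoning approach may aim for "as many as possible" rather than all results.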

Citations

Journal Article•10.1561/1500000017

Web Crawling

Christopher Olston, +1 more

- 01 Mar 2010

- Foundations and Trends in Information Re...

TL;DR: The fundamental challenges of web crawling are outlined and the state-of-the-art models and solutions are described, and avenues for future work are highlighted.

420

Proceedings Article•10.1145/1242572.1242632

An adaptive crawler for locating hidden-Web entry points

Luciano Barbosa, +1 more

- 08 May 2007

TL;DR: A new framework is proposed whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning.

208

Journal Article•10.1145/1897816.1897839

Structured data on the web

Michael Cafarella, +2 more

- 01 Feb 2011

- Communications of The ACM

TL;DR: Fusion Tables is described, a recently launched data-management service that lets users create and visualize structured data easily and emphasizes the ability to collaborate with other data owners.

119

•Proceedings Article

Harnessing the Deep Web: present and future

Jayant Madhavan, +3 more

- 01 Jan 2009

TL;DR: This paper reports some of the key observations in building the system that exposed content from the Deep Web to web-search users of Google.com and discusses the choice of underlying approach in exposing deep-web content in a search engine.

106

•Posted Content

Harnessing the Deep Web: Present and Future

Jayant Madhavan, +3 more

- 09 Sep 2009

- arXiv: Databases

TL;DR: In this article, the authors report on where they believe the Deep Web provides value and where it does not, and contrast two very different approaches to exposing Deep-Web content: the surfacing approach that they used, and the virtual integration approach that has been pursued in the data management literature.

96

References

•Journal Article•10.1016/S1389-1286(99)00052-3

Focused crawling: a new approach to topic-specific Web resource discovery

Soumen Chakrabarti, +2 more

- 17 May 1999

TL;DR: A new hypertext resource discovery system called a Focused Crawler is described; it is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.

1.7K

•Journal Article•10.1126/SCIENCE.280.5360.98

Searching the World Wide Web

Steve Lawrence, +1 more

- 03 Apr 1998

- Science

TL;DR: The coverage and recency of the major World Wide Web search engines were analyzed, yielding some surprising results, including a lower bound on the size of the indexable Web of 320 million pages.

1.1K

•Proceedings Article•10.1109/ICDE.2002.994756

Keyword searching and browsing in databases using BANKS

G. Bhalotia, +4 more

- 26 Feb 2002

TL;DR: BANKS, a system that enables keyword-based search on relational databases together with data and schema browsing, is described, and an efficient heuristic algorithm for finding and ranking query results is presented.

1K

•Proceedings Article•10.1109/ICDE.2002.994693

DBXplorer: a system for keyword-based search over relational databases

Sanjay Agrawal, +2 more

- 07 Aug 2002

TL;DR: DBXplorer, a system that enables keyword-based searches in relational databases using a commercial relational database and Web server and allows users to interact via a browser front-end, is discussed.

887

•Journal Article•10.3998/3336451.0007.104

White Paper: The Deep Web: Surfacing Hidden Value

Michael K. Bergman

- 01 Aug 2001

- Journal of Electronic Publishing

TL;DR: BrightPlanet's search technology automates the process of making dozens of direct queries simultaneously using multiple-thread technology and thus is the only search technology, so far, that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content.

691