Open Access Proceedings Article
Crawling deep web content through query forms
Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng, Xiao Liu +4 more
- 23 Aug 2018
pp 629-637
18
TL;DR: This paper proposes the concept of Minimum Executable Pattern (MEP), and then presents a MEP generation method and a MEP-based Deep Web adaptive query method that outperforms existing methods in terms of query capability and applicability.
Abstract: This paper proposes the concept of the Minimum Executable Pattern (MEP), and then presents a MEP generation method and a MEP-based adaptive query method for the Deep Web. The query method extends the query interface from a single text box to a set of MEPs, and generates a locally optimal query by choosing a MEP and a keyword vector for that MEP. The method overcomes, to a certain extent, the problem of “data islands” that results from the deficiencies of current methods. Experimental results on six real-world Deep Web sites show that the method outperforms existing methods in terms of query capability and applicability.
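The selection step described in the abstract (choose a pattern and a keyword vector for it, then issue the locally best query) can be illustrated with a small greedy sketch. This is not the paper's MEP generation or query method: a "pattern" here is simply a tuple of form-field names standing in for a MEP, the site is simulated by an in-memory list of records, and the gain estimate is an oracle that runs the query directly. The names `run_query`, `pick_query`, and `crawl` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical stand-ins: a pattern is a tuple of form-field names that can be
# submitted together; a query pairs a pattern with one keyword per field.
Pattern = Tuple[str, ...]
Query = Tuple[Pattern, Tuple[str, ...]]
Record = Dict[str, str]


def run_query(records: List[Record], query: Query) -> Set[int]:
    """Simulate a form submission: ids of records matching every field."""
    pattern, keywords = query
    return {
        rec_id
        for rec_id, rec in enumerate(records)
        if all(rec.get(field) == kw for field, kw in zip(pattern, keywords))
    }


def pick_query(
    candidates: Iterable[Query],
    seen: Set[int],
    estimate: Callable[[Query], Set[int]],
) -> Optional[Query]:
    """Greedy step: the candidate expected to return the most unseen records."""
    best, best_gain = None, 0
    for query in candidates:
        gain = len(estimate(query) - seen)
        if gain > best_gain:  # strict comparison keeps the earliest on ties
            best, best_gain = query, gain
    return best  # None when no candidate adds anything new


def crawl(
    records: List[Record], candidates: Iterable[Query], max_queries: int = 10
) -> Tuple[List[Query], Set[int]]:
    """Issue locally best queries until nothing new is expected."""
    seen: Set[int] = set()
    issued: List[Query] = []
    remaining = list(candidates)
    for _ in range(max_queries):
        # Assumption: the estimate is an oracle (the real result set). A real
        # crawler would have to estimate gain from records harvested so far.
        query = pick_query(remaining, seen, lambda q: run_query(records, q))
        if query is None:
            break
        seen |= run_query(records, query)
        issued.append(query)
        remaining.remove(query)
    return issued, seen
```

The point of the sketch is only that candidates can span several field combinations rather than one keyword box, so records unreachable through one field may still be reached through another.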
Citations
A Novel Architecture for Deep Web Crawler
Dilip Kumar Sharma, A. K. Sharma +1 more
TL;DR: To minimize the limitations of existing deep Web crawlers, a novel architecture is proposed based on the QIIIEP specifications (Sharma & Sharma, 2009); it is cost-effective and supports both privatized search and general search for deep Web data hidden behind HTML forms.
30
A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
Paris Koloveas, Thanasis Chantzios, Christos Tryfonopoulos, Spiros Skiadopoulos +3 more
- 08 Jul 2019
TL;DR: This work focuses on the information gathering task, and presents a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.
28
Learning Deep Web Crawling with Diverse Features
Lu Jiang, Zhaohui Wu, Qinghua Zheng, Jun Liu +3 more
- 15 Sep 2009
TL;DR: This paper proposes a novel Deep Web crawling method that breaks through the assumption of plain-text search made by existing methods and outperforms state-of-the-art methods.
25
References
•Book
The Fractal Geometry of Nature
Benoit B. Mandelbrot
- 01 Jan 1982
TL;DR: This book is a blend of erudition, popularization, and exposition, and the illustrations include many superb examples of computer graphics that are works of art in their own right.
26.1K
Distributed search over the hidden web: hierarchical database sampling and selection
Panagiotis G. Ipeirotis, Luis Gravano +1 more
- 20 Aug 2002
TL;DR: In this paper, the authors present an algorithm to derive content summaries from "uncooperative" databases by using focused query probes, which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases.
•Journal Article
Siphoning Hidden-Web Data through Keyword-Based Interfaces
Luciano Barbosa, Juliana Freire +1 more
Abstract: In this paper, we study the problem of automating the retrieval of data hidden behind simple search interfaces that accept keyword-based queries. Our goal is to automatically retrieve all available results (or as many as possible). We propose a new approach to siphoning hidden data that automatically generates a small set of representative keywords and builds queries that lead to high coverage. We evaluate our algorithms over several real Web sites. Preliminary results indicate that our approach is effective: coverage of over 90% is obtained for most of the sites considered.
163
•Proceedings Article
Searching for Hidden-Web Databases
Luciano Barbosa, Juliana Freire +1 more
- 01 Jan 2005
TL;DR: A new crawling strategy to automatically locate hidden-Web databases is proposed, which aims to balance the two conflicting requirements of this problem: the need to perform a broad search and the need to avoid crawling a large number of irrelevant pages.
149
Automatic complex schema matching across Web query interfaces: A correlation mining approach
Bin He, Kevin Chen-Chuan Chang +1 more
TL;DR: This article develops the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. It also develops a novel “ensemble” approach, which creates an ensemble of DCM matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting.