Structure-driven crawler generation by example

doi:10.1145/1148170.1148223

Proceedings Article•10.1145/1148170.1148223

Márcio L. A. Vidal, +3 more

- 06 Aug 2006

- pp 292-299

60

TL;DR: This paper presents a structure-driven approach for generating Web crawlers that requires minimal effort from users. The approach is based on navigation patterns: sequences of patterns for the links a crawler has to follow to reach the pages that are structurally similar to a sample page.
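
As a rough illustration of the navigation-pattern idea in the summary above, the sketch below treats a navigation pattern as an ordered list of link patterns and follows only the links that match the pattern for the current step. Everything here is an assumption made for illustration: the summary does not say how the paper represents patterns, so URL regular expressions are used as a stand-in, and the function `follow_navigation_pattern`, the in-memory `SITE` link graph, and the example URLs are all hypothetical.

```python
import re
from typing import Dict, List


def follow_navigation_pattern(
    site: Dict[str, List[str]],
    entry_url: str,
    navigation_pattern: List[str],
) -> List[str]:
    """Follow links step by step, keeping only links that match each step's pattern.

    `site` is a hypothetical in-memory link graph (URL -> outgoing links) that
    stands in for fetching and parsing real pages.
    """
    frontier = [entry_url]
    for link_pattern in navigation_pattern:
        regex = re.compile(link_pattern)
        next_frontier: List[str] = []
        seen = set()
        for page in frontier:
            for link in site.get(page, []):
                # Keep a link only if the whole URL matches this step's pattern.
                if regex.fullmatch(link) and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    # Pages reached after the last step are the candidate target pages.
    return frontier


# Hypothetical site: entry page -> index -> letter pages -> target pages.
SITE = {
    "/": ["/artists", "/about"],
    "/artists": ["/artists/a", "/artists/b", "/contact"],
    "/artists/a": ["/artist/1", "/artist/2"],
    "/artists/b": ["/artist/3", "/artists"],
}

# One (assumed) link pattern per navigation step.
PATTERN = [r"/artists", r"/artists/[a-z]", r"/artist/\d+"]

targets = follow_navigation_pattern(SITE, "/", PATTERN)
print(targets)
```

With this toy graph, the crawler ignores links such as `/about` and `/contact` because they match no step of the pattern, and it ends on the three `/artist/<id>` pages. A real crawler would replace the dictionary lookup with page fetching and link extraction.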

Citations

Journal Article•10.1016/J.ESWA.2021.115560

An intelligent system for focused crawling from Big Data sources

Ida Bifulco, +4 more

- 01 Dec 2021

- Expert Systems With Applications

TL;DR: In this paper, the authors propose an intelligent system to support organizations in the focused crawling of artefacts of interest from the web (calls for tender, BIMs, equipment, policies, market trends, and so on), semantically matching them against internal Big Data and knowledge sources so that company analysts can make better strategic decisions.

35

Book•10.1007/978-3-642-25008-8

Semantic Search over the Web

Roberto De Virgilio, +2 more

- 04 Aug 2012

TL;DR: This book presents an extensive overview of the work done in Semantic Search and related areas, and covers different types of search, such as exploratory and path-oriented search, alongside methods for their efficient and effective implementation.

22

Patent

Forum web page clustering based on repetitive regions

Wei Lai

- 16 Apr 2009

TL;DR: A technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster.

22

Proceedings Article•10.1109/ICTAI.2008.119

Profile-Based Focused Crawler for Social Media-Sharing Websites

Zhiyong Zhang, +1 more

- 03 Nov 2008

TL;DR: A novel profile-based focused crawling system for increasingly popular social media-sharing Web sites that treats users' profiles as ranking criteria for guiding the crawling process.

19

Proceedings Article•10.1145/1458502.1458526

Supporting the automatic construction of entity aware search engines

Lorenzo Blanco, +3 more

- 30 Oct 2008

TL;DR: A method to automatically search the web for pages that publish data representing an instance of a certain conceptual entity. The method is implemented in a system prototype, which has been used to conduct several experiments that produced interesting results.

18

References

Journal Article•10.1016/S1389-1286(99)00052-3

Focused crawling: a new approach to topic-specific Web resource discovery

Soumen Chakrabarti, +2 more

- 17 May 1999

TL;DR: A new hypertext resource discovery system called a Focused Crawler that is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.

1.7K

Proceedings Article•10.1145/872757.872799

Extracting structured data from Web pages

Arvind Arasu, +1 more

- 09 Jun 2003

TL;DR: This paper presents an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages.

764

Journal Article•10.1002/SPE.4380210706

Identifying syntactic differences between two programs

Wuu Yang

- 01 Jun 1991

- Software - Practice and Experience

TL;DR: A comparison algorithm is developed that can point out the differences between two programs more accurately than previous text comparison tools and is based on a dynamic programming scheme.

432

Proceedings Article•10.1145/988672.988740

Automatic web news extraction using tree edit distance

Davi De Castro Reis, +3 more

- 17 May 2004

TL;DR: A domain-oriented approach to Web data extraction is presented and its application to automatically extracting news from Web sites is discussed, based on a highly efficient tree structure analysis that produces very effective results.

390

Proceedings Article•10.1145/371920.371965

Breadth-first crawling yields high-quality pages

Marc Najork, +1 more

- 01 Apr 2001

TL;DR: This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages and uses the connectivity-based metric PageRank to measure the quality of a page.

352