Proceedings Article, DOI: 10.1145/1148170.1148223
Structure-driven crawler generation by example
Márcio L. A. Vidal,Altigran Soares da Silva,Edleno Silva de Moura,João M. B. Cavalcanti +3 more
- 06 Aug 2006
- pp 292-299
60
TL;DR: Presents a structure-driven approach for generating Web crawlers that requires minimal effort from users. The crawlers are based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to a sample page.
Abstract: Many Web IR and Digital Library applications require a crawling process to collect pages with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications, the criteria to determine when a page is to be present in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides better criteria to guide the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site and generate a structure-driven crawler based on navigation patterns, that is, sequences of patterns for the links a crawler has to follow to reach the pages structurally similar to the sample page. In the experiments we have carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the samples given, including those pages added after their generation.
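To make the notion of a navigation pattern concrete, here is a minimal Python sketch of how a crawler might apply one. It is not the authors' implementation: the abstract does not say how patterns are represented or generated, so the sketch assumes each pattern is a regular expression over link URLs, and it replaces real page fetching with a hypothetical in-memory site (`SITE`). The names `crawl`, `SITE`, and `PATTERN` are illustrative only, and the pattern-generation step (from a sample page and an entry point) is not shown.

```python
import re
from typing import Dict, List

# Hypothetical in-memory site: URL -> outgoing links.
# This stands in for fetching pages and extracting their links.
SITE: Dict[str, List[str]] = {
    "/": ["/artists", "/about"],
    "/artists": ["/artists/a", "/artists/b", "/contact"],
    "/artists/a": ["/artist/1", "/artist/2"],
    "/artists/b": ["/artist/3", "/about"],
}

# Assumed representation of a navigation pattern: one link regex per
# navigation step, from the entry point down to the target pages.
PATTERN: List[str] = [r"/artists", r"/artists/[a-z]", r"/artist/\d+"]


def crawl(entry: str, navigation_pattern: List[str],
          site: Dict[str, List[str]]) -> List[str]:
    """Follow only links matching the pattern for each step, in order."""
    frontier = [entry]
    for link_pattern in navigation_pattern:
        regex = re.compile(link_pattern)
        seen = set()
        next_frontier = []
        for url in frontier:
            for link in site.get(url, []):
                # Keep a link only if it matches this step's pattern.
                if regex.fullmatch(link) and link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    # After the last step, the frontier holds the target pages.
    return frontier


if __name__ == "__main__":
    print(crawl("/", PATTERN, SITE))
```

Because the crawler follows link patterns rather than a fixed list of URLs, a page added later (for example, a new `/artist/4` link under `/artists/b`) would also be reached without regenerating the pattern, which is consistent with the behavior the abstract reports for pages added after crawler generation.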
Citations
An intelligent system for focused crawling from Big Data sources
TL;DR: In this paper, the authors propose an intelligent system to support organizations in the focused crawling of artefacts of interest from the web (calls for tender, BIMs, equipment, policies, market trends, and so on), semantically matching them against internal Big Data and knowledge sources so that company analysts can make better strategic decisions.
35
Semantic Search over the Web
Roberto De Virgilio,Francesco Guerra,Yannis Velegrakis +2 more
- 04 Aug 2012
TL;DR: This book presents an extensive overview of the work done in Semantic Search and other related areas, and presents different types of search, like the exploratory or the path-oriented, alongside methods for their efficient and effective implementation.
22
Patent
Forum web page clustering based on repetitive regions
Wei Lai
- 16 Apr 2009
TL;DR: In this paper, the authors propose a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster.
22
Profile-Based Focused Crawler for Social Media-Sharing Websites
Zhiyong Zhang,Olfa Nasraoui +1 more
- 03 Nov 2008
TL;DR: A novel profile based focused crawling system for dealing with increasingly popular social media-sharing Web sites that treats users' profiles as ranking criteria for guiding the crawling process.
19
Supporting the automatic construction of entity aware search engines
Lorenzo Blanco,Valter Crescenzi,Paolo Merialdo,Paolo Papotti +3 more
- 30 Oct 2008
TL;DR: A method for automatically searching the web for pages that publish data representing an instance of a given conceptual entity. The method is implemented in a system prototype, which has been used to conduct several experiments that have produced interesting results.
References
Focused crawling: a new approach to topic-specific Web resource discovery
Soumen Chakrabarti,Martin van den Berg,Byron Dom +2 more
- 17 May 1999
TL;DR: A new hypertext resource discovery system called a Focused Crawler that is robust against large perturbations in the starting set of URLs, and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.
Extracting structured data from Web pages
Arvind Arasu,Hector Garcia-Molina +1 more
- 09 Jun 2003
TL;DR: This paper presents an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages.
Identifying syntactic differences between two programs
TL;DR: A comparison algorithm is developed that can point out the differences between two programs more accurately than previous text comparison tools and is based on a dynamic programming scheme.
432
Automatic web news extraction using tree edit distance
Davi De Castro Reis,Paulo B. Golgher,Altigran Soares da Silva,Alberto H. F. Laender +3 more
- 17 May 2004
TL;DR: A domain-oriented approach to Web data extraction is presented and its application to automatically extracting news from Web sites is discussed, based on a highly efficient tree structure analysis that produces very effective results.
Breadth-first crawling yields high-quality pages
Marc Najork,Janet L. Wiener +1 more
- 01 Apr 2001
TL;DR: This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages and uses the connectivity-based metric PageRank to measure the quality of a page.
352
Related Papers (5)
Yan Guo,Kui Li,Kai Zhang,Gang Zhang +3 more
- 18 Dec 2006
Gurmeet Singh Manku,Arvind Jain,Anish Das Sarma +2 more
- 08 May 2007
Sergey Brin,Lawrence Page +1 more
- 01 Apr 1998