Distributed web crawling

Papers published on a yearly basis

Papers

Journal Article•10.1016/S0169-7552(98)00110-X

The anatomy of a large-scale hypertextual Web search engine

Sergey Brin¹, Lawrence Page¹•Institutions (1)

Stanford University¹

1 Apr 1998

TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

16,670 citations

Journal Article•10.1016/S1389-1286(99)00052-3

Focused crawling: a new approach to topic-specific Web resource discovery

Soumen Chakrabarti¹, Martin van den Berg², Byron Dom³•Institutions (3)

Indian Institute of Technology Bombay¹, FX Palo Alto Laboratory², IBM³

17 May 1999

TL;DR: This paper describes a new hypertext resource discovery system called a Focused Crawler, which is robust against large perturbations in the starting set of URLs and capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius.

Abstract: The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links. We report on extensive focused-crawling experiments using several topics at different levels of specificity. Focused crawling acquires relevant pages steadily while standard crawling quickly loses its way, even though they are started from the same root set. Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations. It is also capable of exploring out and discovering valuable resources that are dozens of links away from the start set, while carefully pruning the millions of pages that may lie within this same radius. Our anecdotes suggest that focused crawling is very effective for building high-quality collections of Web documents on specific topics, using modest desktop hardware. © 1999 Published by Elsevier Science B.V. All rights reserved.

1,790 citations

Journal Article•10.1016/S0169-7552(98)00108-1

Efficient crawling through URL ordering

Junghoo Cho¹, Hector Garcia-Molina¹, Lawrence Page¹•Institutions (1)

Stanford University¹

1 Apr 1998

TL;DR: In this paper, the authors study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first, and they show that a good ordering scheme can obtain important pages significantly faster than one without.

Abstract: In this paper we study in what order a crawler should visit the URLs it has seen, in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. We define several importance metrics, ordering schemes, and performance evaluation measures for this problem. We also experimentally evaluate the ordering schemes on the Stanford University Web. Our results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.
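
Illustrative sketch (one possible ordering scheme, not necessarily one evaluated in the paper): the crawler below always visits the seen-but-unvisited URL with the most backlinks observed so far, a simple stand-in for an importance metric. The function name `crawl_order` and the toy `graph` are assumptions for this example.

```python
from collections import defaultdict


def crawl_order(graph, seed, limit):
    """Return the visit order of up to `limit` pages.

    graph: url -> list of outgoing urls (stands in for fetching a page).
    At each step, pick the frontier URL with the highest backlink count
    seen so far; ties are broken alphabetically.
    """
    backlinks = defaultdict(int)
    visited = []
    done = set()
    frontier = {seed}
    while frontier and len(visited) < limit:
        url = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(url)
        done.add(url)
        visited.append(url)
        for target in graph.get(url, []):
            backlinks[target] += 1
            if target not in done:
                frontier.add(target)
    return visited


graph = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": []}
print(crawl_order(graph, "a", 4))  # ['a', 'b', 'd', 'c']
```

A plain breadth-first crawl of the same graph would visit `d` last; with backlink ordering, `d` moves ahead of `c` as soon as a second link to it is seen.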

1,004 citations

Journal Article•10.1002/SPE.587

UbiCrawler: a scalable fully distributed web crawler

Paolo Boldi¹, Bruno Codenotti², Massimo Santini, Sebastiano Vigna¹•Institutions (2)

University of Milan¹, University of Iowa²

10 Jul 2004 - Software: Practice and Experience

TL;DR: UbiCrawler is a scalable, fully distributed Web crawler implemented in Java, whose main features include a very effective assignment function for partitioning the domain to crawl and, more generally, the complete decentralization of every task.

Abstract: We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more in general the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitations of the Java APIs, which prompted the authors to partially reimplement them.
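
Illustrative sketch (generic consistent hashing, not UbiCrawler's actual code): the abstract says the assignment function is based on consistent hashing. A minimal version of that technique maps each host to the crawling agent whose point follows the host's hash on a ring. The class name `HostAssigner`, the agent names, and the choice of 100 replicas per agent are assumptions for this example.

```python
import bisect
import hashlib


def _h(key):
    """Stable 64-bit hash of a string."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HostAssigner:
    """Assign hosts to crawling agents with a consistent-hashing ring."""

    def __init__(self, agents, replicas=100):
        self.replicas = replicas
        self._ring = []  # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        # Each agent owns several points to even out the load.
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{agent}#{i}"), agent))

    def remove(self, agent):
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        if not self._ring:
            raise ValueError("no agents available")
        # First ring point at or after the host's hash, wrapping around.
        i = bisect.bisect_left(self._ring, (_h(host), ""))
        return self._ring[i % len(self._ring)][1]


ring = HostAssigner(["agent-a", "agent-b", "agent-c"])
print(ring.agent_for("example.org"))
```

The property that matters for graceful degradation is that when an agent disappears, only the hosts it owned are reassigned; every other host keeps its agent.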

648 citations

Journal Article•10.1145/2109205.2109208

Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes

Ali Mesbah¹, Arie van Deursen², Stefan Lenselink²•Institutions (2)

University of British Columbia¹, Delft University of Technology²

01 Mar 2012 - ACM Transactions on the Web

TL;DR: This article describes a novel technique for crawling Ajax-based applications through automatic dynamic analysis of user-interface-state changes in Web browsers; the technique incrementally infers a state machine that models the various navigational paths and states within an Ajax application.

Abstract: Using JavaScript and dynamic DOM manipulation on the client side of Web applications is becoming a widespread approach for achieving rich interactivity and responsiveness in modern Web applications. At the same time, such techniques---collectively known as Ajax---shatter the concept of webpages with unique URLs, on which traditional Web crawlers are based. This article describes a novel technique for crawling Ajax-based applications through automatic dynamic analysis of user-interface-state changes in Web browsers. Our algorithm scans the DOM tree, spots candidate elements that are capable of changing the state, fires events on those candidate elements, and incrementally infers a state machine that models the various navigational paths and states within an Ajax application. This inferred model can be used in program comprehension and in analysis and testing of dynamic Web states, for instance, or for generating a static version of the application. In this article, we discuss our sequential and concurrent Ajax crawling algorithms. We present our open source tool called Crawljax, which implements the concepts and algorithms discussed in this article. Additionally, we report a number of empirical studies in which we apply our approach to a number of open-source and industrial Web applications and elaborate on the obtained results.
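
Illustrative sketch (a toy model, not Crawljax): the loop described above (find candidate elements, fire events, record the resulting state) can be shown without a browser by treating each UI state as an opaque string. The function `infer_state_machine`, the callbacks `candidates` and `fire`, and the `APP` dictionary are assumptions for this example; a real crawler would derive states from the DOM tree in a live browser.

```python
from collections import deque


def infer_state_machine(initial, candidates, fire):
    """Explore UI states breadth-first.

    candidates(state) -> iterable of clickable element ids in that state
    fire(state, element) -> the state reached by firing an event on it
    Returns {state: {element: next_state}}.
    """
    machine = {}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state in machine:
            continue  # already explored
        machine[state] = {}
        for element in candidates(state):
            next_state = fire(state, element)
            machine[state][element] = next_state
            if next_state not in machine:
                queue.append(next_state)
    return machine


# Hypothetical single-page application: state -> {element: next state}.
APP = {
    "home": {"#about": "about", "#news": "news"},
    "news": {"#home": "home", "#more": "news-page-2"},
    "about": {"#home": "home"},
    "news-page-2": {"#home": "home"},
}

machine = infer_state_machine(
    "home",
    candidates=lambda s: sorted(APP[s]),
    fire=lambda s, e: APP[s][e],
)
print(sorted(machine))
```

All four states are discovered even though none of them has its own URL, which is the point the abstract makes about Ajax applications.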

382 citations

Year	Papers
2021	1
2019	1
2018	4
2017	11
2016	23
2015	28
