Open Access
Web Crawler Architecture.
Marc Najork
- 01 Sep 2009
- pp 3462-3465
TL;DR: In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers.
Abstract: Definition: A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. Moreover, they are used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the scale of the web. In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the “frontier” set of yet-to-be-crawled URLs and the set of discovered URLs – typically do not fit into main memory, so efficient disk-based representations need to be used. Finally, the need to be “polite” to content providers and not to overload any particular web server, and a desire to prioritize the crawl towards high-quality pages and to maintain corpus freshness, impose additional engineering challenges.
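The crawl loop described in the abstract can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the architecture from the paper: the function name `crawl`, its parameters (`fetch`, `max_pages`, `min_delay`, `clock`, `sleep`), and the `LinkExtractor` helper are all assumptions made for the example. Both the frontier and the discovered set are held in memory here, which is exactly what the abstract says does not scale, and politeness is reduced to a fixed minimum delay between requests to the same host.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100, min_delay=1.0,
          clock=time.monotonic, sleep=time.sleep):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a caller-supplied function (hypothetical) that returns
    the page's HTML as a string, or None if the download failed.
    """
    frontier = deque(seeds)   # yet-to-be-crawled URLs, FIFO order
    discovered = set(seeds)   # every URL ever added to the frontier
    last_hit = {}             # host -> time of the most recent request
    pages = {}                # url -> downloaded HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        host = urlsplit(url).netloc

        # Politeness: wait until min_delay has passed since the last
        # request to this host.
        wait = last_hit.get(host, float("-inf")) + min_delay - clock()
        if wait > 0:
            sleep(wait)

        html = fetch(url)
        last_hit[host] = clock()
        if html is None:
            continue
        pages[url] = html

        # Extract hyperlinks, resolve them against the page URL, drop
        # fragments, and enqueue the ones not seen before.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in discovered:
                discovered.add(link)
                frontier.append(link)

    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets it be exercised against an in-memory dictionary of pages. A production crawler would additionally need the disk-based frontier and discovered-set representations, distribution across machines, and the prioritization and freshness policies that the abstract mentions.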
Citations
Harvesting Big Data in social science
TL;DR: An architectural framework and methodology for collecting Big Data from an electronic word-of-mouth website containing user-generated content is outlined; it must also be considered together with complementary disciplines such as data access and computing.
59
A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
Paris Koloveas, Thanasis Chantzios, Christos Tryfonopoulos, Spiros Skiadopoulos
- 08 Jul 2019
TL;DR: This work focuses on the information gathering task, and presents a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.
28
Mapping and Modeling of Discussions Related to Gastrointestinal Discomfort in French-Speaking Online Forums: Results of a 15-Year Retrospective Infodemiology Study.
Florent Schäfer, Carole Faviez, Paméla Voillot, P. Foulquié, Matthieu Najm, Jean-François Jeanne, Guy Fagherazzi, Stéphane Schück, Boris Le Nevé
TL;DR: This approach has shown that identifying web-based discussion topics associated with GI discomfort and its perceived factors is feasible and can serve as a complementary source of real-world evidence for caregivers.
Bot Detection Model using User Agent and User Behavior for Web Log Analysis.
Takamasa Tanaka, Hidekazu Niibori, Shiyingxue Li, Shimpei Nomura, Hiroki Kawashima, Kazuhiko Tsuda
TL;DR: In this paper, the authors propose a method to discriminate between user and bot entries in web access logs in order to exclude bot browsing information from the analysis target, since it is not uncommon for bots to disguise themselves by presenting user-like attributes.
17