Web Crawler Architecture.

Open Access

- 01 Sep 2009

- pp 3462-3465

39

TL;DR: In order to crawl a substantial fraction of the “surface web” in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers.

Citations

Journal Article•10.1016/J.CSI.2016.02.003

Harvesting Big Data in social science

M. Olmedilla, +2 more

- 01 May 2016

- Computer Standards & Interfaces

TL;DR: This paper outlines an architectural framework and methodology for collecting Big Data from an electronic Word-of-Mouth website containing user-generated content; the approach must also be considered together with complementary disciplines such as data access and computing.

59

Proceedings Article•10.1109/SERVICES.2019.00016

A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence

Paris Koloveas, +3 more

- 08 Jul 2019

TL;DR: This work focuses on the information gathering task, and presents a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.

28

Journal Article•10.2196/17247

Mapping and Modeling of Discussions Related to Gastrointestinal Discomfort in French-Speaking Online Forums: Results of a 15-Year Retrospective Infodemiology Study.

Florent Schäfer, +8 more

- 03 Nov 2020

- Journal of Medical Internet Research

TL;DR: This approach has shown that identifying web-based discussion topics associated with GI discomfort and its perceived factors is feasible and can serve as a complementary source of real-world evidence for caregivers.

20

Proceedings Article•10.1109/SERVICES.2019.00016

A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence

Paris Koloveas, +3 more

- 14 Sep 2021

- arXiv: Cryptography and Security

TL;DR: In this paper, the authors focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.

18

Journal Article•10.1016/J.PROCS.2020.09.185

Bot Detection Model using User Agent and User Behavior for Web Log Analysis.

Takamasa Tanaka, +5 more

- 01 Jan 2020

- Procedia Computer Science

TL;DR: In this paper, the authors propose a method to discriminate between user and bot entries in web access logs so that bot browsing information can be excluded from the analysis target, since it is not uncommon for bots to disguise themselves as users.

17

References

Journal Article•10.1016/S0169-7552(98)00110-X

The anatomy of a large-scale hypertextual Web search engine

Sergey Brin, +1 more

- 01 Apr 1998

TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

16.6K

Proceedings Article

The Anatomy of a Large-Scale Hypertextual Web Search Engine

S. Brin

- 01 Jan 1998

TL;DR: We present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext to produce better search results.

9.7K

Journal Article•10.1002/SPE.587

UbiCrawler: a scalable fully distributed web crawler

Paolo Boldi, +3 more

- 10 Jul 2004

- Software - Practice and Experience

TL;DR: UbiCrawler is a scalable, fully distributed Web crawler implemented in the Java programming language. Its main features are a very effective assignment function for partitioning the domain to crawl and, more generally, the complete decentralization of every task.

648

Journal Article•10.1016/S0169-7552(94)90151-1

The RBSE spider — Balancing effective search against Web load

David Eichmann

- 01 Nov 1994

- Computer Networks and ISDN Systems

TL;DR: The RBSE Spider is a mechanism for exploring World Wide Web structure and indexing the useful material thereby discovered; the paper relates the experience of constructing and operating this spider.

Proceedings Article•10.1109/ICDE.2002.994750

Design and implementation of a high-performance distributed Web crawler

V. Shkapenyuk, +1 more

- 07 Aug 2002

TL;DR: This paper describes the design and implementation of a distributed Web crawler that runs on a network of workstations, scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.
