Web Crawling

doi:10.1561/1500000017

Journal Article•10.1561/1500000017

Christopher Olston, +1 more

- 01 Mar 2010

- Foundations and Trends in Information Re...

- Vol. 4, Iss: 3, pp 175-246

416

TL;DR: This survey outlines the fundamental challenges of web crawling, describes the state-of-the-art models and solutions, and highlights avenues for future work.

Citations

Book

Data-Intensive Text Processing with MapReduce

Jimmy Lin, +1 more

- 30 Apr 2010

TL;DR: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model using the open-source Hadoop implementation, with a focus on scalability and the tradeoffs associated with distributed processing of large datasets.

567

Book Chapter•10.1007/978-3-642-58663-7_2

The dawn of the E-lance economy

Thomas W. Malone, +1 more

- 10 May 1999

- Wirtschaftsinformatik und Angewandte Inf...

TL;DR: The authors discuss how, in 1991, Linus Torvalds, a 21-year-old computer science student at the University of Helsinki, made available on the Internet the kernel of a computer operating system he had written.

356

Patent

Merchant-consumer bridging platform apparatuses, methods and systems

Edward Katzin, +4 more

- 03 Feb 2012

TL;DR: In this article, the MCB-platform components are integrated into transaction records and merchant database updates outputs, and a method is disclosed, comprising: receiving an activity request including merchant information associated with a merchant, retrieving a previously stored merchant record from a database, determining a confidence metric for the merchant information update; retrieving a confidence requirement based on the activity request; determining, within a low-latency processing time-frame, whether the determined confidence metric satisfies the retrieved confidence requirement; performing the requested activity and updating the previously stored record with the verified information update indicia when the determined

294

Journal Article•10.1016/J.INFFUS.2015.06.002

Opinion Mining and Information Fusion

Jorge A. Balazs, +1 more

- 01 Jan 2016

- Information Fusion

TL;DR: Opinion Mining is defined and its most fundamental aspects are described; Information Fusion is explained and used to guide fusion processes in Opinion Mining; and several Opinion Mining studies that rely at some point on the fusion of information are reviewed.

256

Journal Article•10.1177/1094428117722619

Text Mining in Organizational Research

Vladimer Kobayashi, +4 more

- 01 Jul 2018

- Organizational Research Methods

TL;DR: This article describes how text mining may extend contemporary organizational research by allowing existing or new research questions to be tested with data that are likely to be rich, contextualized, and ecologically valid.

223

References

Book Chapter•10.1016/B978-1-4832-1446-7.50035-2

Learning internal representations by error propagation

David E. Rumelhart, +2 more

- 01 Jan 1988

TL;DR: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion.

18.9K

Journal Article•10.1016/S0169-7552(98)00110-X

The anatomy of a large-scale hypertextual Web search engine

Sergey Brin, +1 more

- 01 Apr 1998

TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

16.6K

Proceedings Article

The PageRank Citation Ranking: Bringing Order to the Web

Lawrence Page, +3 more

- 11 Nov 1999

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.

16.4K

Journal Article

The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Sergey Brin, +1 more

- 01 Jan 1998

- Computer Networks

TL;DR: Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13.3K

Journal Article•10.1145/324133.324140

Authoritative sources in a hyperlinked environment

Jon Kleinberg

- 01 Sep 1999

- Journal of the ACM

TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure; the formulation has connections to the eigenvectors of certain matrices associated with the link graph.

10.5K

Related Papers (5)

Focused crawling: a new approach to topic-specific Web resource discovery

The anatomy of a large-scale hypertextual Web search engine

Effective page refresh policies for Web crawlers

UbiCrawler: a scalable fully distributed web crawler

An adaptive model for optimizing performance of an incremental web crawler