Journal Article, DOI: 10.1561/1500000017
Web Crawling
Christopher Olston, Marc Najork
TL;DR: This survey outlines the fundamental challenges of web crawling, describes the state-of-the-art models and solutions, and highlights avenues for future work.
Abstract: This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
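The breadth-first-search view of crawling mentioned in the abstract can be sketched as follows. This is only a toy illustration under stated assumptions: the link graph is a small in-memory dictionary rather than the live web, and the names `bfs_crawl_order`, `link_graph`, `seeds`, and `max_pages` are hypothetical and do not come from the survey. It deliberately omits the concerns the abstract highlights, such as very large data structures and revisit scheduling.

```python
from collections import deque


def bfs_crawl_order(link_graph, seeds, max_pages=None):
    """Return page identifiers in breadth-first visit order.

    link_graph: dict mapping a page to the list of pages it links to
                (a stand-in for fetching a page and extracting its links).
    seeds:      iterable of starting pages.
    max_pages:  optional cap on the number of pages visited.
    """
    frontier = deque()  # FIFO queue of discovered but unvisited pages
    seen = set()        # pages already discovered, to avoid re-queuing
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            frontier.append(seed)

    order = []
    while frontier:
        if max_pages is not None and len(order) >= max_pages:
            break
        page = frontier.popleft()
        order.append(page)
        # "Fetch" the page: look up its out-links in the toy graph.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


# Hypothetical four-page graph used only for illustration.
toy_graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}
print(bfs_crawl_order(toy_graph, ["a"]))  # ['a', 'b', 'c', 'd']
```

In this sketch the `seen` set and the `frontier` queue both live in memory; at web scale these are exactly the structures that no longer fit on one machine.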
Citations
•Book
Data-Intensive Text Processing with MapReduce
Jimmy Lin, Chris Dyer
- 30 Apr 2010
TL;DR: This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model using the open-source Hadoop implementation, with a focus on scalability and the tradeoffs associated with distributed processing of large datasets.
The dawn of the E-lance economy
TL;DR: In 1991, Linus Torvalds, a 21-year-old computer science student at the University of Helsinki, made available on the Internet a kernel of a computer operating system he had written.
Patent
Merchant-consumer bridging platform apparatuses, methods and systems
Edward Katzin, Phillip Kumnick, Theodore Harris, Patrick Faith, Jennifer Schulz
- 03 Feb 2012
TL;DR: In this article, the MCB-platform components are integrated into transaction records and merchant database updates outputs, and a method is disclosed, comprising: receiving an activity request including merchant information associated with a merchant, retrieving a previously stored merchant record from a database, determining a confidence metric for the merchant information update; retrieving a confidence requirement based on the activity request; determining, within a low-latency processing time-frame, whether the determined confidence metric satisfies the retrieved confidence requirement; performing the requested activity and updating the previously stored record with the verified information update indicia when the determined
Opinion Mining and Information Fusion
TL;DR: Opinion Mining is defined and its most fundamental aspects are described; Information Fusion is explained and used to guide fusion processes in Opinion Mining; and several Opinion Mining studies that rely at some point on the fusion of information are reviewed.
Text Mining in Organizational Research
TL;DR: This article describes how text mining may extend contemporary organizational research by allowing existing or new research questions to be tested with data that are likely to be rich, contextualized, and ecologically valid.
References
Learning internal representations by error propagation
David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams
- 01 Jan 1988
TL;DR: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion.
The anatomy of a large-scale hypertextual Web search engine
Sergey Brin, Lawrence Page
- 01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
•Proceedings Article
The PageRank Citation Ranking: Bringing Order to the Web
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd
- 11 Nov 1999
TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
•Journal Article
The Anatomy of a Large-Scale Hypertextual Web Search Engine.
Sergey Brin, Lawrence Page
TL;DR: Google is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.
Authoritative sources in a hyperlinked environment
TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure; the formulation has connections to the eigenvectors of certain matrices associated with the link graph.