TL;DR: A family of caching protocols for distrib-uted networks that can be used to decrease or eliminate the occurrence of hot spots in the network, based on a special kind of hashing that is called consistent hashing.
Abstract: We describe a family of caching protocols for distrib-uted networks that can be used to decrease or eliminate the occurrence of hot spots in the network. Our protocols are particularly designed for use with very large networks such as the Internet, where delays caused by hot spots can be severe, and where it is not feasible for every server to have complete information about the current state of the entire network. The protocols are easy to implement using existing network protocols such as TCP/IP, and require very little overhead. The protocols work with local control, make efficient use of existing resources, and scale gracefully as the network grows. Our caching protocols are based on a special kind of hashing that we call consistent hashing. Roughly speaking, a consistent hash function is one which changes minimally as the range of the function changes. Through the development of good consistent hash functions, we are able to develop caching protocols which do not require users to have a current or even consistent view of the network. We believe that consistent hash functions may eventually prove to be useful in other applications such as distributed name servers and/or quorum systems.
TL;DR: This work studies, by analysis and simulation, the performance of extendible hashing and indicates that it provides an attractive alternative to other access methods, such as balanced trees.
Abstract: Extendible hashing is a new access technique, in which the user is guaranteed no more than two page faults to locate the data associated with a given unique identifier, or key. Unlike conventional hashing, extendible hashing has a dynamic structure that grows and shrinks gracefully as the database grows and shrinks. This approach simultaneously solves the problem of making hash tables that are extendible and of making radix search trees that are balanced. We study, by analysis and simulation, the performance of extendible hashing. The results indicate that extendible hashing provides an attractive alternative to other access methods, such as balanced trees.
TL;DR: UbiCrawler as discussed by the authors is a scalable distributed Web crawler using the Java programming language, which has a very effective assignment function for partitioning the domain to crawl, and more in general the complete decentralization of every task.
Abstract: We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more in general the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitations of the Java APIs, which prompted the authors to partially reimplement them.
TL;DR: In this paper, a record in the file is, in general, found in one access, while the load may stay practically constant up to 90 %. No other algorithms attaining such a performance are known.
Abstract: Linear hashing is a hashing in which the address space may grow or shrink dynamically. A file or a table may then support any number of insertions or deletions without access or memory load performance deterioration. A record in the file is, in general, found in one access, while the load may stay practically constant up to 90 %. A record in a table is found in a mean of 1.7 accesses, while the load is constantly 80 %. No other algorithms attaining such a performance are known.
TL;DR: Only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability, leading to less computation and potentially less need for randomness in practice.
Abstract: A standard technique from the hashing literature is to use two hash functions h1(x) and h2(x) to simulate additional hash functions of the form gi(x) = h1(x) + i h2(x). We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability. This leads to less computation and potentially less need for randomness in practice.