TL;DR: A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1.
Abstract: We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the lp norm. It also yields the first known provably efficient approximate NN algorithm for the case p<1.
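The abstract does not give the hash family itself. As a rough illustration, the sketch below assumes the commonly cited form h(v) = floor((a·v + b) / w) for the p = 2 case, where the entries of a are drawn from a Gaussian (a 2-stable distribution) and b is uniform in [0, w). The function name `make_l2_hash` and the parameters `dim`, `w`, and `seed` are hypothetical.

```python
import math
import random

def make_l2_hash(dim, w, seed=0):
    """Return one hash function of the assumed form floor((a.v + b) / w)."""
    rng = random.Random(seed)
    # Gaussian entries: the Gaussian is 2-stable, which suits the l2 norm.
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Random offset in [0, w) shifts the bucket boundaries.
    b = rng.uniform(0.0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h

h = make_l2_hash(dim=3, w=4.0, seed=1)
print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 2.9]))
```

Nearby points tend to land in the same bucket because their projections onto a differ by little relative to the bucket width w. A full index would combine several such functions per table and use several tables.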
TL;DR: In this paper, a simple dictionary with worst case constant lookup time was presented, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al.
Abstract: We present a simple dictionary with worst case constant lookup time, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al. [SIAM J. Comput. 23 (4) (1994) 738-761]. The space usage is similar to that of binary search trees. Besides being conceptually much simpler than previous dynamic dictionaries with worst case constant lookup time, our data structure is interesting in that it does not use perfect hashing, but rather a variant of open addressing where keys can be moved back in their probe sequences. An implementation inspired by our algorithm, but using weaker hash functions, is found to be quite practical. It is competitive with the best known dictionaries having an average case (but no nontrivial worst case) guarantee on lookup time.
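The abstract describes the mechanism only as open addressing in which keys can be moved back in their probe sequences. The sketch below assumes the two-table scheme generally known as cuckoo hashing, where every key has exactly one candidate slot per table, so a lookup inspects at most two slots. The class name, the salted use of Python's built-in `hash`, and the growth policy are illustrative assumptions, and keys must not be `None`.

```python
import random

class CuckooTable:
    def __init__(self, capacity=11, max_kicks=32, seed=0):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * capacity, [None] * capacity]
        self._new_salts()

    def _new_salts(self):
        # Fresh salts stand in for picking two new hash functions.
        self.salts = [self.rng.getrandbits(32), self.rng.getrandbits(32)]

    def _slot(self, which, key):
        return hash((self.salts[which], key)) % self.capacity

    def lookup(self, key):
        # Worst case: exactly two probes, one per table.
        return any(self.tables[i][self._slot(i, key)] == key for i in (0, 1))

    def insert(self, key):
        if self.lookup(key):
            return
        which = 0
        for _ in range(self.max_kicks):
            pos = self._slot(which, key)
            # Place the key and pick up whatever was evicted from the slot.
            key, self.tables[which][pos] = self.tables[which][pos], key
            if key is None:
                return
            which = 1 - which  # the evicted key moves to its other table
        self._rehash(key)

    def _rehash(self, pending):
        # Too many evictions: grow, choose new salts, and reinsert everything.
        keys = [k for t in self.tables for k in t if k is not None] + [pending]
        self.capacity = 2 * self.capacity + 1
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self._new_salts()
        for k in keys:
            self.insert(k)
```

Insertion cost is only bounded in an amortized, expected sense here; the constant worst-case bound applies to lookups.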
TL;DR: The finding is that the well-known function for hashing sequences of symbols, ELFhash, is not very good in this regard, and the other two functions are better and thus recommended.
Abstract: Hashing a large collection of URLs is an unavoidable problem in many Web research activities. Through a large-scale experiment, three hash functions are compared in this paper. Two metrics were developed for the comparison, which are related to Web structure analysis and Web crawling, respectively. The finding is that the well-known function for hashing sequences of symbols, ELFhash, is not very good in this regard, and the other two functions are better and thus recommended.
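For reference, a minimal Python sketch of ELFhash in its commonly published 32-bit form is given below. The abstract does not include code, so the 32-bit masking and the helper `bucket_for` are assumptions made for illustration, and the other two functions compared in the paper are not named in the abstract.

```python
def elf_hash(text):
    """ELF-style string hash, assuming the usual 32-bit formulation."""
    h = 0
    for byte in text.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000          # top four bits
        if g:
            h ^= g >> 24            # fold them back into the low bits
        h &= ~g & 0xFFFFFFFF        # then clear them
    return h

def bucket_for(url, num_buckets):
    # Hypothetical helper: map a URL to one of num_buckets slots.
    return elf_hash(url) % num_buckets

print(bucket_for("http://example.com/index.html", 1024))
```

Because the top four bits are cleared on every step, the result always fits in 28 bits.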
TL;DR: A method is described that involves hashing a key value to locate a slot in a primary table, then hashing the key value to locate a first slot in a secondary table, and then linearly probing the secondary table starting from that first slot.
Abstract: A method is described that involves hashing a key value to locate a slot in a primary table, then, hashing the key value to locate a first slot in a secondary table, then, linearly probing the secondary table starting from the first slot.
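The three steps can be sketched as follows. The abstract does not say when the secondary table is consulted, so this sketch assumes it serves as overflow when the primary slot is held by a different key; the class name, table sizes, and the way the second hash is derived are all hypothetical, and deletion is not supported.

```python
class TwoLevelTable:
    def __init__(self, primary_size=8, secondary_size=16):
        self.primary = [None] * primary_size      # entries are (key, value)
        self.secondary = [None] * secondary_size

    def _primary_slot(self, key):
        return hash(key) % len(self.primary)

    def _probe(self, key):
        # Second hash picks the first slot; linear probing follows from there.
        n = len(self.secondary)
        start = hash((key, "secondary")) % n
        for step in range(n):
            yield (start + step) % n

    def put(self, key, value):
        p = self._primary_slot(key)
        if self.primary[p] is None or self.primary[p][0] == key:
            self.primary[p] = (key, value)
            return
        for s in self._probe(key):
            if self.secondary[s] is None or self.secondary[s][0] == key:
                self.secondary[s] = (key, value)
                return
        raise RuntimeError("secondary table is full")

    def get(self, key, default=None):
        entry = self.primary[self._primary_slot(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        for s in self._probe(key):
            entry = self.secondary[s]
            if entry is None:          # an empty slot ends the probe run
                return default
            if entry[0] == key:
                return entry[1]
        return default
```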
TL;DR: A survey of existing probabilistic state space exploration methods is given, including bitstate hashing and its improvements, which were introduced in order to lower the probability of producing a wrong result while maintaining the memory and runtime efficiency.
Abstract: Several methods have been developed to validate the correctness and performance of hardware and software systems. One way to do this is to model the system and carry out a state space exploration in order to detect all possible states. In this paper, a survey of existing probabilistic state space exploration methods is given. The paper starts with a thorough review and analysis of bitstate hashing, as introduced by Holzmann. The main idea of this initial approach is the mapping of each state onto a specific bit within an array by employing a hash function. Thus a state is represented by a single bit, rather than by a full descriptor. Bitstate hashing is efficient concerning memory and runtime, but it is hampered by the nondeterministic omission of states. The resulting positive probability of producing wrong results is due to the fact that the mapping of full state descriptors onto much smaller representatives is not injective. The rest of the paper is devoted to the presentation, analysis, and comparison of improvements of bitstate hashing, which were introduced in order to lower the probability of producing a wrong result while maintaining the memory and runtime efficiency. These improvements fall mainly into two categories. The approaches of the first group, the so-called multiple hashing schemes, employ multiple hash functions on either a single array or multiple arrays. The approaches of the second category follow the idea of hash compaction: they store a hash value for each detected state, rather than associating one or more bit positions with it, which leads to a considerable reduction of the probability of error compared to the original bitstate hashing scheme.
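The basic bitstate idea can be illustrated with a small breadth-first exploration that records one bit per state. This is a sketch under assumptions: the function name, the use of SHA-256 over `repr(state)` as the hash, and the callback `successors` are illustrative choices, not taken from the surveyed papers.

```python
from collections import deque
import hashlib

def bitstate_explore(initial, successors, num_bits):
    """Explore reachable states, remembering each only as one bit."""
    bits = bytearray((num_bits + 7) // 8)

    def bit_index(state):
        digest = hashlib.sha256(repr(state).encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_bits

    def test_and_set(state):
        i = bit_index(state)
        byte, mask = i // 8, 1 << (i % 8)
        seen = bool(bits[byte] & mask)
        bits[byte] |= mask
        return seen

    test_and_set(initial)
    queue = deque([initial])
    visited = 1
    while queue:
        state = queue.popleft()
        for nxt in successors(state):
            # A hash collision makes a new state look visited, so it is
            # silently skipped: this is the omission the survey discusses.
            if not test_and_set(nxt):
                visited += 1
                queue.append(nxt)
    return visited

def step(s):
    return [(s + 1) % 50, (s * 2) % 50]

print(bitstate_explore(0, step, 1 << 20))
```

The count returned can never exceed the true number of reachable states or the number of bits, but it may fall short when two states map to the same bit.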
TL;DR: An Ω(log log n) universal lower bound is proved on the worst-case search time of any two-way linear probing algorithm, where n is the hash table size.
Abstract: Two-way chaining is a novel hashing scheme that uses two independent truly uniform hash functions f and g to insert m keys into a hash table with n chains, where each key x is inserted into the shorter of the chains f(x) and g(x), breaking ties randomly. It is known [13, 18] that the worst-case search time of two-way chaining is log_2 log n + m/n + O(1), asymptotically almost surely. In this thesis, we study the two-way chaining paradigm under different assumptions.
First, we generalize the result to nonuniform hash functions. We analyze two-way chaining in the fixed density model where the two independent hash functions behave according to two densities defined on the unit interval. When m = Ω(n), we prove that asymptotically almost surely, the worst-case search time is at least log_2 log n - O(1). If, in addition, the densities are bounded, then it is at most log_2 log n + O(m/n).
Secondly, we consider the off-line version of two-way chaining where all the hash values available for the m keys are known in advance. For constant k ∈ N, we show that there is a threshold c_k such that if m ≤ c_k n, then one can assign the keys to the chains so that the maximum search time is at most 2k, asymptotically almost surely. We tightly estimate c_k, and prove that it is, in fact, asymptotic to k. Algorithms for finding such assignments are also given.
Thirdly, we utilize the two-way chaining paradigm to design efficient open addressing hashing schemes. We study two-way linear probing algorithms. These are algorithms that employ two independent linear probe sequences to hash the keys. We prove an Ω(log log n) universal lower bound on the worst-case search time of any two-way linear probing algorithm, where n is the hash table size. We show, however, that some simple two-way linear probing algorithms, unexpectedly, have implausible worst-case performances. Subsequently, we present several efficient two-way linear probing algorithms whose performance matches the lower bound. Simulations back up the theoretical results.
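The on-line insertion rule for two-way chaining described at the start of the abstract can be sketched as below. The salted built-in `hash` only approximates the truly uniform functions f and g that the analysis assumes, and the class and method names are hypothetical.

```python
import random

class TwoWayChaining:
    def __init__(self, num_chains, seed=0):
        self.chains = [[] for _ in range(num_chains)]
        self.rng = random.Random(seed)
        # Two salts stand in for the independent hash functions f and g.
        self.salt_f = self.rng.getrandbits(64)
        self.salt_g = self.rng.getrandbits(64)

    def _candidates(self, key):
        n = len(self.chains)
        return hash((self.salt_f, key)) % n, hash((self.salt_g, key)) % n

    def insert(self, key):
        i, j = self._candidates(key)
        if len(self.chains[i]) < len(self.chains[j]):
            target = i
        elif len(self.chains[j]) < len(self.chains[i]):
            target = j
        else:
            target = self.rng.choice((i, j))   # break ties randomly
        self.chains[target].append(key)

    def contains(self, key):
        # A search has to scan both candidate chains.
        i, j = self._candidates(key)
        return key in self.chains[i] or key in self.chains[j]

    def max_chain_length(self):
        return max(len(c) for c in self.chains)

table = TwoWayChaining(num_chains=1000)
for k in range(1000):
    table.insert(k)
print(table.max_chain_length())
```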
TL;DR: A new cache protocol based on consistent hashing, called extended consistent hashing (ECH), is described, which can handle flash access to objects significantly better and yields better worst-case response times and lower load variance.
Abstract: Content caching and location are key enabling technologies for achieving the high throughput needed to sustain current Internet infrastructure, both for peer-to-peer as well as client-server applications. An important aspect of distributed caching techniques is the mapping of data and requests to maximize system throughput while minimizing costs in the presence of network and cache failures. We describe a new cache protocol based on consistent hashing (CH) [D. Karger et al., (1997), (1999)]. Compared to consistent hashing, our protocol, called extended consistent hashing (ECH), can handle flash access to objects significantly better and yields better worst-case response times and lower load variance. Due to multiplicity of client views in a distributed hashing scheme, a single object (or its reference) may be cached at multiple locations. This is referred to as the spread of an object. Consistent hashing maps a request to a cache irrespective of the spread of the requested object. ECH, on the other hand, estimates the spread of an object and randomizes requests over expected spread. In doing so, it amortizes requests over a larger number of caches. While the expected load on target caches in ECH remains the same as consistent hashing (asymptotically optimal), load variance is significantly reduced. We present analytical results as well as simulations to demonstrate significant improvements for querying frequently accessed objects, up to 80% in worst-case response time and 30% in variance of server/target cache loads. We also show excellent correlation between expected and observed results. What makes ECH particularly attractive is that it can be integrated into existing infrastructure based on consistent hashing with minimal software overhead.
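As background for the comparison, the sketch below shows plain consistent hashing, the baseline that ECH extends; it does not implement the spread estimation or request randomization of ECH, whose details the abstract does not give. The class name, the use of SHA-256, and the `replicas` parameter (virtual points per cache) are assumptions.

```python
import bisect
import hashlib

def _point(label):
    """Map a string to a position on a 64-bit ring."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, caches, replicas=50):
        # Each cache owns several points on the ring to even out the load.
        self.ring = sorted(
            (_point(f"{cache}#{r}"), cache)
            for cache in caches
            for r in range(replicas)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, object_id):
        # The first cache point clockwise from the object's position owns it.
        i = bisect.bisect_right(self.points, _point(object_id)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.lookup("/videos/popular.mp4"))
```

Adding a cache only moves objects onto the new cache; every other object keeps its previous owner, which is the property that makes the scheme robust to membership changes.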
TL;DR: Improved exponential hashing has the ability to spread table elements more randomly than the widely used double hashing, and at the same time produces full length probe sequences on all table elements.
Abstract: A new and efficient open addressing technique, called improved exponential hashing, is proposed. We show that improved exponential hashing has the ability to spread table elements more randomly than the widely used double hashing, and at the same time produces full length probe sequences on all table elements. We demonstrate experimentally that improved exponential hashing performs significantly better than double hashing for clustered data. Also, some theoretic analysis is provided along with the experimental results.
TL;DR: This work proposes a mathematical analysis to evaluate the performance of external hashing with separate chain for two cases and provides an approach to clarify the relationship between the insertion order of keys and the position at which each key is located.
Abstract: External hashing with separate chain is a well-known method for dealing with the collision problem when a hashing technique is employed. The performance of external hashing with separate chain depends on the data structure of the separate chain. We provide an approach to clarify the relationship between the insertion order of keys and the position at which each key is located. Introducing the probability distribution of the frequency of access to each individual key in the separate chain into the analysis of search cost, we propose a mathematical analysis to evaluate the performance of external hashing with separate chain for two cases. Some experimental results obtained from the proposed formulae are also presented.
TL;DR: A mathematical analysis is proposed to exactly analyze and evaluate the performance of open hashing algorithm and some interesting test results obtained from the proposed formulae are presented.
Abstract: Hashing is one of the most important techniques for sorting and searching. Two problems must be resolved when hashing is applied: how to design a good hash function and how to deal with collisions. First, we provide an evaluation system for hashing algorithms using some popular hash functions and show some evaluation results. Next, we present an analysis of the probability that a collision occurs. Introducing the probability distribution of the frequency of access to each individual key in the separate chain into the analysis of search cost, we propose a mathematical analysis to exactly analyze and evaluate the performance of the open hashing algorithm. Some interesting test results obtained from the proposed formulae are also presented.
TL;DR: An adaptive hashing scheme is proposed that works on dynamic key sets and still enables keys to be searched in constant time and, if the hash functions are carefully chosen, then the space requirement of the hash structure is O(n).
Abstract: Hashing is an important tool in randomized algorithms, with applications in fields as diverse as information retrieval, data mining, cryptology and parallel algorithms. However, the worst-case behavior of regular hash-based searching is O(n). Perfect hashing is a solution to this problem that offers a worst-case performance of O(1), but only for a static key set. In this paper we propose an adaptive hashing scheme that works on dynamic key sets and still enables keys to be searched in constant time. It is further established that, if the hash functions are carefully chosen, then the space requirement of the hash structure is O(n).
TL;DR: This work considers open addressing hashing implemented with the Robin Hood strategy; that is, in case of collision, the element that has traveled the farthest can stay in the slot. The resulting maximum search time virtually matches the performance of multiple-choice hash methods.
Abstract: We consider open addressing hashing and implement it by using the Robin Hood strategy; that is, in case of collision, the element that has traveled the farthest can stay in the slot. We hash $\sim \alpha n$ elements into a table of size n where each probe is independent and uniformly distributed over the table, and $\alpha < 1$ is a constant. Let $M_n$ be the maximum search time for any of the elements in the table. We show that with probability tending to one, $M_n \in [ \log_2 \log n + \sigma, \log_2 \log n + \tau ]$ for some constants $\sigma, \tau$ depending upon $\alpha$ only. This is an exponential improvement over the maximum search time in case of the standard FCFS (first-come, first-served) collision strategy and virtually matches the performance of multiple-choice hash methods.
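The collision rule can be sketched as follows. To stay close to the model in the abstract, the i-th probe of a key is derived from a hash of (key, i) rather than from linear probing; that derivation, the class name, and the bookkeeping field `max_probe` are assumptions for illustration, and deletion is not supported.

```python
class RobinHoodTable:
    def __init__(self, size, salt=0x9E3779B9):
        self.slots = [None] * size     # each entry is (key, probe_index)
        self.salt = salt
        self.count = 0
        self.max_probe = -1            # largest probe index of any stored key

    def _probe(self, key, i):
        # Pseudo-random i-th probe position for this key.
        return hash((self.salt, key, i)) % len(self.slots)

    def insert(self, key):
        if self.contains(key):
            return
        if self.count >= len(self.slots) - 1:
            raise RuntimeError("table too full")
        cur, i = key, 0
        while True:
            if i > 64 * len(self.slots):
                raise RuntimeError("probe sequence did not find a slot")
            s = self._probe(cur, i)
            occupant = self.slots[s]
            if occupant is None:
                self.slots[s] = (cur, i)
                self.max_probe = max(self.max_probe, i)
                self.count += 1
                return
            if occupant[1] < i:
                # The arriving key has traveled farther, so it takes the slot
                # and the previous occupant resumes its own probe sequence.
                self.slots[s] = (cur, i)
                self.max_probe = max(self.max_probe, i)
                cur, i = occupant
            i += 1

    def contains(self, key):
        for i in range(self.max_probe + 1):
            occupant = self.slots[self._probe(key, i)]
            if occupant is None:
                return False           # the key would have settled here
            if occupant[0] == key:
                return True
        return False
```

In this sketch `max_probe + 1` plays the role of the maximum search time $M_n$: no stored key needs more probes than that to be found.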
TL;DR: This paper proposes a geometry-invariant image hashing scheme, which can be employed for content copy detection and tracing and exhaustive experimental results obtained from benchmark attacks have confirmed the performance of the proposed method.
Abstract: Due to the desired non-invasive property, non-data hiding (called media hashing here) is considered to be an alternative for many applications previously accomplished with watermarking. Recently, media hashing techniques for content identification have been gradually emerging. However, none of them are really resistant against geometrical attacks. In this paper, our aim is to propose a geometry-invariant image hashing scheme, which can be employed for content copy detection and tracing. Our system is mainly composed of three components: (i) robust mesh extraction; (ii) mesh-based robust hash extraction; and (iii) hash matching for similarity measurement. Exhaustive experimental results obtained from benchmark attacks have confirmed the performance of the proposed method.