TL;DR: Extensive experiments performed on four large datasets with up to one million samples show that the discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes.
Abstract: Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes that capture the local neighborhoods well. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes.
TL;DR: Density Sensitive Hashing (DSH) as discussed by the authors is an extension of locality sensitive hashing (LSH) which avoids the purely random projections selection and uses those projective functions which best agree with the distribution of the data.
Abstract: Nearest neighbor search is a fundamental problem in various research fields like machine learning, data mining and pattern recognition. Recently, hashing-based approaches, for example, locality sensitive hashing (LSH), are proved to be effective for scalable high dimensional nearest neighbor search. Many hashing algorithms found their theoretic root in random projection. Since these algorithms generate the hash tables (projections) randomly, a large number of hash tables (i.e., long codewords) are required in order to achieve both high precision and recall. To address this limitation, we propose a novel hashing algorithm called density sensitive hashing (DSH) in this paper. DSH can be regarded as an extension of LSH. By exploring the geometric structure of the data, DSH avoids the purely random projections selection and uses those projective functions which best agree with the distribution of the data. Extensive experimental results on real-world data sets have shown that the proposed method achieves better performance compared to the state-of-the-art hashing approaches.
TL;DR: The design, implementation, and evaluation of a high-throughput and memory-efficient concurrent hash table that supports multiple readers and writers is presented, and performance results demonstrate that the new hash table design, based around optimistic cuckoo hashing, outperforms other optimized concurrent hash tables by up to 2.5x for write-heavy workloads, even while using substantially less memory for small key-value items.
Abstract: Fast concurrent hash tables are an increasingly important building block as we scale systems to greater numbers of cores and threads. This paper presents the design, implementation, and evaluation of a high-throughput and memory-efficient concurrent hash table that supports multiple readers and writers. The design arises from careful attention to systems-level optimizations such as minimizing critical section length and reducing interprocessor coherence traffic through algorithm re-engineering. As part of the architectural basis for this engineering, we include a discussion of our experience and results applying Intel's recent hardware transactional memory (HTM) support to this critical building block. We find that naively allowing concurrent access using a coarse-grained lock on existing data structures reduces overall performance with more threads. While HTM mitigates this slowdown somewhat, it does not eliminate it. Algorithmic optimizations that benefit both HTM and designs for fine-grained locking are needed to achieve high performance. Our performance results demonstrate that our new hash table design, based around optimistic cuckoo hashing, outperforms other optimized concurrent hash tables by up to 2.5x for write-heavy workloads, even while using substantially less memory for small key-value items. On a 16-core machine, our hash table executes almost 40 million insert and more than 70 million lookup operations per second.
TL;DR: This paper presents a lock-free cuckoo hashing algorithm that allows mutating operations to operate concurrently with query ones and requires only single word compare-and-swap primitives.
Abstract: This paper presents a lock-free cuckoo hashing algorithm; to the best of our knowledge, it is the first lock-free cuckoo hashing algorithm in the literature. The algorithm allows mutating operations to operate concurrently with query ones and requires only single word compare-and-swap primitives. Queries can operate concurrently with other mutating operations, thanks to a two-round query protocol enhanced with a logical clock technique. When an insertion triggers a sequence of key displacements, instead of locking the whole cuckoo path, our algorithm breaks down the chain of relocations into several single relocations which can be executed independently and concurrently with other operations. A fine-tuned synchronization and a helping mechanism for relocation are designed. The mechanisms allow high concurrency and provide progress guarantees for the data structure's operations. Our experimental results show that our lock-free cuckoo hashing performs consistently better than two efficient lock-based hashing algorithms, the chained and the hopscotch hash-map, in different access pattern scenarios.
TL;DR: A method for controlling access of a cache includes at least the following steps: receiving a memory address; utilizing hashing address logic to perform a programmable hash function upon at least a portion of the memory address to generate a hashing address; and determining an index of the cache based at least partly on the hashing address.
Abstract: A method for controlling access of a cache includes at least the following steps: receiving a memory address; utilizing hashing address logic to perform a programmable hash function upon at least a portion of the memory address to generate a hashing address; and determining an index of the cache based at least partly on the hashing address.
TL;DR: In this article, the authors provide precise specifications for j-lanes tree hashing and j-pointers tree hashing, propose appropriate IVs, and demonstrate their performance on the latest processors.
Abstract: j-lanes tree hashing is a tree mode that splits an input message into j slices, computes j independent digests of each slice, and outputs the hash value of their concatenation. j-pointers tree hashing is a similar tree mode that receives, as input, j pointers to j messages (or slices of a single message), computes their digests and outputs the hash value of their concatenation. Such modes expose parallelization opportunities in a hashing process that is otherwise serial by nature. As a result, they have a performance advantage on modern processor architectures. This paper provides precise specifications for these hashing modes, proposes appropriate IVs, and demonstrates their performance on the latest processors. Our hope is that it would be useful for standardization of these modes.
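The j-lanes idea described above can be illustrated with a minimal Python sketch. It is not the paper's specification: the use of SHA-256, the contiguous near-equal slicing, and the absence of custom IVs are all assumptions made for illustration, and the function name `j_lanes_hash` is hypothetical.

```python
import hashlib


def j_lanes_hash(message: bytes, j: int = 4) -> bytes:
    """Toy j-lanes tree hash: digest j slices independently, then hash the concatenation."""
    if j < 1:
        raise ValueError("j must be at least 1")
    # Assumption: contiguous, near-equal slices; the paper's slicing rule and IVs may differ.
    slice_len = -(-len(message) // j)  # ceiling division (0 for an empty message)
    slices = [message[i * slice_len:(i + 1) * slice_len] for i in range(j)]
    # The lane digests do not depend on each other, so they could be computed in parallel.
    lane_digests = [hashlib.sha256(s).digest() for s in slices]
    # The final hash covers the concatenation of the j lane digests.
    return hashlib.sha256(b"".join(lane_digests)).digest()
```

Because each lane digest is independent of the others, the list comprehension over slices is the part that a SIMD or multi-threaded implementation would parallelize.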
TL;DR: A Boosting based formulation for supervised learning of the hash functions that is based on Error Correcting Codes is proposed, showing that the training accuracy in Boosting can be considered as a lower bound on the (empirical) Mean Average Precision (mAP) score.
Abstract: One widely-used solution to expedite similarity search of multimedia data is to construct hash functions to map the data into a Hamming space where linear search is known to be fast and often sublinear solutions perform well. In this paper, we propose a Boosting based formulation for supervised learning of the hash functions that is based on Error Correcting Codes. This approach allows us to apply established theoretical results for Boosting in our analysis of our hashing solution. Specifically, we show that the training accuracy in Boosting can be considered as a lower bound on the (empirical) Mean Average Precision (mAP) score. In experiments with three image retrieval benchmarks, the proposed formulation yields significant improvement in mAP over state-of-the-art supervised hashing methods, while using fewer bits in the hash codes.
TL;DR: A novel data-driven hashing method called forest hashing, which utilizes multiple tree structures to perform data hashing by leveraging the index structure of trees, which can significantly improve the hashing efficacy by generating balanced hash buckets.
Abstract: Indexing images and videos using binary hash bits has shown promising results for fast similarity search. Existing data-driven hashing methods learn compact hash codes from the data, but usually at the cost of generating unbalanced hash buckets, thus affecting the search efficiency. We propose a novel data-driven hashing method called forest hashing, which utilizes multiple tree structures to perform data hashing. By leveraging the index structure of trees, we can significantly improve the hashing efficacy by generating balanced hash buckets. Moreover, forest hashing naturally supports scalable coding where more trees can improve the coding quality with a longer code. Last but not least, our forest hashing can be easily extended for semantic search by integrating semi-supervised label information. Experiments on two benchmark datasets show favorable results compared with the state-of-the-art hashing methods.
TL;DR: This paper defines the practical and formal security model of hashing schemes for graphs, and describes constructions of hashing and perfectly secure hashing of graphs, which are highly efficient for hashing, redaction, and verification of hashed graphs.
Abstract: Use of graph-structured data models is on the rise: in graph databases, and in representing biological and healthcare data as well as geographical data. In order to secure graph-structured data, and develop cryptographically secure schemes for graph databases, it is essential to formally define and develop suitable collision-resistant one-way hashing schemes and show that they are efficient. The widely used Merkle hash technique is not suitable as it is, because graphs may be directed acyclic ones or cyclic ones. In this paper, we address this problem. Our contributions are: (1) define the practical and formal security model of hashing schemes for graphs, (2) define the formal security model of perfectly secure hashing schemes, (3) describe constructions of hashing and perfectly secure hashing of graphs, and (4) present performance results for the constructions. Our constructions use graph traversal techniques, and are highly efficient for hashing, redaction, and verification of hashed graphs. We have implemented the proposed schemes, and our performance analysis on both real and synthetic graph data sets supports our claims.
TL;DR: This paper shows that linear probing, a classical collision resolution strategy for hash tables, can be easily made cache-oblivious but it only achieves $t_q = 1 + \Theta(\alpha/b)$ even if a truly random hash function is used, and demonstrates that the block probing algorithm achieves $t_q = 1 + 1/2^{\Omega(b)}$, thus matching the cache-aware bound.
Abstract: The hash table, especially its external memory version, is one of the most important index structures in large databases. Assuming a truly random hash function, it is known that in a standard external hash table with block size $b$, searching for a particular key only takes expected average $t_q = 1 + 1/2^{\Omega(b)}$ disk accesses for any load factor $\alpha$ bounded away from 1. However, such near-perfect performance is achieved only when $b$ is known and the hash table is particularly tuned for working with such a blocking. In this paper we study if it is possible to build a cache-oblivious hash table that works well with any blocking. Such a hash table will automatically perform well across all levels of the memory hierarchy and does not need any hardware-specific tuning, an important feature in autonomous databases.
We first show that linear probing, a classical collision resolution strategy for hash tables, can be easily made cache-oblivious but it only achieves $t_q = 1 + \Theta(\alpha/b)$ even if a truly random hash function is used. Then we demonstrate that the block probing algorithm (Pagh et al. in SIAM Rev. 53(3):547-558, 2011) achieves $t_q = 1 + 1/2^{\Omega(b)}$, thus matching the cache-aware bound, if the following two conditions hold: (a) $b$ is a power of 2; and (b) every block starts at a memory address divisible by $b$. Note that the two conditions hold on a real machine, although they are not stated in the cache-oblivious model. Interestingly, we also show that neither condition is dispensable: if either of them is removed, the best obtainable bound is $t_q = 1 + O(\alpha/b)$, which is exactly what linear probing achieves.
TL;DR: The authors' experiments revealed a few new phenomena of hashing that might be able to provide heuristics to programmers on how to design software products using hash tables.
Abstract: The hash table is a valuable data structure that is expected to provide constant amortized access time. Although there is a lot of research on hashing, there seem to be too few practical studies of its stability with large data sets. In this paper, we conducted a few experiments to study the performance of hashing with a large set of data and compared the results of different collision resolution approaches. Our experiments revealed a few new phenomena. The experimental results favor closed addressing over open addressing by a wide margin and deem linear probing impractical due to its low performance. When items are randomly distributed with keys in a large space, different hash algorithms might produce similar performance. Increasing randomness in keys does not help hash table performance either. These discoveries might be able to provide heuristics to programmers on how to design software products using hash tables.
TL;DR: This work proposes a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficiently processing massive data on distributed memory, and presents a theoretical analysis of the proposed method and describes the design of the hashing implementations.
Abstract: High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environments is the hash table. This paper focuses on investigating efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward "bulk-operation" approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficient processing of massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result: the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.
TL;DR: In this paper, a shared forwarding table is maintained in a forwarding plane using a collision detector and a plurality of entries and precedence information, with the precedence information indicating a priority of entries of the first forwarding table.
Abstract: Exemplary methods for maintaining a shared forwarding table 133 in a forwarding plane include a first network device 103 operating in a forwarding plane 106 receiving information associated with a first forwarding table 108 from a second network device 101 operating in a control plane 105, the information including a plurality of entries and precedence information, the precedence information indicating a priority of the plurality of entries of the first forwarding table 108. The method further includes for each entry of the first forwarding table, determining whether the entry should be inserted in the shared forwarding table based on the precedence information of the first forwarding table and precedence information contained in the shared forwarding table, wherein the precedence information contained in the shared forwarding table indicates a priority of each corresponding entry stored in the shared forwarding table. In an embodiment, for each entry in the received information associated with the first forwarding table, the first network device 103 determines a first candidate location in the shared forwarding table in which the entry may be inserted and determines whether the first candidate location already contains another entry (resulting in a "collision"). In response to determining that there is no collision, the first network device 103 inserts the entry of the first forwarding table in the first candidate location. Alternatively, the first network device 103 determines whether the number of collision resolution attempts has reached a predetermined threshold. The first network device 103 includes a table generator 121 for generating forwarding tables as part of shared forwarding table 133, using forwarding table information 108 received from control plane 105. Table generator 121 includes a collision detector 132. Collision detector 132 determines whether the referenced location already contains another entry. 
Where table generator 121 is configured to generate shared forwarding table 133 as a hash table, collision detector 132 may implement a collision resolution algorithm such as separate chaining, open addressing, coalesced hashing, cuckoo hashing, robin hood hashing, 2-choice hashing, hopscotch hashing.
TL;DR: A method is provided that uses a consistent hashing technique to dispatch incoming packets in a stable system prior to the addition of a node.
Abstract: A method is provided that uses a consistent hashing technique to dispatch incoming packets in a stable system prior to the addition of a node. The method uses a hash table and assigns hash buckets in the table to each network node. A set of fields in each incoming packet is hashed and is used to identify the corresponding hash bucket. The packets are then dispatched to the network nodes based on the nodes' hash buckets. During an observation period, the method identifies the ongoing sessions by creating a bit vector table that is used to identify the old and new sessions during a re-dispatching period. The method uses the consistent hashing method and the probabilistic method to dispatch the incoming packets such that each packet that belongs to an old session is dispatched to the same old node that has been processing the other packets of the session.
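The bucket-table step of this method (hash a set of packet fields, find the bucket, dispatch to the node that owns it) can be sketched as follows. This is an illustration only: the round-robin bucket assignment, the CRC32 hash, the table size, and all function names are assumptions, and the sketch does not model the observation period or the bit vector table used for re-dispatching.

```python
import zlib

NUM_BUCKETS = 16  # hypothetical table size, chosen only for the example


def build_bucket_table(nodes, num_buckets=NUM_BUCKETS):
    """Assign every hash bucket to a node (assumption: simple round-robin)."""
    return [nodes[b % len(nodes)] for b in range(num_buckets)]


def bucket_of(packet_fields, num_buckets=NUM_BUCKETS):
    """Hash a tuple of packet fields (e.g. the 5-tuple) to a bucket index."""
    key = "|".join(str(f) for f in packet_fields).encode()
    return zlib.crc32(key) % num_buckets


def dispatch(packet_fields, table):
    """Return the node that owns the bucket this packet hashes to."""
    return table[bucket_of(packet_fields, len(table))]
```

All packets of a session share the same hashed fields, so they land in the same bucket and therefore reach the same node for as long as the bucket assignment is unchanged.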
TL;DR: It is shown that the performance difference between double hashing and fully random hashing appears negligible in the standard balanced allocation paradigm, where each item is placed in the least loaded of d choices.
Abstract: With double hashing, for an item x, one generates two hash values f(x) and g(x), and then uses combinations (f(x) + i·g(x)) mod n for i = 0, 1, 2, ... to generate multiple hash values from the initial two. We show that the performance difference between double hashing and fully random hashing appears negligible in the standard balanced allocation paradigm, where each item is placed in the least loaded of d choices, as well as several related variants. We perform an empirical study, and consider multiple theoretical approaches. While several techniques can be used to show asymptotic results for the maximum load, we demonstrate how fluid limit methods explain why the behavior of double hashing is essentially indistinguishable from that of fully random hashing in this context.
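As a concrete illustration of the construction above, the following Python sketch derives d bin choices from two base hash values and places each item in the least loaded of them. Deriving f(x) and g(x) from one SHA-256 digest, forcing g(x) into the range 1..n-1, and the function names are assumptions made for the example, not details taken from the paper.

```python
import hashlib


def double_hash_choices(item: str, d: int, n: int):
    """Return the d bins (f(x) + i*g(x)) mod n for i = 0..d-1; requires n >= 2."""
    digest = hashlib.sha256(item.encode()).digest()
    f = int.from_bytes(digest[:8], "big") % n
    # Assumption: g is kept in 1..n-1 so that, for prime n, the d choices are distinct.
    g = 1 + int.from_bytes(digest[8:16], "big") % (n - 1)
    return [(f + i * g) % n for i in range(d)]


def place(item: str, loads, d: int = 3) -> int:
    """Balanced allocation: put the item in the least loaded of its d choices."""
    choices = double_hash_choices(item, d, len(loads))
    best = min(choices, key=lambda b: loads[b])
    loads[best] += 1
    return best
```

Only two hash computations are needed per item regardless of d, which is the practical appeal of double hashing over drawing d independent hash values.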
TL;DR: The Transaction Hashing and Pruning (THP) algorithm overcomes the item set collision problem of the DHP algorithm and the large hash table problem of the PHP algorithm.
Abstract: Hashing & Pruning is a very popular association rule mining technique for improving the performance of the traditional Apriori algorithm. The hashing technique uses a hash function to reduce the size of the candidate item set. Direct Hashing & Pruning (DHP) and Perfect Hashing & Pruning (PHP) are the basic hashing algorithms. Many other algorithms have also been proposed by researchers. All algorithms have their own pros and cons. The DHP algorithm suffers from collisions and requires more database scans to count the frequency of collided item sets. The PHP algorithm eliminates the collision problem, but it increases the size of the hash table, which requires a large amount of memory space, and uses a complex hash function. The main objective of this paper is to reduce the number of collisions and of database scans needed to count the frequency of collided item sets, and to make sure that the size of the hash table does not increase. A new algorithm, Transaction Hashing and Pruning (THP), is proposed in this paper. THP arranges the item sets into vertical format and, after finding out the bucket number of a candidate-k item set, hashes the transaction id (TID) of that candidate item set into that bucket. The THP algorithm overcomes the item set collision problem of the DHP algorithm and the large hash table problem of the PHP algorithm. Experimental results are also shown in the paper.
TL;DR: Experiments show that the spatial hashing contact detection has a significant improvement in performance.
Abstract: A spatial hashing method was introduced to accelerate the contact detection process in the numerical manifold method. All objects (blocks) in the work space are hashed to a one dimensional hash table based on a spatial grid, then only blocks within the same place in the hash table need to do contact detection. The proposed method has a time complexity of O(n). Experiments show that the spatial hashing contact detection has a significant improvement in performance.
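A minimal 2D sketch of the idea follows: each block's bounding box is hashed into a one-dimensional table through a uniform grid, and only blocks sharing a table slot become contact candidates. The bounding-box representation, the cell size, the table size, and the prime-multiplier hash are assumptions chosen for illustration; the paper's actual grid and hash function are not given in the abstract.

```python
from collections import defaultdict
from itertools import combinations


def cell_hash(ix: int, iy: int, table_size: int) -> int:
    """Map a 2D grid cell to a slot of the 1D table (assumed prime-multiplier mix)."""
    return ((ix * 73856093) ^ (iy * 19349663)) % table_size


def candidate_pairs(boxes, cell_size=1.0, table_size=1024):
    """boxes maps a block name to its bounding box (xmin, ymin, xmax, ymax)."""
    table = defaultdict(set)
    for name, (xmin, ymin, xmax, ymax) in boxes.items():
        # Register the block in every grid cell its bounding box overlaps.
        for ix in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
            for iy in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
                table[cell_hash(ix, iy, table_size)].add(name)
    pairs = set()
    for names in table.values():
        # Only blocks that share a slot need a full contact test.
        pairs.update(combinations(sorted(names), 2))
    return pairs
```

The returned pairs are only candidates: distant cells can collide in the table, so an exact geometric contact test is still needed on each pair.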
TL;DR: A robust and secure perceptual 3D model hashing function is developed based on a key-dependent shape feature to exhibit robustness against content-preserved attacks and to enable blind-detection without the use of preprocessing techniques for these types of attacks.
Abstract: With the rapid growth of three-dimensional (3D) content, perceptual 3D model hashing will become a solution for the authentication, reliability, and copy detection of 3D content and will continue to be an important aspect of multimedia security in the future. However, perceptual 3D model hashing has not been used as widely as perceptual image or video hashing. In this study, a robust and secure perceptual 3D model hashing function is developed based on a key-dependent shape feature. The main objectives of our hashing function are to exhibit robustness against content-preserved attacks and to enable blind-detection without the use of preprocessing techniques for these types of attacks. In order to achieve these objectives, our hashing projects all of the vertices to the shape coordinates of the shape spectrum descriptor and the curvedness, and then, it segments the shape coordinates into irregular cells and computes the shape features of the cells using a permutation key and a random key. A perceptual hash is generated by binarizing the shape features. Experimental results confirm that the proposed hashing scheme shows robustness against geometrical and topological attacks and provides a unique and secure hash for each model and key.
TL;DR: The proposed locality-sensitive hashing method is examined for a few synthetic data sets, and results show that it improves performance in cases of large amounts of data with both numerical and categorical attributes.
Abstract: Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor searches and similar pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method makes use of dual hashing functions, where one function is dedicated to numerical attributes and the other to categorical attributes. The method consists of creating indexing structures for each of the dual hashing functions, gathering and combining the candidate sets, and thoroughly examining them to determine the nearest ones. The proposed method is examined for a few synthetic data sets, and results show that it improves performance in cases of large amounts of data with both numerical and categorical attributes.
TL;DR: The proposed chained hashing and Cuckoo hashing methods for modern computers with many CPU cores, which exploit the CPU cache line and hardware-level lock-free operations, outperform the existing methods in most cases and are very scalable in terms of the number of CPU cores.
Abstract: A hash table is a fundamental data structure implementing an associative memory that maps a key to its associated value. Besides, the paradigm of micro-architecture design of CPUs is shifting away from faster uniprocessors toward slower chip multiprocessors. In this paper, we propose enhanced chained hashing and Cuckoo hashing methods for modern computers with many CPU cores, exploiting the CPU cache line and hardware-level lock-free operations. The proposed methods outperform the existing methods in most cases and are very scalable in terms of the number of CPU cores. In addition, their performance does not degrade much even with a high fill factor (e.g., 90%). Through extensive experiments using an Intel 32-core machine, we have shown that our proposed methods improve performance compared with the state-of-the-art versions of the four existing major hashing methods: linear, chained, Cuckoo, and Hopscotch.
TL;DR: It is shown that a simple but apparently unstudied approach for handling deletions with Robin Hood hashing offers good performance even under high loads.
Abstract: Robin Hood hashing is a variation on open addressing hashing designed to reduce the maximum search time as well as the variance in the search time for elements in the hash table. While the case of insertions only using Robin Hood hashing is well understood, the behavior with deletions has remained open. Here we show that Robin Hood hashing can be analyzed under the framework of finite-level finite-dimensional jump Markov chains. This framework allows us to re-derive some past results for the insertion-only case with some new insight, as well as provide a new analysis for a standard deletion model, where we alternate between deleting a random old key and inserting a new one. In particular, we show that a simple but apparently unstudied approach for handling deletions with Robin Hood hashing offers good performance even under high loads.
TL;DR: This paper presents a new and innovative technique for handling collisions in hash tables based on a multi-dimensional array to minimize or completely remove the empty spaces created within the array.
Abstract: This paper presents a new and innovative technique for handling collisions in hash tables based on a multi-dimensional array. The proposed strategy followed the standard ways of evaluating and implementing algorithms to resolve collisions in hash tables. This technique is an effective way of handling the problem of collisions in hash table slots or cells but at a slight expense of space. It was discussed that an optimal representation of this scheme is to minimize or completely remove the empty spaces created within the array.
TL;DR: This work presents a robust hashing algorithm for 3D mesh data that is built to resist desired alterations of the model as well as malicious attacks intending to prevent correct allocation.
Abstract: 3D models and applications are of utmost interest in both science and industry. As their usage grows, so does their number and with it the challenge of correctly identifying them. Content identification is commonly done by cryptographic hashes. However, they fail as a solution in application scenarios such as computer aided design (CAD), scientific visualization or video games, because even the smallest alteration of the 3D model, e.g. conversion or compression operations, massively changes the cryptographic hash as well. Therefore, this work presents a robust hashing algorithm for 3D mesh data. The algorithm applies several different bit extraction methods. They are built to resist desired alterations of the model as well as malicious attacks intending to prevent correct allocation. The different bit extraction methods are tested against each other and, as far as possible, the hashing algorithm is compared to the state of the art. The parameters tested are robustness, security and runtime performance as well as False Acceptance Rate (FAR) and False Rejection Rate (FRR); a calculation of the hash collision probability is also included. The introduced hashing algorithm is kept adaptive, e.g. in hash length, to serve as a proper tool for all applications in practice.
TL;DR: NFO is presented, a new and innovative technique for collision resolution based on single dimensional arrays that incorporates certain features to resolve some problems of existing techniques and its performance benefits are significant.
Abstract: This paper presents NFO, a new and innovative technique for collision resolution based on single dimensional arrays. Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys and should be seen as an event that can disrupt the normal operation of hash functions computing an index into an array of buckets or slots. Hash tables provide efficient table implementations, but their performance is greatly affected when the number of collisions is high. Although some algorithms for handling collisions already exist, this new approach intends to manage these collisions effectively and properly. NFO incorporates certain features to resolve some problems of existing techniques. The performance of our approach is quantified via analytical modeling and software simulations. Efficient implementations that are easily realizable and productive in modern technologies are discussed. The performance benefits are significant and require machines with moderate memory and speed specifications. The proposed approach was implemented on a set of real data of several types, and all results were recorded and analyzed.
TL;DR: This paper proposes a new ranking method named QRank with query-adaptive bitwise weights by exploiting both the discriminative power of each hash function and their complement for nearest neighbor search, which can achieve up to 17.11% performance gains over state-of-the-art methods.
Abstract: Recently hash-based nearest neighbor search has become attractive in many applications due to its compressed storage and fast query speed. However, the quantization in the hashing process usually degenerates its discriminative power when using Hamming distance ranking. To enable fine-grained ranking, hash bit weighting has proved to be a promising solution. Though achieving satisfying performance improvement, state-of-the-art weighting methods usually heavily rely on the projection's distribution assumption, and thus can hardly be directly applied to more general types of hashing algorithms. In this paper, we propose a new ranking method named QRank with query-adaptive bitwise weights by exploiting both the discriminative power of each hash function and their complement for nearest neighbor search. QRank is a general weighting method for all kinds of hashing algorithms without any strict assumptions. Experimental results on two well-known benchmarks MNIST and NUS-WIDE show that the proposed method can achieve up to 17.11% performance gains over state-of-the-art methods.
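To see why bit weighting gives a finer ranking than plain Hamming distance, consider the sketch below. It is not QRank: the weights are hypothetical fixed numbers, whereas the paper derives query-adaptive weights by a procedure the abstract does not spell out, and the function names are made up for the example.

```python
def hamming(a: int, b: int) -> int:
    """Plain Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def weighted_hamming(a: int, b: int, weights) -> float:
    """Sum the weight of every differing bit; weights[i] belongs to bit i (LSB = bit 0)."""
    diff = a ^ b
    return sum(w for i, w in enumerate(weights) if (diff >> i) & 1)


def rank(query: int, database, weights):
    """Order database codes by weighted distance to the query, closest first."""
    return sorted(database, key=lambda code: weighted_hamming(query, code, weights))
```

With a 4-bit query `0b0000`, the codes `0b0001` and `0b1000` are tied at Hamming distance 1, but under the illustrative weights `[0.1, 0.2, 0.3, 0.4]` the first is ranked ahead of the second because its differing bit carries less weight.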
TL;DR: A novel semi-supervised tag hashing (SSTH) approach that fully incorporates tag information into learning an effective hashing function by exploring the correlation between tags and hashing bits, and improves the effectiveness of the hashing function through orthogonal transformation by minimizing the quantization error.
Abstract: Similarity search is an important technique in many large scale vision applications. Hashing approaches have become popular for similarity search due to their computational and memory efficiency. Recently, it has been shown that hashing quality can be improved by incorporating supervised information, e.g. semantic tags/labels, into hashing function learning. However, tag information is not fully exploited in existing unsupervised and supervised hashing methods, especially when only partial tags are available. This paper proposes a novel semi-supervised tag hashing (SSTH) approach that fully incorporates tag information into learning an effective hashing function by exploring the correlation between tags and hashing bits. The hashing function is learned in a unified learning framework by simultaneously ensuring tag consistency and preserving the similarities between image examples. An iterative coordinate descent algorithm is designed as the optimization procedure. Furthermore, we improve the effectiveness of the hashing function through orthogonal transformation by minimizing the quantization error. Extensive experiments on two large scale image datasets demonstrate the superior performance of the proposed approach over several state-of-the-art hashing methods.
TL;DR: This work exploits the collision detection mechanism used by hash maps, unifying the two phases of “seed and extend” into a single operation that executes in close to O(1) average time.
Abstract: We present a fuzzy technique for approximate $k$-mer matching that combines the speed of hashing with the sensitivity of dynamic programming. Our approach exploits the collision detection mechanism used by hash maps, unifying the two phases of “seed and extend” into a single operation that executes in close to $O(1)$ average time.
TL;DR: This paper proposes a novel active hashing approach, Active Hashing with Joint Data Example and Tag Selection (AH-JDETS), which actively selects the most informative data examples and tags in a joint manner for hashing function learning.
Abstract: Similarity search is an important problem in many large scale applications such as image and text retrieval. Hashing methods have become popular for similarity search due to their fast search speed and low storage cost. Recent research has shown that hashing quality can be dramatically improved by incorporating supervised information, e.g. semantic tags/labels, into hashing function learning. However, most existing supervised hashing methods can be regarded as passive methods, which assume that the labeled data are provided in advance. In many real world applications, however, such supervised information may not be available. This paper proposes a novel active hashing approach, Active Hashing with Joint Data Example and Tag Selection (AH-JDETS), which actively selects the most informative data examples and tags in a joint manner for hashing function learning. In particular, it first identifies a set of informative data examples and tags for users to label, based on the selection criterion that both the data examples and tags should be most uncertain and dissimilar to each other. This labeled information is then combined with the unlabeled data to generate an effective hashing function. An iterative procedure is proposed for learning the optimal hashing function and selecting the most informative data examples and tags. Extensive experiments on four different datasets demonstrate that AH-JDETS achieves good performance compared with state-of-the-art supervised hashing methods but requires much less labeling cost, which overcomes the limitation of passive hashing methods. Furthermore, experimental results also indicate that the joint active selection approach outperforms a random (non-active) selection method and active selection methods that focus only on either data examples or tags.
TL;DR: A novel hashing algorithm called Locality Preserving Hashing is proposed, which learns a set of locality preserving projections with a joint optimization framework, which minimizes the average projection distance and quantization loss simultaneously.
Abstract: Hashing has recently attracted considerable attention for large scale similarity search. However, learning compact codes with good performance is still a challenge. In many cases, the real-world data lies on a low-dimensional manifold embedded in high-dimensional ambient space. To capture meaningful neighbors, a compact hashing representation should be able to uncover the intrinsic geometric structure of the manifold, e.g., the neighborhood relationships between subregions. Most existing hashing methods only consider this issue during mapping data points into certain projected dimensions. When getting the binary codes, they either directly quantize the projected values with a threshold, or use an orthogonal matrix to refine the initial projection matrix, which both consider projection and quantization separately, and will not well preserve the locality structure in the whole learning process. In this paper, we propose a novel hashing algorithm called Locality Preserving Hashing to effectively solve the above problems. Specifically, we learn a set of locality preserving projections with a joint optimization framework, which minimizes the average projection distance and quantization loss simultaneously. Experimental comparisons with other state-of-the-art methods on two large scale datasets demonstrate the effectiveness and efficiency of our method.
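The two-stage baseline that this abstract criticizes (project first, then quantize the projected values with a threshold) can be sketched as follows. This is not the Locality Preserving Hashing algorithm itself, whose joint optimization the abstract does not detail. The projection matrix is a hand-picked hypothetical example, and the squared-gap measure used for quantization loss is one common choice, assumed here for illustration.

```python
def project(point, projections):
    """Project a data point onto each projection direction (dot products)."""
    return [sum(p * x for p, x in zip(row, point)) for row in projections]


def quantize(values, threshold=0.0):
    """Turn each projected value into one bit by comparing it to a threshold."""
    return [1 if v > threshold else 0 for v in values]


def quantization_loss(values, bits):
    """Squared gap between projected values and their codes mapped to {-1, +1}.

    Assumed definition for illustration; the abstract does not give a formula.
    """
    return sum((v - (2 * b - 1)) ** 2 for v, b in zip(values, bits))


# Hypothetical projection directions, one row per hash bit.
projections = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, -1.0],
]
point = [0.5, -2.0]

projected = project(point, projections)
bits = quantize(projected)
loss = quantization_loss(projected, bits)
```

For this point the projected values are `[0.5, -2.0, 2.5]`, the code is `[1, 0, 1]`, and the loss is `3.5`. Because the projections are fixed before quantization, nothing in this pipeline reduces that loss; a joint formulation, as the abstract proposes, would instead choose the projections with the quantization error in view.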