TL;DR: In this paper, a deep supervised discrete hashing algorithm is proposed, based on the assumption that the learned binary codes should be ideal for classification. It addresses limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited).
Abstract: With the rapid growth of image and video data on the web, hashing has been extensively studied for image or video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, there are some limitations of previous deep hashing methods (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results have shown that our method outperforms current state-of-the-art methods on benchmark datasets.
TL;DR: This work proposes a novel Deep Asymmetric Pairwise Hashing approach (DAPH) for supervised hashing, and devise an efficient alternating algorithm to optimize the asymmetric deep hash functions and high-quality binary code jointly.
Abstract: Recently, deep neural networks based hashing methods have greatly improved the multimedia retrieval performance by simultaneously learning feature representations and binary hash functions. Inspired by the latest advance in the asymmetric hashing scheme, in this work, we propose a novel Deep Asymmetric Pairwise Hashing approach (DAPH) for supervised hashing. The core idea is that two deep convolutional models are jointly trained such that their output codes for a pair of images can well reveal the similarity indicated by their semantic labels. A pairwise loss is elaborately designed to preserve the pairwise similarities between images as well as incorporating the independence and balance hash code learning criteria. By taking advantage of the flexibility of asymmetric hash functions, we devise an efficient alternating algorithm to optimize the asymmetric deep hash functions and high-quality binary code jointly. Experiments on three image benchmarks show that DAPH achieves the state-of-the-art performance on large-scale image retrieval.
TL;DR: A novel hashing method, Discrete Multi-view Hashing (DMVH), is proposed, which works on multi-view data directly and makes full use of the rich information in multi-view data, together with a novel approach to constructing the similarity matrix that preserves local similarity structure and also keeps semantic similarity between data points.
Abstract: Recently, hashing techniques have witnessed an increase in popularity due to their low storage cost and high query speed for large-scale data retrieval tasks, e.g., image retrieval. Many methods have been proposed; however, most existing hashing techniques focus on single-view data. In many scenarios, there are multiple views in data samples. Thus, methods working on a single view cannot make full use of the rich information contained in multi-view data. Although some methods have been proposed for multi-view data, they usually relax binary constraints or separate the process of learning hash functions and binary codes into two independent stages to bypass the obstacle of handling the discrete constraints on binary codes for optimization, which may generate large quantization error. To address these problems, in this paper, we propose a novel hashing method, i.e., Discrete Multi-view Hashing (DMVH), which can work on multi-view data directly and make full use of the rich information in multi-view data. Moreover, in DMVH, we optimize discrete codes directly instead of relaxing the binary constraints, so that we can obtain high-quality hash codes. Simultaneously, we present a novel approach to construct the similarity matrix, which can not only preserve local similarity structure, but also keep semantic similarity between data points. To solve the optimization problem in DMVH, we further propose an alternating algorithm. We test the proposed model on three large-scale data sets. Experimental results show that it outperforms or is comparable to several state-of-the-art methods.
TL;DR: Zhang et al. as mentioned in this paper proposed a novel deep hashing method, called supervised hierarchical deep hashing (SHDH), to perform hash code learning for hierarchical labeled data by weighting each layer, and design a deep convolutional neural network to obtain a hash code for each data point.
Abstract: Recently, hashing methods have been widely used in large-scale image retrieval. However, most existing hashing methods did not consider the hierarchical relation of labels, which means that they ignored the rich information stored in the hierarchy. Moreover, most of previous works treat each bit in a hash code equally, which does not meet the scenario of hierarchical labeled data. In this paper, we propose a novel deep hashing method, called supervised hierarchical deep hashing (SHDH), to perform hash code learning for hierarchical labeled data. Specifically, we define a novel similarity formula for hierarchical labeled data by weighting each layer, and design a deep convolutional neural network to obtain a hash code for each data point. Extensive experiments on several real-world public datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.
TL;DR: Extensive experiments carried out on two popular tasks including Euclidean and semantic nearest neighbor search demonstrate that the proposed boosted complementary hash-tables method enjoys the strong table complementarity and significantly outperforms the state-of-the-arts.
Abstract: Hashing has been proven a promising technique for fast nearest neighbor search over massive databases. In many practical tasks it usually builds multiple hash tables for a desired level of recall performance. However, existing multi-table hashing methods suffer from the heavy table redundancy, without strong table complementarity and effective hash code learning. To address the problem, this paper proposes a multi-table learning method which pursues a specified number of complementary and informative hash tables from a perspective of ensemble learning. By regarding each hash table as a neighbor prediction model, the multi-table search procedure boils down to a linear assembly of predictions stemming from multiple tables. Therefore, a sequential updating and learning framework is naturally established in a boosting mechanism, theoretically guaranteeing the table complementarity and algorithmic convergence. Furthermore, each boosting round pursues the discriminative hash functions for each table by a discrete optimization in the binary code space. Extensive experiments carried out on two popular tasks including Euclidean and semantic nearest neighbor search demonstrate that the proposed boosted complementary hash-tables method enjoys the strong table complementarity and significantly outperforms the state-of-the-arts.
TL;DR: The results of this study indicate that the search process with double hashing technique allows faster searching than the usual search techniques.
Abstract: Search is used in many activities, both online and offline, and many algorithms can be used to perform it, one of which is hash search. The hash search algorithm used in this study employs the double hashing technique, in which the data are placed into tables of the same length and then searched. The results of this study indicate that searching with the double hashing technique is faster than the usual search techniques. The approach divides the values between a main table and an overflow table, so that searching is expected to be faster than when the data are stacked in a single table, and data collisions can be avoided.
TL;DR: The problem motivation, the challenges, and the key design considerations of multi-probe LSH are revisited, and recent developments in this space and some questions for further research are discussed.
Abstract: The past decade has been marked by the (continued) explosion of diverse data content and the fast development of intelligent data analytics techniques. One problem we identified in the mid-2000s was similarity search of feature-rich data. The challenge here was achieving both high accuracy and high efficiency in high-dimensional spaces. Locality sensitive hashing (LSH), which uses certain random space partitions and hash table lookups to find approximate nearest neighbors, was a promising approach with theoretical guarantees. But LSH alone was insufficient since a large number of hash tables were required to achieve good search quality. Building on an idea of Panigrahy, our multi-probe LSH method introduced the idea of intelligent probing. Given a query object, we strategically probe its neighboring hash buckets (in a query-dependent fashion) by calculating the statistical probabilities of similar objects falling into each bucket. Such intelligent probing can significantly reduce the number of hash tables while achieving high quality. In this paper, we revisit the problem motivation, the challenges, the key design considerations of multi-probe LSH, as well as discuss recent developments in this space and some questions for further research.
TL;DR: The idea behind MinCounter is to alleviate the occurrence of endless loops in the data insertion by selecting unbusy kicking-out routes and improves the concurrency of the MinCounter scheme to pursue higher performance and adapt to concurrent applications.
Abstract: With the rapid growth of the amount of information, cloud computing servers need to process and analyze large amounts of high-dimensional and unstructured data timely and accurately. This usually requires many query operations. Due to simplicity and ease of use, cuckoo hashing schemes have been widely used in real-world cloud-related applications. However, due to the potential hash collisions, the cuckoo hashing suffers from endless loops and high insertion latency, even high risks of re-construction of entire hash table. In order to address these problems, we propose a cost-efficient cuckoo hashing scheme, called MinCounter. The idea behind MinCounter is to alleviate the occurrence of endless loops in the data insertion by selecting unbusy kicking-out routes. MinCounter selects the “cold” (infrequently accessed), rather than random, buckets to handle hash collisions. We further improve the concurrency of the MinCounter scheme to pursue higher performance and adapt to concurrent applications. MinCounter has the salient features of offering efficient insertion and query services and delivering high performance of cloud servers, as well as enhancing the experiences for cloud users. We have implemented MinCounter in a large-scale cloud testbed and examined the performance by using three real-world traces. Extensive experimental results demonstrate the efficacy and efficiency of MinCounter.
TL;DR: A new supervised deep hashing method to deal with large-scale instance-level vehicle search, which outperforms single task deep hashing methods with classification and triplet ranking losses, respectively.
Abstract: Hashing is a hot research topic in large-scale image search, due to its low memory cost and fast search speed. Recently, deep hashing, which adapts deep convolutional neural networks into hashing, has attracted much attention. In this paper, we propose a new supervised deep hashing method to deal with large-scale instance-level vehicle search, and make the following contributions. Firstly, multi-task learning is employed to learn the hash code, which exploits the available multiple labels of each vehicle, i.e., ID, model, and color. Secondly, differing from several deep hashing methods, which utilize sigmoid or tanh as the activation function of the hash layer, rectified linear unit is adopted in this paper and shows better performance. Thirdly, taking GoogLeNet as the base network, we show that search performance can be promoted significantly, by learning the network's parameters from scratch on our vehicle data. Finally, we perform extensive experiments on a large-scale dataset with up to one million vehicles. The experimental results demonstrate the effectiveness of the proposed method, which outperforms single task deep hashing methods with classification and triplet ranking losses, respectively.
TL;DR: In this article, a comparison of hash functions for similarity estimation with one permutation hashing (OPH) and feature hashing (FH) has been made, showing that mixed tabulation is almost as fast as the multiply-mod-prime scheme ax+b mod p.
Abstract: Hashing is a basic tool for dimensionality reduction employed in several aspects of machine learning. However, the performance analysis is often carried out under the abstract assumption that a truly random unit cost hash function is used, without concern for which concrete hash function is employed. The concrete hash function may work fine on sufficiently random input. The question is whether it can be trusted in the real world when faced with more structured input.
In this paper we focus on two prominent applications of hashing, namely similarity estimation with the one permutation hashing (OPH) scheme of Li et al. [NIPS'12] and feature hashing (FH) of Weinberger et al. [ICML'09], both of which have found numerous applications, i.e. in approximate near-neighbour search with LSH and large-scale classification with SVM.
We consider mixed tabulation hashing of Dahlgaard et al. [FOCS'15], which was proved to perform like a truly random hash function in many applications, including OPH. Here we first show improved concentration bounds for FH with truly random hashing and then argue that mixed tabulation performs similarly for sparse input. Our main contribution, however, is an experimental comparison of different hashing schemes when used inside FH, OPH, and LSH.
We find that mixed tabulation hashing is almost as fast as the multiply-mod-prime scheme ax+b mod p. Multiply-mod-prime is guaranteed to work well on sufficiently random data, but we demonstrate that in the above applications, it can lead to bias and poor concentration on both real-world and synthetic data. We also compare with the popular MurmurHash3, which has no proven guarantees. Mixed tabulation and MurmurHash3 both perform similarly to truly random hashing in our experiments. However, mixed tabulation is 40% faster than MurmurHash3, and it has the proven guarantee of good performance on all possible input.
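For readers unfamiliar with the multiply-mod-prime scheme ax+b mod p mentioned above, the following is a minimal sketch. The function name, the choice of the Mersenne prime 2^61 - 1 for p, the final reduction modulo a bucket count, and the seeding are all assumptions made for illustration and are not taken from the paper.

```python
import random

# Assumption: p = 2^61 - 1 (a Mersenne prime), comfortably larger than 32-bit keys.
MERSENNE_P = (1 << 61) - 1


def make_multiply_mod_prime(num_buckets, seed=0, p=MERSENNE_P):
    """Return h(x) = ((a*x + b) mod p) mod num_buckets with random a, b."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)  # multiplier, nonzero
    b = rng.randrange(0, p)  # additive offset

    def h(x):
        # Multiply-mod-prime step followed by reduction to the bucket range.
        return ((a * x + b) % p) % num_buckets

    return h
```

A call such as `h = make_multiply_mod_prime(1024)` yields a function mapping integer keys to bucket indices in `range(1024)`.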
TL;DR: Recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees are surveyed, including twisted tabulation, which yields an extremely fast pseudorandom number generator that is provably good for many classic randomized algorithms and data-structures.
Abstract: Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here, we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist (A new hashing method with application for game playing. Technical Report 88, Computer Sciences Department, University of Wisconsin). Keys are viewed as consisting of c characters and we have precomputed character tables h1, ..., hc mapping characters to random hash values. A key x = (x1, ..., xc) is hashed to h1[x1] ⊕ h2[x2] ⊕ ... ⊕ hc[xc]. This scheme is very fast with character tables in cache. Although simple tabulation is not even four-independent, it does provide many of the guarantees that are normally obtained via higher independence, for example, linear probing and Cuckoo hashing. Next, we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-style tail bounds and a very small bias for minwise hashing. This also yields an extremely fast pseudorandom number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Wegman and Carter. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. Although these tabulation schemes are all easy to implement and use, their analysis is not.
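The simple tabulation formula in the abstract above can be sketched as follows. The parameter values (c = 4 characters of 8 bits each, 32-bit hash values), the function name, and the use of Python's `random` module to fill the tables are assumptions for illustration only. Twisted and double tabulation are not shown.

```python
import random


def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Build a simple tabulation hash for keys made of c characters."""
    rng = random.Random(seed)
    # One precomputed table per character position, mapping each possible
    # character to a random hash value.
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        result = 0
        for i in range(c):
            char = (x >> (i * char_bits)) & mask  # i-th character of the key
            result ^= tables[i][char]             # XOR of per-character lookups
        return result

    return h, tables
```

With these defaults a 32-bit key is split into four bytes, and evaluating the hash costs four table lookups and three effective XORs, which is why the scheme is fast when the tables fit in cache.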
TL;DR: This paper proposes a novel hashing framework, which simultaneously optimizes similarity preserving hash codes and reconstructs the locally linear structures of data in the Hamming space, and significantly outperforms recent state-of-the-art hashing methods on large-scale image retrieval problems.
Abstract: Learning-based hashing has become increasingly popular because of its high efficiency in handling large-scale image retrieval. Preserving the pairwise similarities of data points in the Hamming space is critical in state-of-the-art hashing techniques. However, most previous methods fail to capture the local geometric structure residing in the original data, which is essential for similarity search. In this paper, we propose a novel hashing framework, which simultaneously optimizes similarity preserving hash codes and reconstructs the locally linear structures of data in the Hamming space. Specifically, we learn two hash functions such that the resulting two sets of binary codes can well preserve the pairwise similarity and sparse neighborhood in the original feature space. By taking advantage of the flexibility of asymmetric hash functions, we devise an efficient alternating algorithm to optimize the hash coding function and high-quality binary codes jointly. We evaluate the proposed method on several large-scale image datasets, and the results demonstrate it significantly outperforms recent state-of-the-art hashing methods on large-scale image retrieval problems.
TL;DR: A different approach to the problem has been proposed here which prevents the filling up of the hash table by using a Binary Search Tree instead of a table to store the hash values and adds new nodes to this tree only when a new hash value is generated while entering data.
Abstract: This paper aims to reduce the time taken to search for data within a hash table by making use of a tree structure. The two most popular methods for collision avoidance within a hash table, Open Addressing and Separate Chaining, each have their merits and demerits. Here we attempt to overcome the main demerit of each of these techniques: filling up of the hash table in Open Addressing and poor search time in Separate Chaining. A different approach to the problem has been proposed here which prevents the filling up of the hash table by using a Binary Search Tree instead of a table to store the hash values and adds new nodes to this tree only when a new hash value is generated while entering data. Also, in place of using a linked list to store the data within a hash bucket, we use another Binary Search Tree to reduce the search time within each hash bucket. Using this approach, we attempt to reduce both the amount of space and time that will be used, as there would be no empty buckets for any hash values and we do not have to worry about filling up the hash table, as we will be chaining the data under each bucket. We also try to take care of the main drawback of Separate Chaining, i.e., longer search time within a hash bucket, by using a Binary Search Tree instead of a linked list, thus reducing the worst case search time from O(n) to O(log n).
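The two-level tree structure described above might look roughly like the sketch below: an outer Binary Search Tree keyed by hash value, where each node holds an inner Binary Search Tree keyed by the stored key. All names here (`TreeHashMap`, `bst_insert`, `bst_find`, `num_hash_values`) are hypothetical, and the trees are left unbalanced for brevity. The O(log n) worst case stated in the abstract would require a self-balancing tree, which this sketch does not implement.

```python
class _Node:
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None


def bst_insert(root, key, value):
    """Insert or overwrite key in an unbalanced BST; return the root."""
    if root is None:
        return _Node(key, value)
    node = root
    while True:
        if key == node.key:
            node.value = value
            return root
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, _Node(key, value))
            return root
        node = child


def bst_find(root, key):
    """Return the node holding key, or None if it is absent."""
    node = root
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None


class TreeHashMap:
    """Outer BST keyed by hash value; each node's value is an inner BST root."""

    def __init__(self, num_hash_values=64):  # assumed range of hash values
        self.num_hash_values = num_hash_values
        self.outer = None

    def put(self, key, value):
        hv = hash(key) % self.num_hash_values
        bucket = bst_find(self.outer, hv)
        if bucket is None:
            # A node is created only when a new hash value first appears.
            self.outer = bst_insert(self.outer, hv, bst_insert(None, key, value))
        else:
            bucket.value = bst_insert(bucket.value, key, value)

    def get(self, key, default=None):
        bucket = bst_find(self.outer, hash(key) % self.num_hash_values)
        if bucket is None:
            return default
        hit = bst_find(bucket.value, key)
        return default if hit is None else hit.value
```

Keys stored in the same map must be mutually comparable with `<`, since the inner trees order them directly.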
TL;DR: To improve query efficiency, this paper proposes an accelerated strategy of Manhattan hashing that makes full use of bitwise operations: a new encoding method which assigns location information to each binary digit to avoid the time-consuming decimal arithmetic, and a novel hash code distance measurement that accelerates the calculation of Manhattan distance.
Abstract: Hashing is a binary-code encoding method which tries to preserve the neighborhood structures in the original feature space, in order to realize efficient approximate nearest neighbor search in large-scale databases. Existing hashing methods usually adopt a two-stage strategy (projection stage and quantization stage) to encode data points, and threshold-based single-bit quantization (SBQ) is used to binarize each projected dimension into 0 or 1. Data similarity between hash codes is measured by their Hamming distance. However, SBQ may destroy the original neighborhood structures by quantizing neighboring points near threshold into different binary values. Double-bit quantization (DBQ) and its derivative, Manhattan hashing, have been proposed to fix this problem. Experimental results showed that Manhattan hashing outperformed state-of-the-art methods in terms of effectiveness, but lost the advantage of efficiency because it used decimal arithmetic instead of fast bitwise operations for similarity measurement between hash codes. In this paper, we propose an accelerated strategy of Manhattan hashing by making full use of bitwise operations. Our main contributions are: 1) a new encoding method which assigns location information to each binary digit is proposed to avoid the time-consuming decimal arithmetic; 2) a novel hash code distance measurement that accelerates the calculation of Manhattan distance is proposed to improve query efficiency. Extensive experiments on three benchmark datasets show that our approach improves the speed of data querying on 2-bit, 3-bit and 4-bit quantized hash codes by at least one order of magnitude on average, without any precision loss.
TL;DR: A new scheme that mitigates collisions by utilizing empty slots of the hash table by inserting a new element that may collide to an empty slot instead of linking it with a pointer is proposed.
Abstract: In SDN and NFV technologies, the performance of a virtual switch is important for providing network functionalities swiftly. Since the lookup operation of the virtual switch has been considered a major performance bottleneck, we need to devise an efficient way to reduce the lookup time. To improve the lookup speed, previous research has suggested a compact lookup table in a fast memory, but the issue of collision in a hash table has not been addressed well enough. This paper proposes a new scheme that mitigates collisions by utilizing empty slots of the hash table. We propose to insert a new element that may collide into an empty slot instead of linking it with a pointer. According to our experimental evaluation, about 20% of elements that could have experienced collisions in the conventional scheme are inserted into empty slots in each bucket. Moreover, the access time of collided elements is halved by avoiding unnecessary memory accesses.
TL;DR: In this article, a hashing system can use a set of multiple numbers that are co-prime to the size of a hash table to select a probe offset when collisions occur, which ensures that each hash table slot is available for any insert operation.
Abstract: A hashing system can use a set of multiple numbers that are co-prime to the size of a hash table to select a probe offset when collisions occur. Selecting a probe offset that is co-prime to the hash table size ensures that each hash table slot is available for any insert operation. Utilizing different co-prime numbers for different keys helps avoid clustering of items inserted into the hash table. When a collision occurs, the hashing system can compute a next index to check by selecting a probe offset that is located at a computed index on a list of numbers that are each co-prime to the number of slots in the hash table. The hashing system can compute the index into the list of numbers by applying a hash function to the data item and calculating a modulus of the result with respect to a count of the co-prime numbers list.
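The probing rule described above can be sketched as follows. The class and helper names, the use of `zlib.crc32` as the hash function, the salting used to derive a separate hash for the offset choice, and the table size of 16 are all assumptions for illustration. The abstract does not specify any of them.

```python
from math import gcd
import zlib


def coprime_offsets(table_size, count):
    """Collect up to `count` integers in [1, table_size) co-prime to table_size."""
    offsets = []
    candidate = 1
    while len(offsets) < count and candidate < table_size:
        if gcd(candidate, table_size) == 1:
            offsets.append(candidate)
        candidate += 1
    return offsets


class CoprimeProbeTable:
    """Open-addressing table whose per-key probe step is co-prime to its size."""

    def __init__(self, size=16, num_offsets=4):  # assumes size >= 2
        self.size = size
        self.slots = [None] * size
        self.offsets = coprime_offsets(size, num_offsets)

    def _hash(self, key, salt):
        # Hypothetical hash: CRC32 over a salt plus the key's repr.
        return zlib.crc32(salt + repr(key).encode())

    def _probe_sequence(self, key):
        start = self._hash(key, b"slot") % self.size
        # Index into the co-prime list: hash of the item modulo the list length.
        step = self.offsets[self._hash(key, b"step") % len(self.offsets)]
        for i in range(self.size):
            yield (start + i * step) % self.size

    def insert(self, key, value):
        for idx in self._probe_sequence(key):
            if self.slots[idx] is None or self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return idx
        raise RuntimeError("table is full")

    def get(self, key):
        for idx in self._probe_sequence(key):
            entry = self.slots[idx]
            if entry is None:
                return None
            if entry[0] == key:
                return entry[1]
        return None
```

Because each step is co-prime to the table size, every probe sequence visits all slots exactly once before repeating, so an insert fails only when the table is genuinely full. Deletion is omitted from this sketch.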
TL;DR: The Bootstrap DCH (BDCH) is proposed to relieve the problem of correctly learned images being ignored in new table training, which leads to a lack of training samples after a few iterations, and experimental results show that the BDCH outperforms the DCH and several hashing methods.
Abstract: Image retrieval is one of the important applications in big data environments. Hashing methods have been widely used to deal with large-scale image retrieval problems because of their sublinear time complexities. Hashing methods with multiple hash tables achieve high precision and recall rates using fewer hash bucket visits in comparison to single-table-based hashing. The Dual Complementary Hashing (DCH) is one of the state-of-the-art multi-hashing methods, which compensates for errors made by both previous tables and bits when constructing a new one in the training process. However, after a few hash tables are created, the DCH fails to improve performance by creating new hash tables. This is because correctly learned images are ignored in new table training, which leads to a lack of training samples after a few iterations. In this paper, the Bootstrap DCH (BDCH) is proposed to relieve this problem. Experimental results on three databases show that the BDCH outperforms the DCH and several hashing methods.
TL;DR: This work proves an improved lower bound for the quantum query complexity of collision-finding in hash functions whose images are distributed according to a nonuniform distribution using a chain of reductions which convert collisions in min-entropy k distributions into collisions in the uniform distribution with constant probability.
Abstract: This work proves an improved lower bound for the quantum query complexity of collision-finding in hash functions whose images are distributed according to a nonuniform distribution. Recent work by [TTU16] applied the leftover hash lemma to show that at least Ω(2^{k/9}) quantum queries are necessary to find a collision when the image distribution has min-entropy k. In comparison, a result by [Zha15] implies that Ω(2^{k/3}) quantum queries are necessary if the distribution is uniform. In this paper the lower bound complexity of [Zha15] is extended directly to the non-uniform case using a chain of reductions which convert collisions in min-entropy k distributions into collisions in the uniform distribution with constant probability. This result shows a minimum security guarantee for hash functions under more general assumptions in which the image distribution may not be uniform.
TL;DR: This paper introduces a novel supervised cross-modality hashing framework, which can generate unified binary codes for instances represented in different modalities and significantly outperforms the state-of-the-art multimodality hashing techniques.
Abstract: With the dramatic development of the Internet, how to exploit large-scale retrieval techniques for multimodal web data has become one of the most popular but challenging problems in computer vision and multimedia. Recently, hashing methods are used for fast nearest neighbor search in large-scale data spaces, by embedding high-dimensional feature descriptors into a similarity preserving Hamming space with a low dimension. Inspired by this, in this paper, we introduce a novel supervised cross-modality hashing framework, which can generate unified binary codes for instances represented in different modalities. Particularly, in the learning phase, each bit of a code can be sequentially learned with a discrete optimization scheme that jointly minimizes its empirical loss based on a boosting strategy. In a bitwise manner, hash functions are then learned for each modality, mapping the corresponding representations into unified hash codes. We regard this approach as cross-modality sequential discrete hashing (CSDH), which can effectively reduce the quantization errors arisen in the oversimplified rounding-off step and thus lead to high-quality binary codes. In the test phase, a simple fusion scheme is utilized to generate a unified hash code for final retrieval by merging the predicted hashing results of an unseen instance from different modalities. The proposed CSDH has been systematically evaluated on three standard data sets: Wiki, MIRFlickr, and NUS-WIDE, and the results show that our method significantly outperforms the state-of-the-art multimodality hashing techniques.
TL;DR: OSH: an Online Supervised Hashing technique that is based on Error Correcting Output Codes is proposed, which considers a stochastic setting where the data arrives sequentially and the method learns and adapts its hashing functions in a discriminative manner and yields state-of-the-art retrieval performance.
TL;DR: This paper poses an optimal hash bit selection problem, in which an optimal subset of hash bits is selected from a pool of candidate bits generated by different features, algorithms, or parameters, and adopts bit reliability and complementarity as the selection criteria, which can be carefully tailored for hashing performance in different tasks.
Abstract: To overcome the barrier of storage and computation when dealing with gigantic-scale data sets, compact hashing has been studied extensively to approximate the nearest neighbor search. Despite the recent advances, critical design issues remain open in how to select the right features, hashing algorithms, and/or parameter settings. In this paper, we address these by posing an optimal hash bit selection problem, in which an optimal subset of hash bits are selected from a pool of candidate bits generated by different features, algorithms, or parameters. Inspired by the optimization criteria used in existing hashing algorithms, we adopt the bit reliability and their complementarity as the selection criteria that can be carefully tailored for hashing performance in different tasks. Then, the bit selection solution is discovered by finding the best tradeoff between search accuracy and time using a modified dynamic programming method. To further reduce the computational complexity, we employ the pairwise relationship among hash bits to approximate the high-order independence property, and formulate it as an efficient quadratic programming method that is theoretically equivalent to the normalized dominant set problem in a vertex- and edge-weighted graph. Extensive large-scale experiments have been conducted under several important application scenarios of hash techniques, where our bit selection framework can achieve superior performance over both the naive selection methods and the state-of-the-art hashing algorithms, with significant accuracy gains ranging from 10% to 50%, relatively.
TL;DR: This paper designs, develops and evaluates REX, a resilient and efficient data structure for tracking of network flows that not only rejects the least number of packets, but also significantly reduces the total time taken for the important hash table operations.