Topic

MinHash

About: MinHash is a research topic. Over its lifetime, 201 publications have been published within this topic, receiving 9,601 citations.

Papers

Proceedings Article•10.1145/509907.509965•

Similarity estimation techniques from rounding algorithms

Moses Charikar¹•Institutions (1)

Princeton University¹

19 May 2002

TL;DR: It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects.

Abstract: A locality sensitive hashing scheme is a distribution on a family $\mathcal{F}$ of hash functions operating on a collection of objects, such that for two objects $x, y$, $\Pr_{h \in \mathcal{F}}[h(x) = h(y)] = \mathrm{sim}(x, y)$, where $\mathrm{sim}(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure $\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$. We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for: (i) A collection of vectors with the distance between $\vec{u}$ and $\vec{v}$ measured by $\theta(\vec{u}, \vec{v})/\pi$, where $\theta(\vec{u}, \vec{v})$ is the angle between $\vec{u}$ and $\vec{v}$. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity. (ii) A collection of distributions on $n$ points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions $P$ and $Q$, $\mathrm{EMD}(P, Q) \le \mathbb{E}_{h \in \mathcal{F}}[d(h(P), h(Q))] \le O(\log n \log\log n) \cdot \mathrm{EMD}(P, Q)$.
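The two sketching ideas named in this abstract can be illustrated with a minimal Python sketch. It is not the paper's construction: the min-wise permutations are approximated here by seeded 64-bit hashes, and every function name and parameter (`minhash_signature`, `hyperplane_sketch`, `num_hashes`, `num_bits`) is a hypothetical choice made for this example. The first pair of functions estimates set similarity from the fraction of matching minima. The second pair estimates the angle between two vectors from the fraction of random hyperplanes that separate them, relying on the collision probability being one minus the angle divided by pi.

```python
import hashlib
import math
import random


def _hash64(seed: int, item: str) -> int:
    """Seeded 64-bit hash; a stand-in for one random permutation (an assumption)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """Keep the minimum hash value of a non-empty set under each seeded hash."""
    items = list(items)
    return [min(_hash64(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where the minima agree estimates |A n B| / |A u B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def hyperplane_sketch(vec, num_bits=256, seed=0):
    """One bit per random Gaussian direction r: is the dot product r . vec >= 0?"""
    rng = random.Random(seed)  # same seed and dimension give the same hyperplanes
    bits = []
    for _ in range(num_bits):
        r = [rng.gauss(0.0, 1.0) for _ in vec]
        bits.append(sum(ri * vi for ri, vi in zip(r, vec)) >= 0.0)
    return bits


def estimate_angle(bits_a, bits_b):
    """The fraction of disagreeing bits estimates angle / pi, so scale by pi."""
    disagree = sum(a != b for a, b in zip(bits_a, bits_b))
    return math.pi * disagree / len(bits_a)


if __name__ == "__main__":
    a = {str(i) for i in range(100)}
    b = {str(i) for i in range(50, 150)}  # true Jaccard similarity is 50/150
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
    u, v = [1.0, 0.0], [0.0, 1.0]  # true angle is pi/2
    print(estimate_angle(hyperplane_sketch(u), hyperplane_sketch(v)))
```

Both estimators are averages of independent 0/1 outcomes, so their error shrinks roughly with the square root of the sketch length. The cosine similarity can be recovered from the estimated angle with `math.cos`.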

2,941 citations

Journal Article•10.1186/S13059-016-0997-X•

Mash: fast genome and metagenome distance estimation using MinHash.

Brian D. Ondov, Todd J. Treangen, Páll Melsted¹, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren², Adam M. Phillippy²•Institutions (2)

University of Iceland¹, National Institutes of Health²

20 Jun 2016-Genome Biology

TL;DR: Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections.

Abstract: Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license ( https://github.com/marbl/mash ).
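As a rough illustration of how a mutation distance can be read off a sketch, the following Python sketch combines a bottom-s sketch of k-mer hashes with the distance formula D = -(1/k) ln(2j / (1 + j)) given in the Mash paper, where j is the estimated Jaccard index. It is not the Mash implementation: the helper names (`kmers`, `bottom_sketch`, `sketch_jaccard`, `mash_distance`) are hypothetical, the hash is an arbitrary 64-bit choice, reverse-complement (canonical) k-mers are ignored, and returning 1.0 when no hashes are shared is an assumption made here.

```python
import hashlib
import math


def kmers(seq: str, k: int) -> set:
    """All overlapping substrings of length k (no canonical k-mers; a simplification)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest(), "big")


def bottom_sketch(items, s: int) -> list:
    """Bottom-s sketch: the s smallest distinct hash values of the set."""
    return sorted({_hash64(x) for x in items})[:s]


def sketch_jaccard(sketch_a, sketch_b, s: int) -> float:
    """Among the s smallest hashes of the merged sketches, the fraction found in both."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        raise ValueError("both sketches are empty")
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in merged) / len(merged)


def mash_distance(jaccard: float, k: int = 21) -> float:
    """D = -(1/k) * ln(2j / (1 + j)); capped at 1.0 when j == 0 (an assumption)."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    if jaccard == 0.0:
        return 1.0
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


if __name__ == "__main__":
    k, s = 5, 10  # toy values; real genomes need a much larger k and sketch size
    a = bottom_sketch(kmers("ACGTTGCAAGGCTTAACCGGTTAACGT", k), s)
    b = bottom_sketch(kmers("ACGTTGCAAGGCTTAACCGGTTTACGT", k), s)
    print(mash_distance(sketch_jaccard(a, b, s), k))
```

Because only the s smallest hashes per sequence are stored, comparing two genomes costs time proportional to the sketch size rather than to the genome length.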

2,864 citations

Proceedings Article•10.1145/1242572.1242610•

Google news personalization: scalable online collaborative filtering

Abhinandan S. Das¹, Mayur Datar¹, Ashutosh Garg¹, Shyam Rajaram²•Institutions (2)

Google¹, University of Illinois at Urbana–Champaign²

8 May 2007

TL;DR: This paper describes the approach to collaborative filtering for generating personalized recommendations for users of Google News using MinHash clustering, Probabilistic Latent Semantic Indexing, and covisitation counts, and combines recommendations from different algorithms using a linear model.

Abstract: Several approaches to collaborative filtering have been studied but seldom have studies been reported for large (several million users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.
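One way to picture MinHash clustering of users is the minimal Python sketch below. It is an illustration under stated assumptions, not the system described in the paper: each user is represented by the set of items they clicked, and users whose click sets produce the same tuple of min-hash values are placed in the same cluster. The function names and the `num_keys` parameter are hypothetical, and the single-machine dictionary stands in for whatever distributed setup a production system would use.

```python
import hashlib
from collections import defaultdict


def _hash64(seed: int, item: str) -> int:
    """Seeded 64-bit hash; a stand-in for one random permutation of the item set."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def cluster_key(clicked_items, num_keys: int) -> tuple:
    """Concatenate num_keys min-hash values; more keys give smaller, purer clusters."""
    return tuple(min(_hash64(seed, item) for item in clicked_items)
                 for seed in range(num_keys))


def minhash_clusters(histories: dict, num_keys: int = 2) -> dict:
    """Group users by cluster key; users with an empty history are skipped."""
    clusters = defaultdict(list)
    for user, items in histories.items():
        if items:
            clusters[cluster_key(items, num_keys)].append(user)
    return dict(clusters)


if __name__ == "__main__":
    histories = {
        "u1": {"story-a", "story-b"},
        "u2": {"story-b", "story-a"},
        "u3": {"story-x", "story-y"},
    }
    for members in minhash_clusters(histories).values():
        print(sorted(members))
```

Two users share a single min-hash value with probability equal to the Jaccard similarity of their click sets, so requiring several values to match trades recall for precision.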

1,917 citations

Journal Article•10.1016/S0169-7552(97)00031-7•

Syntactic clustering of the Web

Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig¹•Institutions (1)

University of California, Berkeley¹

1 Sep 1997

TL;DR: An efficient way to determine the syntactic similarity of files is developed and applied to every document on the World Wide Web, and a clustering of all the documents that are syntactically similar is built.

Abstract: We have developed an efficient way to determine the syntactic similarity of files and have applied it to every document on the World Wide Web. Using this mechanism, we built a clustering of all the documents that are syntactically similar. Possible applications include a "Lost and Found" service, filtering the results of Web searches, updating widely distributed web-pages, and identifying violations of intellectual property rights.

1,672 citations

Proceedings Article•10.1145/1282280.1282359•

Scalable near identical image and shot detection

Ondřej Chum¹, James Philbin¹, Michael Isard, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

9 Jul 2007

TL;DR: Two novel schemes for near duplicate image and video-shot detection based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval and local feature descriptors, are proposed and compared.

Abstract: This paper proposes and compares two novel schemes for near duplicate image and video-shot detection. The first approach is based on global hierarchical colour histograms, using Locality Sensitive Hashing for fast retrieval. The second approach uses local feature descriptors (SIFT) and for retrieval exploits techniques used in the information retrieval community to compute approximate set intersections between documents using a min-Hash algorithm. The requirements for near-duplicate images vary according to the application, and we address two types of near duplicate definition: (i) being perceptually identical (e.g. up to noise, discretization effects, small photometric distortions etc.); and (ii) being images of the same 3D scene (so allowing for viewpoint changes and partial occlusion). We define two shots to be near-duplicates if they share a large percentage of near-duplicate frames. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. Both methods are designed so that only a small amount of data need be stored for each image. In the case of near-duplicate shot detection it is shown that a weak approximation to histogram matching, consuming substantially less storage, is sufficient for good results. We demonstrate our methods on the TRECVID 2006 data set which contains approximately 165 hours of video (about 17.8M frames with 146K key frames), and also on feature films and pop videos.

591 citations

Performance Metrics

Papers: 201
Citations: 697

No. of papers in the topic in previous years
Year	Papers
2021	29
2020	22
2019	27
2018	34
2017	19
2016	17
