Journal Article, DOI: 10.1109/JSTSP.2016.2555240
A Provably Efficient Algorithm for Separable Topic Discovery
4
TL;DR: In this article, the authors develop necessary and sufficient conditions and a novel provably consistent and efficient algorithm for discovering topics from observations (documents) that are realized from a probabilistic mixture of shared latent factors that have certain properties.
Abstract: We develop necessary and sufficient conditions and a novel provably consistent and efficient algorithm for discovering topics (latent factors) from observations (documents) that are realized from a probabilistic mixture of shared latent factors that have certain properties. Our focus is on the class of topic models in which each shared latent factor contains a novel word that is unique to that factor, a property that has come to be known as separability. Our algorithm is based on the key insight that the novel words correspond to the extreme points of the convex hull formed by the row-vectors of a suitably normalized word co-occurrence matrix. We leverage this geometric insight to establish polynomial computational and sample complexity bounds based on a few isotropic random projections of the rows of the normalized word co-occurrence matrix. Our proposed random-projections-based algorithm is naturally amenable to an efficient distributed implementation and is attractive for modern web-scale distributed data mining applications.
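The geometric idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a toy demonstration that, when rows of a matrix are projected onto isotropic random directions, the row with the largest projection is always an extreme point of the rows' convex hull. The row-sum normalization, the function names `row_normalize` and `extreme_point_votes`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def row_normalize(counts):
    """Scale every row to sum to 1; all-zero rows stay zero.

    Assumption: plain row-sum normalization stands in for the paper's
    "suitable" normalization, which the abstract does not specify.
    """
    counts = np.asarray(counts, dtype=float)
    sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, sums, out=np.zeros_like(counts), where=sums > 0)


def extreme_point_votes(rows, num_projections=200, seed=0):
    """Count how often each row maximizes a random linear projection.

    A linear function over a convex hull attains its maximum at an extreme
    point, so only extreme rows can collect votes (ties have probability 0).
    """
    rng = np.random.default_rng(seed)
    # One isotropic Gaussian direction per column.
    directions = rng.standard_normal((rows.shape[1], num_projections))
    winners = (rows @ directions).argmax(axis=0)
    return np.bincount(winners, minlength=rows.shape[0])


# Toy data (hypothetical): rows 0-2 play the role of "novel word" rows and
# are the hull's vertices; the other 20 rows are strict mixtures of them.
rng = np.random.default_rng(1)
vertices = np.eye(3)
weights = rng.dirichlet(np.ones(3), size=20)
toy_rows = row_normalize(np.vstack([vertices, weights @ vertices]))

votes = extreme_point_votes(toy_rows)
candidates = np.flatnonzero(votes)  # indices of rows that ever won a projection
print(candidates)
```

Each projection is independent of the others, which is why a scheme of this kind splits naturally across workers: every worker can evaluate its own batch of random directions and the vote counts can be summed afterwards.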
Citations
•Journal Article
A rate of convergence for mixture proportion estimation, with application to learning from noisy labels
TL;DR: This work establishes a rate of convergence for mixture proportion estimation (MPE) under an appropriate distributional assumption, and argues that this rate of convergence is useful for analyzing weakly supervised learning algorithms that build on MPE.
79
•Journal Article
Decontamination of mutually contaminated models
Gilles Blanchard,Clayton Scott +1 more
TL;DR: A procedure for decontamination of the contaminated models from data is developed, which then facilitates the design of a consistent discrimination rule and relies on a novel method for estimating the error when projecting one distribution onto a convex combination of others, where the projection is with respect to a statistical distance known as the separation distance.
14
A Geometric Approach to Archetypal Analysis and Nonnegative Matrix Factorization.
Anil Damle,Yuekai Sun +1 more
TL;DR: A geometric approach to both NMF and archetypal analysis is described that interprets both problems as finding extreme points of the data cloud, and an efficient approach to finding extreme points in high dimensions is developed and analyzed.
•Dissertation
Learning mixed membership models with a separable latent structure: Theory, provably efficient algorithms, and applications
Weicong Ding
- 01 Jan 2015
TL;DR: In a wide spectrum of problems in science and engineering that includes hyperspectral imaging, gene expression analysis, and machine learning tasks such as topic modeling, the observed data is high-dimensional and can be modeled as arising from a data-specific probabilistic mixture of a small collection of latent factors.
3
References
Latent dirichlet allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Learning the parts of objects by non-negative matrix factorization
TL;DR: An algorithm for non-negative matrix factorization is demonstrated that is able to learn parts of faces and semantic features of text and is in contrast to other methods that learn holistic, not parts-based, representations.
14.2K
Finding scientific topics
TL;DR: A generative model for documents, introduced by Blei, Ng, and Jordan, is described, and a Markov chain Monte Carlo algorithm is presented for inference in this model. The algorithm is used to analyze abstracts from PNAS, with Bayesian model selection establishing the number of topics.
Probabilistic topic models
TL;DR: A survey of a suite of algorithms for managing large document archives, which suggests that they are well suited to handling large amounts of data.
5.6K