Statistical Debugging Using Latent Topic Models
David Andrzejewski, Anne Mulhern, Ben Liblit, Xiaojin Zhu
- 17 Sep 2007
- pp 6-17
TL;DR: Qualitative evaluation by domain experts suggests that the novel Delta-Latent-Dirichlet-Allocation model outperforms existing statistical methods for bug cause identification, and may help support other software tasks not addressed by earlier models.
Abstract: Statistical debugging uses machine learning to model program failures and help identify root causes of bugs. We approach this task using a novel Delta-Latent-Dirichlet-Allocation model. We model execution traces attributed to failed runs of a program as being generated by two types of latent topics: normal usage topics and bug topics. Execution traces attributed to successful runs of the same program, however, are modeled by usage topics only. Joint modeling of both kinds of traces allows us to identify weak bug topics that would otherwise remain undetected. We perform model inference with collapsed Gibbs sampling. In quantitative evaluations on four real programs, our model produces bug topics highly correlated to the true bugs, as measured by the Rand index. Qualitative evaluation by domain experts suggests that our model outperforms existing statistical methods for bug cause identification, and may help support other software tasks not addressed by earlier models.
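The generative assumption in the abstract (failed runs mix usage and bug topics, successful runs draw from usage topics only) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the symmetric Dirichlet prior `alpha`, the convention that bug topics are indexed after usage topics, and the treatment of a trace as a sequence of integer "word" ids are all assumptions made for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_trace(failed, n_usage, n_bug, topic_word, length, rng, alpha=1.0):
    """Generate one synthetic trace as a list of (topic, word) pairs.

    Topics 0..n_usage-1 are usage topics; the next n_bug topics are bug
    topics (an indexing convention assumed here). topic_word[t] is the
    word distribution of topic t.
    """
    # Successful runs may only use usage topics; failed runs may use all.
    n_active = n_usage + n_bug if failed else n_usage
    theta = sample_dirichlet([alpha] * n_active, rng)
    trace = []
    for _ in range(length):
        z = rng.choices(range(n_active), weights=theta)[0]
        words = range(len(topic_word[z]))
        w = rng.choices(words, weights=topic_word[z])[0]
        trace.append((z, w))
    return trace


if __name__ == "__main__":
    rng = random.Random(0)
    # Two usage topics and one bug topic over a 3-word vocabulary (made up).
    phi = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
    print(generate_trace(False, 2, 1, phi, 5, rng))
    print(generate_trace(True, 2, 1, phi, 5, rng))
```

Inference runs in the opposite direction: given only the words and the success/failure label of each run, a sampler such as collapsed Gibbs estimates the topic assignments that this sketch draws directly.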
Citations
Probabilistic Topic Models
TL;DR: This paper reviews probabilistic topic models, which can be used to summarize a large collection of documents with a smaller number of distributions over words.
Predicting Program Properties from "Big Code"
Veselin Raychev, Martin Vechev, Andreas Krause
- 14 Jan 2015
TL;DR: This work formulates the problem of inferring program properties as structured prediction and shows how to perform both learning and inference in this context, which opens up new possibilities for attacking a wide range of difficult problems in the context of "Big Code", including invariant generation, decompilation, synthesis and others.
Sourcerer: mining and searching internet-scale software repositories
TL;DR: By combining software textual content with structural information captured by the CodeRank approach, this work is able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10–30% better than previous approaches based on text alone.
A survey on the use of topic models when mining software repositories
TL;DR: This paper surveys 167 articles from the software engineering literature that make use of topic models and provides a starting point for new researchers who are interested in using topic models, and may help new researchers and practitioners determine how to best apply topic models to a particular software engineering task.
Latent Dirichlet Allocation with Topic-in-Set Knowledge
David Andrzejewski, Xiaojin Zhu
- 04 Jun 2009
TL;DR: This work proposes a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling, to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise.
References
Latent dirichlet allocation
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
•Proceedings Article
Latent Dirichlet Allocation
David M. Blei, Andrew Y. Ng, Michael I. Jordan
- 03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Finding scientific topics
TL;DR: A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.
Objective Criteria for the Evaluation of Clustering Methods
TL;DR: This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
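The abstract reports agreement between bug topics and true bugs using the Rand index, the pairwise-agreement measure introduced in the clustering-evaluation reference listed last. A minimal sketch of that measure follows; the function name is hypothetical, and the idea of comparing each failed run's dominant bug topic against its known bug label is an assumption about how the measure might be applied, not a detail taken from the abstract.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings place the two
    items together, or both place them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    true_bug = [0, 0, 1, 1]        # made-up ground-truth bug per failed run
    dominant_topic = [1, 1, 0, 0]  # made-up dominant bug topic per run
    print(rand_index(true_bug, dominant_topic))
```

The index depends only on which items are grouped together, so renaming cluster labels (as in the usage example) does not change the score.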