Journal Article, DOI: 10.1145/2902362
On the naturalness of software
933
TL;DR: This paper investigates the conjecture that most software is also natural, in the sense that it is created by humans at work with all the attendant constraints and limitations, and thus, like natural language, is likely to be repetitive and predictable.
Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very regular, and, in fact, even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
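To make the abstract's two questions concrete, here is a minimal sketch of an n-gram model over code tokens. It is not the authors' implementation: the class name `NGramModel`, the add-one smoothing, the whitespace tokenization, and the toy corpus are all assumptions chosen for brevity. It shows the two uses the abstract mentions: measuring how predictable a token sequence is (cross-entropy, lower meaning more predictable) and suggesting likely next tokens for completion.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Yield (context, next_token) pairs, padding the start with <s>."""
    padded = ["<s>"] * (n - 1) + list(tokens)
    for i in range(n - 1, len(padded)):
        yield tuple(padded[i - n + 1:i]), padded[i]


class NGramModel:
    """Hypothetical toy model: n-gram counts with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        self.vocab = set()

    def train(self, token_sequences):
        for tokens in token_sequences:
            self.vocab.update(tokens)
            for context, tok in ngrams(tokens, self.n):
                self.context_counts[context] += 1
                self.ngram_counts[(context, tok)] += 1

    def prob(self, context, tok):
        # Add-one smoothing (an assumption of this sketch); the extra +1
        # in the vocabulary size reserves mass for unseen tokens.
        v = len(self.vocab) + 1
        seen = self.ngram_counts[(context, tok)]
        return (seen + 1) / (self.context_counts[context] + v)

    def cross_entropy(self, tokens):
        """Average negative log2 probability per token (bits per token)."""
        pairs = list(ngrams(tokens, self.n))
        return -sum(math.log2(self.prob(c, t)) for c, t in pairs) / len(pairs)

    def suggest(self, prefix, k=3):
        """Return the k tokens seen most often after the prefix's context."""
        padded = ["<s>"] * (self.n - 1) + list(prefix)
        context = tuple(padded[len(padded) - (self.n - 1):])
        ranked = sorted(
            self.vocab,
            key=lambda t: (-self.ngram_counts[(context, t)], t),
        )
        return ranked[:k]


# Toy corpus of pre-tokenized, Java-like loop headers (illustrative only).
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
]
model = NGramModel(n=3)
model.train(corpus)

print(model.suggest(["for", "("], k=1))
print(model.cross_entropy(corpus[0]))
print(model.cross_entropy(list(reversed(corpus[0]))))
```

On this toy corpus the familiar loop header scores a lower cross-entropy than the same tokens in reverse order, which is the kind of regularity the paper measures at scale. A realistic setup would use a proper lexer and a stronger smoothing scheme.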
Citations
code2vec: learning distributed representations of code
Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav
- 02 Jan 2019
TL;DR: A neural model that represents a snippet of code as a single fixed-length continuous distributed vector (a code vector), which can be used to predict semantic properties of the snippet; it is the first to successfully predict method names based on a large, cross-project corpus.
1.4K
Competition-level code generation with AlphaCode
Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James L. Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, Oriol Vinyals
Deep code comment generation
TL;DR: DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and uses the learned features to generate better comments for Java methods.
857
A Survey of Machine Learning for Big Code and Naturalness
TL;DR: A survey of research at the intersection of machine learning, programming languages, and software engineering, which has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns in code.
779
Automatically learning semantic features for defect prediction
Song Wang, Taiyue Liu, Lin Tan
- 14 May 2016
TL;DR: This paper proposes using deep learning, a powerful representation-learning approach, to learn semantic representations of programs automatically from source code; a Deep Belief Network learns semantic features from token vectors extracted from the programs' Abstract Syntax Trees.
768
References
Sourcerer: mining and searching internet-scale software repositories
TL;DR: By combining software textual content with structural information captured by the CodeRank approach, this work is able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10–30% better than previous approaches based on text alone.
279
What’s in a Name? A Study of Identifiers
Dawn Lawrie, Christopher H. Morrell, Henry Feild, Dave Binkley
- 14 Jun 2006
TL;DR: A study involving over 100 programmers who were asked to describe twelve different functions shows that full word identifiers lead to the best comprehension; however, in many cases, there is no statistical difference between full words and abbreviations.
262
A study of the uniqueness of source code
Mark Gabel, Zhendong Su
- 07 Nov 2010
TL;DR: The first study of the uniqueness of source code is presented, examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' called syntactic redundancy.
Automatically documenting program changes
Raymond P.L. Buse, Westley Weimer
- 20 Sep 2010
TL;DR: An automatic technique for synthesizing succinct human-readable documentation for arbitrary program differences is presented, based on a combination of symbolic execution and a novel approach to code summarization, that is suitable for supplementing or replacing 89% of existing log messages that directly describe a code change.
Automatically detecting and describing high level actions within methods
Giriprasad Sridhara, Lori Pollock, K. Vijay-Shanker
- 21 May 2011
TL;DR: This work presents an automatic technique for identifying code fragments that implement high-level abstractions of actions and expressing them as natural language descriptions. Judgements of the generated descriptions by 15 experienced Java programmers strongly suggest that they view the identified fragments as representing high-level actions and that the synthesized descriptions accurately express the abstraction.
229
Related Papers (5)
Veselin Raychev, Martin Vechev, Eran Yahav
- 09 Jun 2014
Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton
- 30 Aug 2015