On the naturalness of software

doi:10.1145/2902362

Journal Article•10.1145/2902362

Abram Hindle, +4 more

- 26 Apr 2016

- Communications of The ACM

- Vol. 59, Iss: 5, pp 122-131

933

TL;DR: This paper investigates the conjecture that most software is also natural, in the sense that it is created by humans at work with all the attendant constraints and limitations, and is therefore, like natural language, likely to be repetitive and predictable.

Abstract: Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very regular, and, in fact, even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
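To make the abstract's two questions concrete, here is a minimal sketch of an n-gram model over code tokens. It assumes the simplest case (a bigram model with add-one smoothing) and a toy, hand-tokenized Java-like corpus; none of this is the paper's actual setup, and all function names (`train_bigram`, `cross_entropy`, `suggest`) are hypothetical. Cross-entropy measures how predictable a token sequence is under the model (lower means more regular), and the same counts can rank likely next tokens for completion.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        context_counts[prev] += 1
        bigram_counts[prev][nxt] += 1
    return context_counts, bigram_counts

def prob(model, vocab_size, prev, nxt):
    """P(nxt | prev) with add-one smoothing (an assumption of this sketch)."""
    context_counts, bigram_counts = model
    seen = bigram_counts.get(prev, {}).get(nxt, 0)
    return (seen + 1) / (context_counts.get(prev, 0) + vocab_size)

def cross_entropy(model, vocab_size, tokens):
    """Average negative log2 probability per predicted token, in bits."""
    pairs = list(zip(tokens, tokens[1:]))
    total = sum(math.log2(prob(model, vocab_size, p, n)) for p, n in pairs)
    return -total / len(pairs)

def suggest(model, prev, k=3):
    """Return up to k most frequent tokens observed after `prev`."""
    _, bigram_counts = model
    return [tok for tok, _ in bigram_counts.get(prev, Counter()).most_common(k)]

# Toy training corpus: two structurally identical loops, split on whitespace.
corpus = ("for ( int i = 0 ; i < n ; i ++ ) { } "
          "for ( int j = 0 ; j < m ; j ++ ) { }").split()
model = train_bigram(corpus)
vocab_size = len(set(corpus))

regular = "for ( int i = 0 ;".split()    # follows the pattern seen in training
scrambled = list(reversed(regular))      # same tokens, unfamiliar order

print("regular  :", cross_entropy(model, vocab_size, regular))
print("scrambled:", cross_entropy(model, vocab_size, scrambled))
print("after 'for':", suggest(model, "for", k=1))
```

On this toy data the familiar loop header scores a lower cross-entropy than the scrambled one, which is the sense in which repetitive code is "predictable"; a real experiment would use longer n-grams, a stronger smoothing method, and held-out projects.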

Citations

•Journal Article•10.1145/3290353

code2vec: learning distributed representations of code

Uri Alon, +3 more

- 02 Jan 2019

TL;DR: A neural model that represents each snippet of code as a single fixed-length, continuous distributed code vector, which can be used to predict semantic properties of the snippet; it is the first to successfully predict method names based on a large, cross-project corpus.

1.4K

•Proceedings Article•10.1145/3196321.3196334

Deep code comment generation

Xing Hu, +4 more

- 28 May 2018

TL;DR: DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments for Java methods from the learned features.

857

•Journal Article•10.1145/3212695

A Survey of Machine Learning for Big Code and Naturalness

Miltiadis Allamanis, +3 more

- 31 Jul 2018

- ACM Computing Surveys

TL;DR: A survey of research at the intersection of machine learning, programming languages, and software engineering, which has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code.

779

Proceedings Article•10.1145/2884781.2884804

Automatically learning semantic features for defect prediction

Song Wang, +2 more

- 14 May 2016

TL;DR: This paper proposes to leverage a powerful representation-learning algorithm, deep learning, to learn semantic representations of programs automatically from source code, using a Deep Belief Network to learn semantic features from token vectors extracted from programs' Abstract Syntax Trees.

768

References

Journal Article•10.1007/S10618-008-0118-X

Sourcerer: mining and searching internet-scale software repositories

Erik Linstead, +5 more

- 01 Apr 2009

- Data Mining and Knowledge Discovery

TL;DR: By combining software textual content with structural information captured by the CodeRank approach, this work is able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10–30% better than previous approaches based on text alone.

279

Proceedings Article•10.1109/ICPC.2006.51

What's in a Name? A Study of Identifiers

Dawn Lawrie, +3 more

- 14 Jun 2006

TL;DR: A study involving over 100 programmers who were asked to describe twelve different functions shows that full word identifiers lead to the best comprehension; however, in many cases, there is no statistical difference between full words and abbreviations.

262

•Proceedings Article•10.1145/1882291.1882315

A study of the uniqueness of source code

Mark Gabel, +1 more

- 07 Nov 2010

TL;DR: The first study of the uniqueness of source code is presented, examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that is called syntactic redundancy.

257

Proceedings Article•10.1145/1858996.1859005

Automatically documenting program changes

Raymond P.L. Buse, +1 more

- 20 Sep 2010

TL;DR: An automatic technique for synthesizing succinct human-readable documentation for arbitrary program differences is presented, based on a combination of symbolic execution and a novel approach to code summarization, that is suitable for supplementing or replacing 89% of existing log messages that directly describe a code change.

243

Proceedings Article•10.1145/1985793.1985808

Automatically detecting and describing high level actions within methods

Giriprasad Sridhara, +2 more

- 21 May 2011

TL;DR: This work presents an automatic technique for identifying code fragments that implement high level abstractions of actions and expressing them as a natural language description. Judgements of the generated descriptions by 15 experienced Java programmers strongly suggest that they do view the identified fragments as representing high level actions, and that the synthesized descriptions accurately express the abstraction.

229

Related Papers (5)

A Survey of Machine Learning for Big Code and Naturalness

Code completion with statistical language models

Suggesting accurate method and class names

Long short-term memory

Summarizing Source Code using a Neural Attention Model
