A Survey of Machine Learning for Big Code and Naturalness
TL;DR: This article surveys research at the intersection of machine learning, programming languages, and software engineering that proposes learnable probabilistic models of source code exploiting the abundance of patterns in code.
Abstract: Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
Citations
•Posted Content
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
Shuo Ren,Daya Guo,Shuai Lu,Long Zhou,Shujie Liu,Duyu Tang,Neel Sundaresan,Ming Zhou,Ambrosio Blanco,Shuai Ma +9 more
TL;DR: This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in n-gram matching and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow. It achieves a better correlation with programmer-assigned scores than BLEU and accuracy.
427
Software Vulnerability Detection Using Deep Neural Networks: A Survey
Guanjun Lin,Sheng Wen,Qing-Long Han,Jun Zhang,Yang Xiang +4 more
- 04 Jun 2020
TL;DR: This survey reviews the current literature adopting deep-learning-/neural-network-based approaches for detecting software vulnerabilities, aiming at investigating how the state-of-the-art research leverages neural techniques for learning and understanding code semantics to facilitate vulnerability discovery.
417
A neural model for generating natural language summaries of program subroutines
Alexander LeClair,Siyuan Jiang,Collin McMillan +2 more
- 25 May 2019
TL;DR: This article presents a neural model that combines words from code with code structure from an AST, which allows the model to learn code structure independently of the text in the code.
390
CoCoNuT: combining context-aware neural translation models using ensemble for program repair
Thibaud Lutellier,Hung Viet Pham,Lawrence Pang,Yitong Li,Moshi Wei,Lin Tan +5 more
- 18 Jul 2020
TL;DR: This work presents CoCoNuT, a new generate-and-validate (G&V) technique that uses ensemble learning on the combination of convolutional neural networks (CNNs) and a new context-aware neural machine translation (NMT) architecture to automatically fix bugs in multiple programming languages.
376
•Posted Content
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu,Daya Guo,Shuo Ren,Junjie Huang,Alexey Svyatkovskiy,Ambrosio Blanco,Colin B. Clement,Dawn Drain,Daxin Jiang,Duyu Tang,Ge Li,Lidong Zhou,Linjun Shou,Long Zhou,Michele Tufano,Ming Gong,Ming Zhou,Nan Duan,Neel Sundaresan,Shao Kun Deng,Fu Shengyu,Shujie Liu +21 more
TL;DR: CodeXGLUE is a benchmark dataset to foster machine learning research for program understanding and generation. It includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.
332
References
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
99K
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni,Salim Roukos,Todd Ward,Wei-Jing Zhu +3 more
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
•Proceedings Article
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov,Kai Chen,Greg S. Corrado,Jeffrey Dean +3 more
- 16 Jan 2013
TL;DR: Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
27.5K
•Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau,Kyunghyun Cho,Yoshua Bengio +2 more
- 01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this architecture by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
25.7K
Anomaly detection: A survey
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Related Papers (5)
Uri Alon,Meital Zilberstein,Omer Levy,Eran Yahav +3 more
- 02 Jan 2019
Veselin Raychev,Martin Vechev,Eran Yahav +2 more
- 09 Jun 2014