Journal Article, DOI: 10.48550/arxiv.2402.05980
Do Large Code Models Understand Programming Concepts? A Black-box Approach
Ashish Hooda,Mihai Christodorescu,Miltos Allamanis,Aaron Wilson,Kassem Fawaz,Somesh Jha +5 more
TL;DR: This work proposes Counterfactual Analysis for Programming Concept Predicates (CACP) as a counterfactual testing framework to evaluate whether Large Code Models understand programming concepts, and suggests that current models lack understanding of concepts such as data flow and control flow.
Abstract: Large Language Models' success on text generation has also made them better at code generation and coding tasks. While a lot of work has demonstrated their remarkable performance on tasks such as code completion and editing, it remains unclear why. We help bridge this gap by exploring to what degree auto-regressive models understand the logical constructs of the underlying programs. We propose Counterfactual Analysis for Programming Concept Predicates (CACP) as a counterfactual testing framework to evaluate whether Large Code Models understand programming concepts. With only black-box access to the model, we use CACP to evaluate ten popular Large Code Models for four different programming concepts. Our findings suggest that current models lack understanding of concepts such as data flow and control flow.
Figures

Table 1. Number of valid counterfactual pairs per mutation type. 
Figure 1. In this example the counterfactual input is generated by negating the relational expression in the if statement. Starcoder (Li et al., 2023) generates an incorrect completion for the input on the right. This suggests that LLMs have an incomplete understanding of programming concepts such as control flow.
Table 2. We compute the AME using the Pass/Fail attribute function as described in subsection 4.3. We only consider problems where the model achieves nonzero accuracy on either the original or the counterfactual setting.
Figure 4. Correlation between AME values across pairs of mutations. The number of samples used to compute each value depends on the size of the intersection of the two mutation types. Independent-Swap: SWAP, IfElse-Flip: IFFP, Variable Names Random: RAND, Variable Names Shuffle: SHUF 
Table 3. Memorization Analysis for the If-Else mutation for Starcoder. We parse Starcoder’s training data and show the relative frequency of appearance of pairs of complementary relational operators. We also show the average change in unit test correctness computed over all valid programs in HumanEval, MBPP and CodeContests. 
Figure 3. AME as a function of model size (number of parameters in Billions). The different model classes are depicted using different colors.
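The Figure 1 caption describes a counterfactual obtained by negating the relational expression in an if statement, and Table 3 refers to pairs of complementary relational operators. A minimal sketch of that single step follows, using Python's standard `ast` module. It is an illustration only, not the authors' implementation: the names `COMPLEMENT`, `NegateIfCondition`, and `negate_if_conditions` are hypothetical, and any further handling the paper's mutation may apply (for example, to the branch bodies) is not stated in this excerpt and is not modeled here.

```python
import ast

# Complementary relational operators: each maps to its logical negation.
# (Hypothetical table for illustration; not taken from the paper.)
COMPLEMENT = {
    ast.Lt: ast.GtE,
    ast.GtE: ast.Lt,
    ast.Gt: ast.LtE,
    ast.LtE: ast.Gt,
    ast.Eq: ast.NotEq,
    ast.NotEq: ast.Eq,
}


class NegateIfCondition(ast.NodeTransformer):
    """Negate single-operator comparisons used directly as `if` tests."""

    def visit_If(self, node):
        self.generic_visit(node)  # process nested if/elif statements first
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and len(test.ops) == 1
            and type(test.ops[0]) in COMPLEMENT
        ):
            # Swap the operator for its complement, e.g. `<` becomes `>=`.
            test.ops = [COMPLEMENT[type(test.ops[0])]()]
        return node


def negate_if_conditions(source: str) -> str:
    """Return `source` with simple relational if-conditions negated."""
    tree = NegateIfCondition().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))  # Python 3.9+


original = """
def clamp(x, lo):
    if x < lo:
        return lo
    return x
"""
print(negate_if_conditions(original))
```

Chained comparisons such as `a < b < c` and boolean combinations are deliberately left untouched, since negating them needs more than a one-operator swap. In a black-box setting, the original and the mutated program would each be given to the model as a prompt, and the completions compared on the same unit tests.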
Citations
An Empirical Study on Capability of Large Language Models in Understanding Code Semantics
Thu-Trang Nguyen,Thanh Trong Vu,Hieu Dinh Vo,Son Nguyen +3 more
- 03 Jul 2024
TL;DR: This study evaluates the capabilities of Large Language Models for Code (code LLMs) in understanding code semantics using a framework called EMPICA, revealing varying robustness and sensitivity across tasks and transformations, highlighting the need for improved model capabilities.
SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications
Lezhi Ma,Shangqing Liu,Lei Bu,Shangru Li,Yida Wang,Yang Liu +5 more
TL;DR: This study proposes SpecEval, a novel black-box evaluation framework to assess code comprehension in large language models via program specifications, revealing limitations in existing LLMs' ability to articulate program semantics and underscoring future enhancement directions.
IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
TL;DR: The prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer is investigated.
Activation Steering for Robust Type Prediction in CodeLLMs
TL;DR: Activation steering for robust type prediction in CodeLLMs makes LLMs more robust to syntactic distractors by editing internal model activations.
References
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron,Louis Martin,Kevin R. Stone,Amjad Almahairi,Soumya Batra,Prajjwal Bhargava,Shruti Bhosale,Daniel M. Bikel,Lukas Blecher,Cristian Canton-Ferrer,Moya Chen,Guillem Cucurull,David Esiobu,Jude Fernandes,Cynthia Gao,Vedanuj Goswami,Naman Goyal,Anthony S. Hartshorn,Saghar Hosseini,Rui Hou,Hakan Inan,Marcin Kardas,Viktor Kerkez,Madian Khabsa,Isabel M. Kloumann,A. V. Korenev,Punit Singh Koura,Marie-Anne Lachaux,Thibaut Lavril,Diana Liskovich,Yinghai Lu,Yuning Mao,Xavier Martinet,Todor Mihaylov,Pushkar Mishra,Igor Molybog,Yixin Nie,Andrew M. Poulton,Jeremy Reizenstein,Rashi Rungta,Kalyan Saladi,Alan Schelten,Eric A. Smith,R. Subramanian,Xia Tan,Binh Tang,Ross Taylor,Adina Williams,Zhengxu Yan,Iliyan Radev Zarov,Yuchen Zhang,Angela Fan,Melanie Rae Kambadur,Sharan Narang,Aurélien Rodriguez,Robert Stojnic,Sergey Edunov,Thomas Scialom +57 more
- 18 Jul 2023
TL;DR: This work develops and releases Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Causal inference in statistics: An overview
TL;DR: A review of recent advances in causal inference can be found in this article, where a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a) is presented.
An Axiomatic Basis for Computer Programming (Reprint).
Abstract: In this paper an attempt is made to explore the logical foundations of computer programming by use of techniques which were first applied in the study of geometry and have later been extended to other branches of mathematics. This involves the elucidation of sets of axioms and rules of inference which can be used in proofs of the properties of computer programs. Examples are given of such axioms and rules, and a formal proof of a simple theorem is displayed. Finally, it is argued that important advantages, both theoretical and practical, may follow from a pursuance of these topics.
chrF: character n-gram F-score for automatic MT evaluation
Maja Popović
- 01 Sep 2015
TL;DR: The proposed use of character n-gram F-score for automatic evaluation of machine translation output shows very promising results, especially for the CHRF3 score – for translation from English, this variant showed the highest segment-level correlations outperforming even the best metrics on the WMT14 shared evaluation task.
Competition-level code generation with AlphaCode
Yujia Li,David H. Choi,Junyoung Chung,Nate Kushman,Julian Schrittwieser,Rémi Leblond,Tom Eccles,James Keeling,Felix Gimeno,Agustin Dal Lago,Thomas Hubert,Peter Choy,Cyprien de Masson d’Autume,Igor Babuschkin,Xinyun Chen,Po-Sen Huang,Johannes Welbl,Sven Gowal,Alexey Cherepanov,James Molloy,Daniel J. Mankowitz,Esme Sutherland Robson,Pushmeet Kohli,Nando de Freitas,Koray Kavukcuoglu,Oriol Vinyals +29 more