Do Large Code Models Understand Programming Concepts? A Black-box Approach

doi:10.48550/arxiv.2402.05980

Journal Article•10.48550/arxiv.2402.05980

Ashish Hooda, +5 more

- 08 Feb 2024

- arXiv.org

- Vol. abs/2402.05980

4

TL;DR: This work proposes Counterfactual Analysis for Programming Concept Predicates (CACP) as a counterfactual testing framework to evaluate whether Large Code Models understand programming concepts, and suggests that current models lack understanding of concepts such as data flow and control flow.

Figures

Table 1. Number of valid counterfactual pairs per mutation type.

Figure 1. In this example, the counterfactual input is generated by negating the relational expression in the if statement. StarCoder (Li et al., 2023) generates an incorrect completion for the input on the right. This suggests that LLMs have an incomplete understanding of programming concepts such as control flow.
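
The mutation described in the Figure 1 caption can be illustrated with a minimal sketch, assuming Python source and the standard `ast` module (Python 3.9+ for `ast.unparse`). The names `COMPLEMENT`, `NegateIfCondition`, `negate_if_conditions`, and `sign_flag` are hypothetical and chosen for illustration; this is not the authors' implementation, and it only negates the relational operator, since the caption does not say whether branches are also rearranged.

```python
import ast

# Each relational operator mapped to its logical complement (illustrative table).
COMPLEMENT = {
    ast.Lt: ast.GtE, ast.GtE: ast.Lt,
    ast.Gt: ast.LtE, ast.LtE: ast.Gt,
    ast.Eq: ast.NotEq, ast.NotEq: ast.Eq,
}


class NegateIfCondition(ast.NodeTransformer):
    """Negate simple single-operator comparisons used as `if` conditions."""

    def visit_If(self, node):
        self.generic_visit(node)  # also handles nested ifs and elif chains
        test = node.test
        if (isinstance(test, ast.Compare)
                and len(test.ops) == 1
                and type(test.ops[0]) in COMPLEMENT):
            test.ops = [COMPLEMENT[type(test.ops[0])]()]
        return node


def negate_if_conditions(source: str) -> str:
    """Return `source` with every simple `if` comparison negated."""
    tree = NegateIfCondition().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = (
    "def sign_flag(x):\n"
    "    if x > 0:\n"
    "        return 1\n"
    "    return 0\n"
)
mutated = negate_if_conditions(original)
print(mutated)  # the condition becomes `x <= 0`
```

A model that tracks control flow should complete or explain the original and the mutated program differently, because the two return different values for the same input.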

Table 2. We compute the AME using the Pass/Fail attribute function as described in subsection 4.3. We only consider problems where the model achieves nonzero accuracy in either the original or the counterfactual setting.
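
The filtering rule in the Table 2 caption can be sketched as follows. This page does not give the AME formula, so the sketch assumes, purely for illustration, that the effect for one problem is the absolute difference between the pass rate on the original input and on its counterfactual, averaged over the retained problems. The function name `average_mutation_effect` and the input layout are hypothetical.

```python
from statistics import mean
from typing import List, Optional, Tuple


def average_mutation_effect(
    pass_rates: List[Tuple[float, float]],
) -> Optional[float]:
    """pass_rates holds (original, counterfactual) pass rates per problem.

    ASSUMPTION: the per-problem effect is |original - counterfactual|;
    the page itself does not define the metric.
    """
    # Keep only problems with nonzero accuracy in at least one setting,
    # as stated in the Table 2 caption.
    kept = [(o, c) for o, c in pass_rates if o > 0 or c > 0]
    if not kept:
        return None
    return mean(abs(o - c) for o, c in kept)


# Three problems: the second is dropped because both pass rates are zero.
example = [(1.0, 0.5), (0.0, 0.0), (0.5, 0.5)]
print(average_mutation_effect(example))  # 0.25
```

Dropping problems that fail in both settings keeps unsolved problems from diluting the average with zero-effect entries.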

Figure 4. Correlation between AME values across pairs of mutations. The number of samples used to compute each value depends on the size of the intersection of the two mutation types. Abbreviations: Independent-Swap (SWAP), IfElse-Flip (IFFP), Variable Names Random (RAND), Variable Names Shuffle (SHUF).

Table 3. Memorization analysis for the If-Else mutation for StarCoder. We parse StarCoder’s training data and show the relative frequency of appearance of pairs of complementary relational operators. We also show the average change in unit test correctness computed over all valid programs in HumanEval, MBPP and CodeContests.

Figure 3. AME as a function of model size (number of parameters in Billions). The different model classes are depicted using different colors.

Citations

Journal Article•10.48550/arxiv.2407.03611

An Empirical Study on Capability of Large Language Models in Understanding Code Semantics

Thu-Trang Nguyen, +3 more

- 03 Jul 2024

TL;DR: This study evaluates the capabilities of Large Language Models for Code (code LLMs) in understanding code semantics using a framework called EMPICA, revealing varying robustness and sensitivity across tasks and transformations, highlighting the need for improved model capabilities.

Journal Article•10.48550/arxiv.2409.12866

SpecEval: Evaluating Code Comprehension in Large Language Models via Program Specifications

Lezhi Ma, +5 more

- 19 Sep 2024

- arXiv.org

TL;DR: This study proposes SpecEval, a novel black-box evaluation framework to assess code comprehension in large language models via program specifications, revealing limitations in existing LLMs' ability to articulate program semantics and underscoring future enhancement directions.

Journal Article•10.48550/arxiv.2403.03894

IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

Indraneil Paul, +3 more

- 06 Mar 2024

- arXiv.org

TL;DR: The prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer is investigated.

Journal Article•10.48550/arxiv.2404.01903

Activation Steering for Robust Type Prediction in CodeLLMs

Francesca Lucchetti, +1 more

- 02 Apr 2024

- arXiv.org

TL;DR: Activation steering for robust type prediction in CodeLLMs makes LLMs more robust to syntactic distractors by editing internal model activations.

References

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, +57 more

- 18 Jul 2023

TL;DR: This article developed and released Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.

5.7K

•Journal Article•10.1214/09-SS057

Causal inference in statistics: An overview

Judea Pearl

- 15 Jul 2009

- Statistics Surveys

TL;DR: A review of recent advances in causal inference can be found in this article, where a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a) is presented.

2.4K

•Journal Article

An Axiomatic Basis for Computer Programming (Reprint).

C. A. R. Hoare

- 01 Jan 1983

- Communications of The ACM

Abstract: In this paper an attempt is made to explore the logical foundations of computer programming by use of techniques which were first applied in the study of geometry and have later been extended to other branches of mathematics. This involves the elucidation of sets of axioms and rules of inference which can be used in proofs of the properties of computer programs. Examples are given of such axioms and rules, and a formal proof of a simple theorem is displayed. Finally, it is argued that important advantages, both theoretical and practical, may follow from a pursuance of these topics.

1.5K

•Proceedings Article•10.18653/V1/W15-3049

chrF: character n-gram F-score for automatic MT evaluation

Maja Popović

- 01 Sep 2015

TL;DR: The proposed use of character n-gram F-score for automatic evaluation of machine translation output shows very promising results, especially for the CHRF3 score – for translation from English, this variant showed the highest segment-level correlations outperforming even the best metrics on the WMT14 shared evaluation task.

1.3K

Competition-level code generation with AlphaCode