Journal Article10.48550/arxiv.2403.03894
IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
Indraneil Paul,Jun Luo,Goran Glavas,Iryna Gurevych +3 more
5
TL;DR: The prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer is investigated.
read more
Abstract: Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer. To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
read more
Chat with Paper
AI Agents for this Paper
Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps
Citations
A Survey on Large Language Models for Code Generation
J.-H.R. Jiang,Fan Wang,Jiasi Shen,Seong-Ju Kim,Sunghun Kim +4 more
- 01 Jun 2024
TL;DR: This survey provides a comprehensive review of Large Language Models (LLMs) for code generation, introducing a taxonomy, historical overview, and empirical comparison of recent developments, highlighting advancements, challenges, and opportunities in this burgeoning field.
Continual Learning of Large Language Models: A Comprehensive Survey
Henry X. Shi,Xu Zhang,Hengyi Wang,Qin Wang,Wenyuan Wang,Yibin Wang,Hao Wang +6 more
- 25 Apr 2024
TL;DR: A survey on continual learning of large language models focusing on integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences.
A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages
Jie Wu,Fatemeh Fard +1 more
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular programming languages. However, their performance in Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a critical challenge. This gap affects millions of developers - with Rust alone having 3.5 million users - who are currently unable to fully leverage LLM capabilities. LRPLs and DSLs face unique challenges, including severe data scarcity and, for DSLs, highly specialized syntax and semantics that are poorly represented in general-purpose datasets. Addressing these challenges is crucial as LRPLs and DSLs significantly enhance development efficiency in specialized domains and applications, including financial and scientific works. While several surveys on LLMs for software engineering and code exist, none comprehensively address the challenges and opportunities specific to LRPLs and DSLs. Our survey fills this gap by providing a systematic review of the current state, methodologies, and challenges in leveraging LLMs for code generation in LRPL and DSL. We filtered 111 papers from over 27,000 published studies from 2020 – 2024 to understand the capabilities and limitations of LLMs in these specialized domains. We also expanded our literature search to include 5 recent papers from 2024 – 2025. We report LLMs used, benchmarks, and metrics to evaluate code generation in LRPLs and DSLs, as well as strategies used to enhance LLM performance, and the collected datasets and curation methods in this context. We identified four main evaluation techniques used in the literature, along with several metrics to assess code generation in LRPL and DSL. We categorized the methods used for LLM improvement into six main groups and summarized the novel methods and architectures proposed by the researchers. We also classified different approaches used for data collection and preparation. While different techniques, metrics, and datasets are used, there is a lack of a standard approach and a benchmark dataset to evaluate code generation in several LRPLs and DSLs. We discuss several distinctions of the studied approaches with the ones used in high-resource programming languages (HRPLs), as well as several challenges unique to these languages, especially DSLs. The challenges stem from the scarcity of data, the unique requirements, and specialized domains, which often need expertise guidelines or domain-specific tools. Accordingly, we provide insights into different research opportunities for the studied aspects. This survey serves as a comprehensive resource for researchers and practitioners working at the intersection of LLMs, software engineering, and specialized programming languages, providing a foundation for future advancements in LRPL and DSL code generation. A GitHub repository was created to organize the papers of this survey at https://github.com/jie-jw-wu/Survey-CodeLLM4LowResource-DSL .
4
Analysing Software Quality of AI-Translated Code: A Comparative Study of Large Language Models Using Static Analysis
Vikram Bhutani,Farshad Ghassemi Toosi,Jim Buckley +2 more
Abstract: Abstract Context: Source code translation enables cross-platform compatibility, code reusability, legacy system migration, and developer collaboration. Numerous state-of-the-art techniques have emerged to address demand for efficient and accurate translation methodologies. Objective: This study compares code translation capabilities of Large Language Models (LLMs), specifically DeepSeek R1 and ChatGPT 4.1, evaluating their proficiency in translating code between programming languages. We systematically assess model outputs through quantitative and qualitative measures, focusing on translation accuracy, execution efficiency, and coding standard conformity. By examining each model’s strengths and limitations, this work provides insights into their applicability for various translation scenarios and contributes to discourse on LLM potential in software engineering. Method: We evaluated translation quality from ChatGPT 4.1 and DeepSeek R1 using SonarQube Analyzer to identify strengths and weaknesses through comprehensive software metrics including translation accuracy, code quality, and clean code attributes. SonarQube’s framework enables objective quantification of maintainability, reliability, technical debt, and code smells which are critical factors in software quality measurement. The protocol involved randomly sampling 500 code instances from 1695 Java programming problems. Java samples were translated to Python by both models, then analysed quantitatively using SonarQube metrics to evaluate adherence to software engineering best practices. Results: This comparative analysis reveals capabilities and limitations of state-of-the-art LLM-based translation systems, providing developers, researchers, and practitioners actionable guidance for model selection. Identified gaps highlight future research directions in automated code translation. Result s demonstrate that DeepSeek R1 consistently generates superior software quality compared to ChatGPT 4.1 across Sonar-Qube metrics.
Generating Equivalent Representations of Code By A Self-Reflection Approach
TL;DR: This paper proposes a self-reflection approach to generate Equivalent Representations (ERs) of code using Large Language Models (LLMs), enabling ERs in open and constrained settings, and presents eight findings on ER generation and its applications in software engineering tasks.
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma,Jimmy Ba +1 more
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
138.5K
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau,Kartikay Khandelwal,Naman Goyal,Vishrav Chaudhary,Guillaume Wenzek,Francisco Guzmán,Edouard Grave,Myle Ott,Luke Zettlemoyer,Veselin Stoyanov +9 more
- 01 Jul 2020
TL;DR: It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
LLVM: a compilation framework for lifelong program analysis & transformation
Chris Lattner,Vikram Adve +1 more
- 20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
Linting Xue,Noah Constant,Adam Roberts,Mihir Kale,Rami Al-Rfou,Aditya Siddhant,Aditya Barua,Colin Raffel +7 more
- 01 Jun 2021
TL;DR: This paper proposed a multilingual variant of T5, mT5, which was pre-trained on a new Common Crawl-based dataset covering 101 languages and achieved state-of-the-art performance on many multilingual benchmarks.
On the resemblance and containment of documents
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.