Journal Article
DOI: 10.48550/arxiv.2408.04430
Large Language Models for cross-language code clone detection
Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé +3 more
- 08 Aug 2024
TL;DR: This study evaluates Large Language Models (LLMs) and a pre-trained embedding model for cross-lingual code clone detection. LLMs reach high F1 scores on straightforward programs, but a basic classifier trained on the embedding model's representations outperforms them on both datasets studied.
Abstract: With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction in the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We investigate the capabilities of four LLMs and eight prompts for the identification of cross-lingual code clones. Additionally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. Both studies (based on LLMs and embedding models) are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.98, for straightforward programming examples (e.g., from XLCoST). However, they not only perform less well on programs associated with complex programming challenges but also do not necessarily understand the meaning of code clones in a cross-lingual setting. We show that embedding models, used to represent code fragments from different programming languages in the same representation space, enable the training of a basic classifier that outperforms all LLMs by ~2 and ~24 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.
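The embedding-based pipeline described in the abstract (map both code fragments into one representation space, then train a basic classifier on clone and non-clone pairs, scored by F1) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `toy_embed` is a hashed bag-of-tokens stand-in for a real pre-trained embedding model, and the "classifier" is reduced to a single cosine-similarity threshold chosen to maximise F1 on training pairs.

```python
import hashlib
import math
import re

DIM = 64  # size of the toy embedding space (arbitrary assumption)


def toy_embed(code):
    """Hypothetical stand-in for a pre-trained embedding model:
    a hashed bag of tokens, L2-normalised. A real model would go here."""
    vec = [0.0] * DIM
    for tok in re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", code.lower()):
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def f1_score(y_true, y_pred):
    """F1 for the positive (clone) class, with labels 1 = clone, 0 = non-clone."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_threshold(sims, labels):
    """Pick the similarity cut-off that maximises F1 on the training pairs."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(sims)):
        preds = [1 if s >= t else 0 for s in sims]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t


def predict(code_a, code_b, threshold):
    """Label a cross-language pair as clone (1) or non-clone (0)."""
    return int(cosine(toy_embed(code_a), toy_embed(code_b)) >= threshold)
```

In practice the threshold rule would likely be replaced by a trained classifier over the pair of embedding vectors; the sketch only shows how shared-space embeddings turn clone detection into an ordinary binary classification problem that can be scored with F1.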
Figures
Table 1. List of Prompts Designed to Assess the Effectiveness of LLMs for the Task of Cross-Lingual Code Clone Detection
Table 2. Performance of GPT-3.5-Turbo with Various Prompts on the Task of Cross-Lingual Code Clone Detection
Table 3. Performance Comparison of LLMs on the Task of Cross-Lingual Code Clone Detection, Based on the "Simple Prompt"
Table 4. LLM Performance Comparison with the Improved Prompt, Designed Based on Common LLM Behavior
Table 5. Performance of GPT-3.5-Turbo on the Task of Cross-Lingual Code Clone Detection: Detailed F1 Scores by Programming Language Pair (Java - Lang-X)
Table 6. Comparison of LLM, Baselines, and Binary Classifier Performance