Large Language Models for cross-language code clone detection

doi:10.48550/arxiv.2408.04430

Journal Article

Micheline Bénédicte Moumoula, +3 more

- 08 Aug 2024

TL;DR: This study evaluates Large Language Models (LLMs) and a pre-trained embedding model for cross-lingual code clone detection. LLMs reach high F1 scores (up to 0.98 on XLCoST), but a basic classifier trained on the embedding model's representations outperforms all of them, showing that these embeddings are suitable for state-of-the-art performance.

Abstract: With the involvement of multiple programming languages in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We investigate the capabilities of four LLMs and eight prompts for the identification of cross-lingual code clones. Additionally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. Both studies (based on LLMs and on embedding models) are evaluated using two widely used cross-lingual datasets, XLCoST and CodeNet. Our results show that LLMs can achieve high F1 scores, up to 0.98, for straightforward programming examples (e.g., from XLCoST). However, they perform less well on programs associated with complex programming challenges, and they do not necessarily understand the meaning of code clones in a cross-lingual setting. We show that embedding models, used to represent code fragments from different programming languages in the same representation space, enable the training of a basic classifier that outperforms all LLMs by ~2 and ~24 percentage points on the XLCoST and CodeNet datasets, respectively. This finding suggests that, despite the apparent capabilities of LLMs, embeddings provided by embedding models offer suitable representations to achieve state-of-the-art performance in cross-lingual code clone detection.
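
The embedding-based pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-tokens stand-in for a pre-trained embedding model, and the "basic classifier" is reduced to a single cosine-similarity threshold chosen to maximize F1 on labeled training pairs. All function names here are invented for the example.

```python
import math
import re
import zlib
from typing import List, Sequence


def toy_embed(code: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for a pre-trained embedding model.

    Maps code from any language into one shared vector space by hashing
    tokens into `dim` buckets. A real model would replace this function.
    """
    vec = [0.0] * dim
    for tok in re.findall(r"[A-Za-z_]+|\d+|[^\sA-Za-z_\d]", code):
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (clone) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_threshold(sims: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the similarity cut-off that maximizes F1 on the training pairs."""
    candidates = sorted(set(sims))
    return max(
        candidates,
        key=lambda t: f1_score(labels, [int(s >= t) for s in sims]),
    )


def predict_clone(code_a: str, code_b: str, threshold: float) -> bool:
    """Classify a cross-language pair as clone (True) or non-clone (False)."""
    return cosine(toy_embed(code_a), toy_embed(code_b)) >= threshold
```

In use, one would compute `cosine(toy_embed(a), toy_embed(b))` for each labeled training pair, pass the similarities and labels to `fit_threshold`, and then score held-out pairs with `predict_clone` and `f1_score`. Swapping the threshold for a classifier trained on the concatenated pair embeddings would be closer in spirit to the setup the abstract describes, but the exact features and classifier are not specified here.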

Figures

Table 1. List of Prompts Designed to Assess the Effectiveness of LLMs for the Task of Cross-Lingual Code Clone Detection

Table 2. Performance of GPT-3.5-Turbo with Various Prompts on the Task of Cross-Lingual Code Clone Detection

Table 3. Performance Comparison of LLMs on the Task of Cross-Lingual Code Clone Detection, Based on the "Simple Prompt"

Table 4. LLM Performance Comparison with the Improved Prompt, Designed Based on the LLMs' Common Behavior

Table 5. Performance of GPT-3.5-Turbo on the Task of Cross-Lingual Code Clone Detection: Detailed F1 Scores by Programming Language Pair (Java - Lang-X)

Table 6. Comparison of LLM, Baselines, and Binary Classifier Performance
