CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search

doi:10.48550/arXiv.2305.11626

Journal Article•10.48550/arXiv.2305.11626

Nikita Sorokin, +3 more

- 19 May 2023

- arXiv.org

- Vol. abs/2305.11626

TL;DR: In this article, a cross-consistency training (CCT) procedure is proposed to train language models on source code in different programming languages to find code snippets that operate identically but are written in different languages.

Abstract: We consider the clone detection and information retrieval problems for source code, well-known tasks important for any programming language. Although it is also an important and interesting problem to find code snippets that operate identically but are written in different programming languages, to the best of our knowledge multilingual clone detection has not been studied in the literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset. Moreover, we present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and on the AdvTest code search benchmark with 47.18% MRR; it also shows the best results on the newly created multilingual clone detection benchmark XCD across all programming languages.
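The abstract reports results as MAP (mean average precision) and MRR (mean reciprocal rank). As a rough illustration of how these two retrieval metrics are commonly computed, here is a minimal Python sketch. It is not the paper's evaluation code: the function names and the toy relevance lists are hypothetical, and the exact protocol used on POJ-104 and AdvTest (for example, any cutoff on the ranked list) is not stated in this text.

```python
from typing import Callable, Sequence


def reciprocal_rank(ranked_relevance: Sequence[bool]) -> float:
    """Return 1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average the precision values measured at each relevant position.

    Assumption: the ranked list contains every relevant item, so the number
    of hits in the list equals the total number of relevant items.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_over_queries(
    metric: Callable[[Sequence[bool]], float],
    queries: Sequence[Sequence[bool]],
) -> float:
    """Average a per-query metric over all queries."""
    return sum(metric(q) for q in queries) / len(queries)


# Hypothetical toy data: one ranked result list per query,
# where True marks a relevant item (a true clone or the matching snippet).
toy_queries = [
    [False, True, True, False],
    [True, False, False, False],
]

mrr = mean_over_queries(reciprocal_rank, toy_queries)
map_score = mean_over_queries(average_precision, toy_queries)
print(f"MRR = {mrr:.4f}, MAP = {map_score:.4f}")
```

MRR only rewards the position of the first correct result, which suits code search where each query has a single target, while MAP accounts for every relevant item, which suits clone detection where a query can have many clones.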

Citations

Journal Article•10.1145/3661167.3661233

Leveraging Statistical Machine Translation for Code Search

Hung Ngoc Phan, +1 more

- 18 Jun 2024

TL;DR: Oracle4CS is a novel approach that integrates Statistical Machine Translation (SMT) with modern code search models. It uses a new code representation technique called ASTSum and a fresh code search approach to improve code search performance.

Journal Article•10.48550/arxiv.2408.04430

Large Language Models for cross-language code clone detection

Micheline Bénédicte Moumoula, +3 more

- 08 Aug 2024

TL;DR: This study investigates the effectiveness of Large Language Models (LLMs) and pre-trained embedding models for cross-lingual code clone detection. LLMs achieve high F1 scores, but embedding models outperform them, providing representations suited to state-of-the-art performance.

Journal Article•10.1145/3691620.3695335

Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier

TL;DR: The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.
