Journal Article, DOI: 10.48550/arXiv.2305.11626
CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
TL;DR: This article proposes a cross-consistency training (CCT) procedure for training language models on source code in different programming languages, with the goal of finding code snippets that operate identically but are written in different languages.
Abstract: We consider the clone detection and information retrieval problems for source code, well-known tasks important for any programming language. Although it is also an important and interesting problem to find code snippets that operate identically but are written in different programming languages, to the best of our knowledge multilingual clone detection has not been studied in the literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset. Moreover, we present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and on the AdvTest code search benchmark with 47.18% MRR; it also shows the best results on the newly created multilingual clone detection benchmark XCD across all programming languages.
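The two headline numbers in the abstract, MAP and MRR, are standard ranking metrics. The following is a minimal sketch of how they are commonly computed, not the authors' evaluation code: the function names and the toy data are hypothetical, and the official benchmark scripts may use truncated variants of these metrics.

```python
from typing import Iterable, Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1 / rank of the single correct item, or 0.0 if it is absent."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision@k over the ranks k at which a relevant item appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero precision.
    return precision_sum / len(relevant_ids)


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean; 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Code search (MRR): each query has exactly one correct snippet (toy data).
search_runs = [
    (["b", "a", "c"], "a"),  # correct snippet at rank 2
    (["x", "y"], "x"),       # correct snippet at rank 1
]
mrr = mean(reciprocal_rank(ranked, gold) for ranked, gold in search_runs)

# Clone detection (MAP): each query may have several clones (toy data).
clone_runs = [
    (["p1", "q", "p2"], {"p1", "p2"}),  # clones at ranks 1 and 3
    (["q", "r1"], {"r1"}),              # clone at rank 2
]
map_score = mean(average_precision(ranked, gold) for ranked, gold in clone_runs)

print(f"MRR = {mrr:.4f}, MAP = {map_score:.4f}")
```

MRR suits code search because each query has a single target snippet, while MAP suits clone detection because a query program can have many valid clones and the metric rewards ranking all of them near the top.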
Citations
Leveraging Statistical Machine Translation for Code Search
Hung Ngoc Phan, Ali Jannesari
- 18 Jun 2024
TL;DR: Oracle4CS is a novel approach that integrates Statistical Machine Translation (SMT) with modern code search models. It uses a new code representation technique called ASTSum and a new code search approach to improve code search performance.
Large Language Models for cross-language code clone detection
Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé
- 08 Aug 2024
TL;DR: This study investigates the effectiveness of Large Language Models (LLMs) and pre-trained embedding models for cross-lingual code clone detection. LLMs achieve high F1 scores, but embedding models outperform them, providing representations suitable for state-of-the-art performance.
Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier
Micheline Bénédicte Moumoula, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé
- 18 Oct 2024
References
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
•Posted Content
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Michael Lewis, Luke Zettlemoyer, Veselin Stoyanov
TL;DR: The authors find that BERT was significantly undertrained and can match or exceed the performance of every model published after it; their best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
•Posted Content
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TL;DR: The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
•Posted Content
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Michael Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
TL;DR: This work presents a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG), models that combine pre-trained parametric and non-parametric memory for language generation, and finds that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
Natural Questions: A Benchmark for Question Answering Research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, Slav Petrov
TL;DR: The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.