Cross-project code clones in GitHub

doi:10.1007/S10664-018-9648-Z

Journal Article•10.1007/S10664-018-9648-Z

Mohammad Gharehyazie, +6 more

- 01 Jun 2019

- Empirical Software Engineering

- Vol. 24, Iss: 3, pp 1538-1573

41

TL;DR: This paper presents an in-depth empirical study of cloning in GitHub and a novel tool, CLONE-HUNTRESS, that streamlines finding and tracking code clones in GitHub. The tool is GitHub-integrated, built around a user-friendly interface, and runs efficiently over a modern database system.

Abstract: Code reuse has well-known benefits on code quality, coding efficiency, and maintenance. Open Source Software (OSS) programmers gladly share their own code and they happily reuse others’. Social programming platforms like GitHub have normalized code foraging via their common platforms, enabling code search and reuse across different projects. Removing project borders may facilitate more efficient code foraging, and consequently faster programming. But looking for code across projects takes longer and, once found, may be more challenging to tailor to one’s needs. Learning how much code reuse goes on across projects, and identifying emerging patterns in past cross-project search behavior may help future foraging efforts. Our contribution is twofold. First, to understand cross-project code reuse, here we present an in-depth empirical study of cloning in GitHub. Using Deckard, a popular clone finding tool, we identified copies of code fragments across projects, and investigated their prevalence and characteristics using statistical and network science approaches, and with multiple case studies. By triangulating findings from different analysis methods, we find that cross-project cloning is prevalent in GitHub, ranging from cloning a few lines of code to whole project repositories. Some of the projects serve as popular sources of clones, and others seem to contain more clones than their fair share. Moreover, we find that ecosystem cloning follows an onion model: most clones come from the same project, then from projects in the same application domain, and finally from projects in different domains. Second, we utilized these results to develop a novel tool named CLONE-HUNTRESS that streamlines finding and tracking code clones in GitHub. The tool is GitHub-integrated, built around a user-friendly interface, and runs efficiently over a modern database system. We describe the tool and make it publicly available at http://clone-det.ictic.sharif.edu/ .
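The onion model in the abstract can be illustrated with a minimal sketch that assigns each clone pair to a layer and counts pairs per layer. This is not the paper's method or the Deckard output format: the `Fragment` type, its `project` and `domain` fields, the layer names, and the sample pairs are all hypothetical and made up for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Fragment(NamedTuple):
    project: str  # hypothetical repository identifier
    domain: str   # hypothetical application-domain label


def onion_layer(a: Fragment, b: Fragment) -> str:
    """Classify a clone pair by how far apart its two fragments are."""
    if a.project == b.project:
        return "same project"
    if a.domain == b.domain:
        return "same domain"
    return "different domain"


def layer_counts(pairs):
    """Count clone pairs per onion layer."""
    return Counter(onion_layer(a, b) for a, b in pairs)


# Made-up clone pairs, for illustration only.
pairs = [
    (Fragment("org/app", "web"), Fragment("org/app", "web")),
    (Fragment("org/app", "web"), Fragment("org/site", "web")),
    (Fragment("org/app", "web"), Fragment("lab/solver", "scientific")),
    (Fragment("lab/solver", "scientific"), Fragment("lab/solver", "scientific")),
]
counts = layer_counts(pairs)
print(counts)
```

Under the onion model, the count for the same-project layer would be expected to be the largest and the different-domain layer the smallest when run over real clone pairs; the toy data above only demonstrates the bookkeeping.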

Citations

•Posted Content

The Adverse Effects of Code Duplication in Machine Learning Models of Code

Miltiadis Allamanis

- 16 Dec 2018

- arXiv: Software Engineering

TL;DR: The effects of code duplication on machine learning models are explored, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers.

233

•Proceedings Article•10.1145/3338906.3338909

Why aren’t regular expressions a lingua franca? an empirical study on the re-use and portability of regular expressions

James C. Davis, +4 more

- 12 Aug 2019

TL;DR: It is reported that developers’ belief in a regex lingua franca is understandable but unfounded: though most regexes compile across language boundaries, 15% exhibit semantic differences across languages and 10% exhibit performance differences across languages.

56

•Proceedings Article•10.1109/ICSE43902.2021.00083

Centris: A Precise and Scalable Approach for Identifying Modified Open-Source Software Reuse

Seunghoon Woo, +4 more

- 22 May 2021

TL;DR: Centris identifies modified OSS reuse in the presence of nested OSS components by segmenting an OSS code base and detecting the reuse of a unique part of the OSS only.

29

•Journal Article•10.1007/S10664-020-09825-8

PHANTOM: Curating GitHub for engineered software projects using time-series clustering

Peter Pickerill, +3 more

- 25 Apr 2019

- arXiv: Software Engineering

TL;DR: PHANTOM extracts five measures from GitHub logs and converts each measure into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.

21

•Journal Article•10.1007/S10664-020-09825-8

PHANTOM: Curating GitHub for engineered software projects using time-series clustering

Peter Pickerill, +4 more

- 01 Jul 2020

- Empirical Software Engineering

TL;DR: PHANTOM extracts five measures from GitHub logs and transforms each measure into a time-series, which is represented as a feature vector for clustering using the k-means algorithm.

18

References

•Journal Article•10.1109/TSE.2002.1019480

CCFinder: a multilinguistic token-based code clone detection system for large scale source code

Toshihiro Kamiya, +2 more

- 01 Jul 2002

- IEEE Transactions on Software Engineerin...

TL;DR: A new clone detection technique is proposed, consisting of a transformation of the input source text and a token-by-token comparison. The technique found clones effectively, and the metrics effectively identified the characteristics of the systems.

1.9K

•Journal Article•10.1109/TSE.2011.104

GenProg: A Generic Method for Automatic Software Repair

C. Le Goues, +3 more

- 01 Jan 2012

- IEEE Transactions on Software Engineerin...

TL;DR: This paper describes GenProg, an automated method for repairing defects in off-the-shelf, legacy programs without formal specifications, program annotations, or special coding practices, and analyzes the generated repairs qualitatively and quantitatively to demonstrate that the process efficiently produces evolved programs that repair the defect.

1.2K

•Proceedings Article•10.1109/ICSE.2007.30

DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones

Lingxiao Jiang, +3 more

- 24 May 2007

TL;DR: This paper presents an efficient algorithm for identifying similar subtrees and applies it to tree representations of source code. The algorithm is implemented as a clone detection tool called DECKARD and evaluated on large code bases written in C and Java, including the Linux kernel and JDK.

1.2K

•Journal Article•10.1016/J.SCICO.2009.02.007

Comparison and evaluation of code clone detection techniques and tools: A qualitative approach

Chanchal K. Roy, +2 more

- 01 May 2009

- Science of Computer Programming

TL;DR: A qualitative comparison and evaluation of the current state of the art in clone detection techniques and tools is provided, along with a taxonomy of editing scenarios that produce different clone types and a qualitative evaluation of current clone detectors.

1.1K

•Proceedings Article•10.1145/2145204.2145396

Social coding in GitHub: transparency and collaboration in an open software repository

Laura Dabbish, +3 more

- 11 Feb 2012

TL;DR: It is found that people make a surprisingly rich set of social inferences from the networked activity information in GitHub, such as inferring someone else's technical goals and vision when they edit code, or guessing which of several similar projects has the best chance of thriving in the long term.

1K

Related Papers (5)

DéjàVu: a map of code duplicates on GitHub

Code duplication on stack overflow

Quality and productivity outcomes relating to continuous integration in GitHub

Curating GitHub for engineered software projects

Scalable Relevant Project Recommendation on GitHub