TL;DR: An extensive systematic literature review of software clones in general and software clone detection in particular calls for an increased awareness of the potential benefits of software clone management, and identifies the need to develop semantic and model clone detection techniques.
Abstract: Context: Reusing software by means of copy and paste is a frequent activity in software development. The duplicated code is known as a software clone and the activity is known as code cloning. Software clones may lead to bug propagation and serious maintenance problems. Objective: This study reports an extensive systematic literature review of software clones in general and software clone detection in particular. Method: We used the standard systematic literature review method based on a comprehensive set of 213 articles from a total of 2039 articles published in 11 leading journals and 37 premier conferences and workshops. Results: Existing literature about software clones is classified broadly into different categories. The importance of semantic clone detection and model-based clone detection led to different classifications. Empirical evaluation of clone detection tools/techniques is presented. Clone management, its benefits and its cross-cutting nature are reported. The number of studies pertaining to nine different types of clones is reported. Thirteen intermediate representations and 24 match detection techniques are reported. Conclusion: We call for an increased awareness of the potential benefits of software clone management, and identify the need to develop semantic and model clone detection techniques. Recommendations are given for future research.
TL;DR: The public project's sequencing strategy involved producing a map of the human genome, and then pinning sequence to it to avoid errors in the sequence, especially in repetitive regions.
Abstract: The public project's sequencing strategy involved producing a map of the human genome, and then pinning sequence to it. This helps to avoid errors in the sequence, especially in repetitive regions.
TL;DR: A time efficient detection approach is proposed to find exact and near-miss clones in software, especially in large scale software systems.
Abstract: Despite the fact that duplicated fragments of code, also called code clones, are considered one of the prominent code smells that may exist in software, cloning is widely practiced in industrial development. The larger the system, the more people are involved in its development and the more parts are developed by different teams, which increases the possibility of having cloned code in the system. While there are particular benefits of code cloning in software development, research shows that it might be a source of various troubles in evolving software. Therefore, investigating and understanding clones in a software system is important to manage the clones efficiently. However, when the system is fairly large, it is challenging to identify and manage those clones properly. Among the various types of clones that may exist in software, research shows that detecting near-miss clones, where the cloned fragments may differ in minor to significant ways (e.g., renaming of identifiers and additions/deletions/modifications of statements), is costly in terms of time and memory. Thus, there is a great demand for state-of-the-art technologies for dealing with clones in software. Over the years, several tools have been developed to detect and visualize exact and similar clones. However, the tools are usually standalone and do not integrate well with a software developer’s workflow. In this thesis, first, a study is presented on the effectiveness of a fingerprint-based data similarity measurement technique named ‘simhash’ in detecting clones in large-scale code bases. Based on the positive outcome of the study, a time-efficient detection approach is proposed to find exact and near-miss clones in software, especially in large-scale software systems.
The novel detection approach has been made available as a highly configurable and fully fledged standalone clone detection tool named ‘SimCad’, which can be configured for detection of clones in both source code and non-source code based data. Second, we show a robust use of the clone detection approach studied earlier by assembling its detection service as a portable library named ‘SimLib’. This library can provide tightly coupled (integrated) clone detection functionality to other applications, as opposed to the loosely coupled service provided by a typical standalone tool. Because it is highly configurable and easily extensible, this library allows the user to customize its clone detection process for detecting clones in data having diverse characteristics. We performed a user study to get feedback on the installation and use of the ‘SimLib’ API (Application Programming Interface) and to uncover its potential use as a third-party clone detection library. Third, we investigated what tools and techniques are currently in use to detect and manage clones and to understand their evolution. The goal was to find how those tools and techniques can be made available in a developer’s own software development platform for convenient identification, tracking and management of clones in the software. Based on that, we developed a clone-aware software development platform named ‘SimEclipse’ to promote the practical use of code clone research and to provide better support for clone management in software. Finally, we evaluated ‘SimEclipse’ by conducting a user study on its effectiveness, usability and information management. We believe that both researchers and developers would benefit from using these tools in different aspects of code clone research and in managing cloned code in software systems.
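The fingerprint-based similarity idea behind ‘simhash’ can be illustrated with a minimal Python sketch. This is a generic simhash over code tokens, not SimCad's actual implementation: the tokenizer, the use of MD5 as the per-feature hash, the 64-bit width and all function names are assumptions made for illustration only.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumption for this sketch)


def tokens(code: str) -> list[str]:
    """Crude lexer: identifiers, integer literals, or single non-space symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def simhash(features: list[str]) -> int:
    """Combine per-feature hashes into one fingerprint by bitwise majority vote."""
    weights = [0] * BITS
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "total = price * quantity"
    near_miss = "total = price * quantity + tax"
    d = hamming(simhash(tokens(original)), simhash(tokens(near_miss)))
    print("Hamming distance between fingerprints:", d)
```

Fragments whose fingerprints differ in only a few bits are candidate near-miss clones; the distance threshold that counts as "few" would have to be tuned and is not specified here.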
TL;DR: The study reported the use of clone detection in finding commonalities in the form of domain concepts in source code which will help analysts in understanding the design of the system for better maintenance.
Abstract: Abstract syntax trees and parse trees are frequently used representations when source code is to be transformed into tree structures. However, tools based on this approach suffer from long execution times when analysing a large source code base. The output consists of purely syntactic units of source code which are ready for refactoring. Tree-based clone detection is capable of detecting clones in which code is inserted or deleted, i.e. type-3 clones. Yang [241] proposed one of the first approaches for finding the syntactic differences between two versions of the same program. The technique is grammar based and builds a variant of a parse tree for both versions. Detection is applied synchronously to both trees and is based on the longest common subsequence method of dynamic programming. A limitation of this approach is that the differential comparator can only work for syntactically correct programs conforming to the grammar. Semantic Designs' CloneDR [23] is another tool which is able to detect exact and near-miss clones using hashing and dynamic programming. The tool has different variants for different programming languages. The study reported the use of clone detection in finding commonalities in the form of domain concepts in source code which will help analysts in understanding the design of the system for better maintenance. SimScan [211] and ccdiml [20] are variations of CloneDR. ccdiml transforms the source to an intermediate representation and SimScan applies subtree comparison on the parsed source code. The source code is parsed with the help of the ANTLR parser generator. SimScan and ccdiml have been used to classify the evolution of source code clone fragments in Java and C source code files [227]. Falke et al. [55] and Tairas and Gray [220] used suffix trees to detect clones in code transformed into an AST. The technique combines the precision of syntax trees with the high speed of suffix trees.
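The idea of comparing two program versions through their trees with the longest common subsequence (LCS) method can be sketched in Python. This is only an illustration under stated assumptions, not Yang's algorithm or any tool's implementation: it uses Python's own `ast` module as the parser, flattens each tree into a preorder list of node type names, and the helper names and the similarity ratio are hypothetical.

```python
import ast


def node_sequence(source: str) -> list[str]:
    """Flatten a Python AST into a preorder list of node type names."""
    out: list[str] = []

    def visit(node: ast.AST) -> None:
        out.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))  # raises SyntaxError for invalid programs
    return out


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(src_a: str, src_b: str) -> float:
    """Share of nodes that lie on a common subsequence, in [0, 1]."""
    a, b = node_sequence(src_a), node_sequence(src_b)
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    print(similarity("x = a + b", "y = c + d"))            # identifiers renamed
    print(similarity("x = a + b", "x = a + b\nprint(x)"))  # statement inserted
```

Because only node types are compared, renamed identifiers do not lower the score, while inserted or deleted statements do. As with the grammar-based comparator described above, the sketch only works on syntactically correct input.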
Gitchell and Tran [63] developed Sim, which converts source programs to parse trees. Viewing parse trees as strings, the tool applies longest common subsequence and dynamic programming to assess similarity. Deckard by Jiang et al. [95] is based on computing characteristic vectors from the AST and clustering vectors which are close in Euclidean space by locality sensitive hashing. Deckard has been used in localizing the representation of clone groups [224] and to detect behaviorally similar code [104]. Another application of Deckard is to assess the impact of code clones on defects in source code [189]. Asta [54] is an AST-based tool which works on the principle of structural abstraction of arbitrary subtrees of an AST. ClemanX [180,181] is an incremental AST-based framework. The tool constructs characteristic vectors from AST subtrees and uses locality sensitive hashing. Saebjornsen et al. [200] also used the same set of techniques to detect clones in assembly code. Anti-unification is used in three studies [29,31,151] to calculate the distance between two ASTs and to group similar classes into one cluster. Anti-unification helps to discover common sub-expressions in source code represented as a tree. CloneDigger [31] is a language-independent tool in which anti-unification is applied to an XML representation of source code. CloneDetection, a tree-based tool by Wahler et al. [233], is based on frequent itemset mining applied to an XML representation of source code. Chilowicz et al. [355] developed a new technique to detect exact clones based on syntax tree fingerprinting. Shifting our focus to code clone management, CSeR (Code Segment Reuse) was developed by Jacob et al. [89] to check copy-and-paste induced clones in an integrated development environment. The tool was designed to compute clone differences interactively by checking whether some piece of code was copy-pasted as the programmer was editing and typing the code.
It works by converting the immediate clone to an AST and computing the difference with the original in a bi-directional manner using metrics like the Levenshtein distance. Biegel and Diehl [26] introduced a novel way for fast and configurable code clone detection using pipelines. They developed JCCD, a flexible and customisable AST-based clone detection tool in which several cascaded processors perform the various steps of the clone detection process. The JCCD API parallelizes the detection process using multiple cores.
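As an illustration of the Levenshtein distance mentioned above as a clone-difference metric, the following Python sketch computes the edit distance between a fragment and its edited copy. It is not CSeR's implementation: it compares whitespace-separated tokens rather than AST nodes, and the function name and sample fragments are assumptions made for illustration.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x by y, or keep if equal
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    original = "total = price * qty".split()
    pasted = "total = price * qty * tax".split()  # copy edited after pasting
    print(levenshtein(original, pasted))
```

The function accepts any two sequences, so the same code could be applied to a linearized AST instead of raw tokens.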