TL;DR: VulData7 is an extensible framework and dataset of real vulnerabilities, automatically collected from software archives, that retrieves fixes for 1,600 out of the 2,800 reported vulnerabilities of the 4 security critical open source systems.
Abstract: Studies on security vulnerabilities require the analysis, investigation and comprehension of real vulnerable code instances. However, collecting and experimenting with a sufficient number of such instances is challenging. To cope with this issue, we developed VulData7, an extensible framework and dataset of real vulnerabilities, automatically collected from software archives. The current version of the dataset contains all reported vulnerabilities (in the NVD database) of 4 security critical open source systems, i.e., Linux Kernel, WireShark, OpenSSL, SystemD. For each vulnerability, VulData7 provides the vulnerability report data (description, CVE number, CWE number, CVSS severity score and others), the vulnerable code instance (list of versions), and when available its corresponding patches (list of fixing commits) and the files (before and after fix). VulData7 is automated, flexible and easily extensible. Once configured, it extracts and links information from the related software archives (through Git and NVD reports) to create a dataset that is continuously updated with the latest information available. Currently, VulData7 retrieves fixes for 1,600 out of the 2,800 reported vulnerabilities of the 4 systems. The framework also supports the collection of additional software defects and aims at easing empirical studies and analyses. We believe that our framework is a valuable resource for both developers and researchers interested in secure software development. VulData7 can also serve educational purposes and trigger research on source code analysis. VulData7 is publicly available at: https://github.com/electricalwind/data7
TL;DR: This paper proposes a machine learning based approach for automating the validation process for automatic code clone validation and finds it has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges.
Abstract: A code clone is a pair of code fragments, within or between software systems, that are similar. Since code clones often negatively impact the maintainability of a software system, a great number of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, the clone detection tools work on the syntax level (such as texts, tokens, AST and so on) while lacking user-specific preferences. This often means the reported clones must be manually validated prior to any analysis in order to filter out the true positive clones from task or user-specific considerations. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with the existing related approaches for automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms and open source software systems.
TL;DR: This work explores a forward-looking approach that is able to infer groups of likely module dependencies that can anticipate architectural smells in a future system version, and focuses on dependency-related smells, such as Cyclic Dependency and Hub-like Dependency, which fit well with the link prediction model.
Abstract: Software systems naturally evolve, and this evolution often brings design problems that cause system degradation. Architectural smells are typical symptoms of such problems, and several of these smells are related to undesired dependencies among modules. The early detection of these smells is important for developers, because they can plan ahead for maintenance or refactoring efforts, thus preventing system degradation. Existing tools for identifying architectural smells can detect the smells once they exist in the source code. This means that their undesired dependencies are already created. In this work, we explore a forward-looking approach that is able to infer groups of likely module dependencies that can anticipate architectural smells in a future system version. Our approach considers the current module structure as a network, along with information from previous versions, and applies link prediction techniques (from the field of social network analysis). In particular, we focus on dependency-related smells, such as Cyclic Dependency and Hub-like Dependency, which fit well with the link prediction model. An initial evaluation with two open-source projects shows that, under certain considerations, the predictions of our approach are satisfactory. Furthermore, the approach can be extended to other types of dependency-based smells or metrics.
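To make the link prediction idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common-neighbours heuristic (one classic link prediction score from social network analysis), treats dependencies as undirected, and ignores information from previous versions. The function name and the module names are hypothetical.

```python
from itertools import combinations

def common_neighbor_scores(edges):
    """Score each unlinked module pair by its number of shared neighbours."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for a, b in combinations(sorted(neighbors), 2):
        if b in neighbors[a]:
            continue  # a dependency between a and b already exists
        shared = len(neighbors[a] & neighbors[b])
        if shared:
            scores[(a, b)] = shared
    return scores

# Hypothetical module dependency graph of the current version.
deps = [("ui", "core"), ("ui", "util"), ("db", "core"),
        ("db", "util"), ("net", "core")]
predicted = common_neighbor_scores(deps)
```

Pairs with higher scores are the more likely future dependencies; whether such a predicted dependency would close a cycle or inflate a hub would be checked in a separate step.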
TL;DR: ACRE takes a regular expression as input and performs 11 different checks on the regular expression, which are based on common mistakes, and has found errors in 283 out of 826 regular expressions.
Abstract: Regular expressions are extensively used to process strings. The regular expression language is concise, which makes it easy for developers to use but also makes it easy for developers to make mistakes. Since regular expressions are compiled at run-time, the regular expression compiler does not give any feedback on potential errors. This paper describes ACRE - Automatic Checking of Regular Expressions. ACRE takes a regular expression as input and performs 11 different checks on the regular expression. The checks are based on common mistakes. Among the checks are checks for incorrect use of character sets (enclosed by []), wildcards (represented by .), and line anchors (^ and $). ACRE has found errors in 283 out of 826 regular expressions. Each of the 11 checks found at least seven errors. The number of false reports is moderate: 46 of the regular expressions contained a false report. ACRE is simple to use: the user enters a regular expression and presses the check button. Any violations are reported back to the user with the incorrect portion of the regular expression highlighted. For 9 of the 11 checks, an example accepted string is generated that further illustrates the error.
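The wildcard mistake mentioned above can be illustrated with a small sketch. This is not ACRE's code: the function unescaped_dots is a hypothetical, simplified check that reports every '.' that is neither escaped nor inside a character set, and it does not handle edge cases such as a literal ']' placed first in a set.

```python
import re

def unescaped_dots(pattern):
    """Return the indices of '.' that are neither escaped nor inside [...]."""
    hits = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        elif ch == "." and not in_class:
            hits.append(i)
        i += 1
    return hits

# The pattern was probably meant to match the literal name "file.txt",
# but the unescaped wildcard also accepts other strings.
suspicious = unescaped_dots(r"file.txt")
accepted = re.fullmatch(r"file.txt", "fileXtxt") is not None
```

Here "fileXtxt" plays the role of an example accepted string that exposes the error: the pattern accepts it although the developer most likely did not intend that.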
TL;DR: This paper investigates whether applying transitive rules to the evolutionary couplings detected using the traditional mechanism can reveal additional couplings, and finds that transitive evolutionary couplings combined with the traditional ones provide 13.96% higher recall and 5.56% higher precision in detecting future co-change candidates when compared with a state-of-the-art technique.
Abstract: If two or more program entities (such as files, classes, methods) co-change (i.e., change together) frequently during software evolution, then it is likely that these entities are coupled (i.e., the entities are related). Such a coupling is termed evolutionary coupling in the literature. The concept of traditional evolutionary coupling restricts us to assume coupling among only those entities that changed together in the past. The entities that did not co-change in the past might also have coupling. However, such couplings cannot be retrieved using the current concept of detecting evolutionary coupling in the literature. In this paper, we investigate whether we can detect such couplings by applying transitive rules on the evolutionary couplings detected using the traditional mechanism. We call the couplings that we detect using our proposed mechanism transitive evolutionary couplings. According to our research on thousands of revisions of four subject systems, transitive evolutionary couplings combined with the traditional ones provide us with 13.96% higher recall and 5.56% higher precision in detecting future co-change candidates when compared with a state-of-the-art technique.
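As a rough illustration of the transitive idea, the sketch below infers a coupling between A and C when A co-changed with B and B co-changed with C, but A and C never changed together. The abstract does not spell out the paper's actual transitive rules or thresholds, so the single rule used here, the function names and the file names are all assumptions.

```python
from itertools import combinations

def direct_couplings(commits):
    """Collect every pair of files that changed together in some commit."""
    pairs = set()
    for files in commits:
        for a, b in combinations(sorted(set(files)), 2):
            pairs.add((a, b))
    return pairs

def transitive_couplings(pairs):
    """Infer pairs linked through a shared partner but never co-changed."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    inferred = set()
    for linked in neighbors.values():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in pairs:
                inferred.add((a, c))
    return inferred

# Hypothetical change history: each inner list is one commit.
history = [["A.java", "B.java"], ["B.java", "C.java"], ["C.java", "D.java"]]
direct = direct_couplings(history)
inferred = transitive_couplings(direct)
```

In this toy history, A.java and C.java never co-changed, yet both co-changed with B.java, so the pair is reported as a transitive candidate.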
TL;DR: In this paper, a multinomial Naive Bayes (MNB) classifier is employed which is trained using Stack Overflow posts to identify the programming language of code snippets written in 21 different programming languages.
Abstract: Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. A Multinomial Naive Bayes (MNB) classifier is employed which is trained using Stack Overflow posts. It is shown to achieve an accuracy of 75% which is higher than that with Programming Languages Identification (PLI-a proprietary online classifier of snippets) whose accuracy is only 55.5%. The average score for precision, recall and the F1 score with the proposed tool are 0.76, 0.75 and 0.75, respectively. In addition, it can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version such as C# 3.0, C# 4.0 and C# 5.0.
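For readers unfamiliar with the classifier family, here is a minimal from-scratch sketch of a Multinomial Naive Bayes classifier with Laplace smoothing. It is not the SCC tool: the tokenizer, the class name TinyMNB and the four training snippets are hypothetical, whereas the paper trains on Stack Overflow posts covering 21 languages.

```python
import math
import re
from collections import Counter, defaultdict

# Identifiers and keywords, or single punctuation characters; digits are ignored.
TOKEN = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")

def tokenize(snippet):
    return TOKEN.findall(snippet)

class TinyMNB:
    def fit(self, snippets, labels):
        self.token_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(tokenize(snippet))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, snippet):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # Log prior plus Laplace-smoothed log likelihood of each known token.
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokenize(snippet):
                if tok in self.vocab:
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy training set.
train = [
    ("def add(a, b): return a + b", "python"),
    ("import os\nprint(os.getcwd())", "python"),
    ("public static void main(String[] args) { }", "java"),
    ("System.out.println(x);", "java"),
]
model = TinyMNB().fit([s for s, _ in train], [l for _, l in train])
guess_py = model.predict("def greet(name): print(name)")
guess_java = model.predict("public void run() { System.out.println(y); }")
```

Tokens unseen during training are skipped at prediction time, which is one common design choice; a production classifier would also need far more training data per language.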
TL;DR: An algorithmic foundation for tool support to identify composite commits is proposed, and it is found that the algorithm can determine whether or not a commit is composite.
Abstract: Composite commits are a common mistake in the use of version control software. A composite commit groups many unrelated tasks, rendering the commit difficult for developers to understand, revert, or integrate and for empirical researchers to analyse. We propose an algorithmic foundation for tool support to identify such composite commits. Our algorithm computes both a program dependence graph and the changes to the abstract syntax tree for the files that have been changed in a commit. Our algorithm then groups these fine-grained changes according to the slices through the dependence graph they belong to. To evaluate our technique, we analyse and refine an established dataset of Java commits, the results of which we also make available. We find that our algorithm can determine whether or not a commit is composite. For the majority of commits, this analysis takes but a few seconds. The parts of a commit that our algorithm identifies do not map directly to the commit's tasks. The parts tend to be smaller, but stay within their respective tasks.
TL;DR: Graal is presented, which empowers users with a customizable, scalable and incremental approach to conduct source code analysis and enables relating the obtained results with other software project data.
Abstract: Source code analysis tools are designed to analyze code artifacts with different intents, which span from improving the quality and security of the software to easing refactoring and reverse engineering activities. However, most tools do not come with features to periodically schedule their analysis or to be executed on a battery of repositories, and lack support to combine their results with other analysis tools. Thus, researchers and practitioners are often forced to develop ad-hoc scripts to meet their needs. This comes at the risk of obtaining wrong results (because of the lack of testing) and of hindering replication by other research teams. In addition, the resulting scripts are often not meant to be customized nor designed for incrementality, scalability and extensibility. In this paper we present Graal, which empowers users with a customizable, scalable and incremental approach to conduct source code analysis and enables relating the obtained results with other software project data. Graal leverages on and extends the functionalities of GrimoireLab, a strong free software tool developed by Bitergia, a company devoted to offer commercial software development analytics, and part of the CHAOSS project of the Linux Foundation.
TL;DR: This paper reports experiences of building a toolchain that started as a simple tool and evolved over the years into a powerful toolchain composed of many small tools connected by scripts and communicating via files, and presents design decisions made and lessons learned, both positive and negative ones.
Abstract: Highly configurable software systems allow the efficient and reliable development of similar software variants based on a common code base. The C preprocessor CPP, which uses source code annotations that enable conditional compilation, is a simple yet powerful text-based tool for implementing such systems. However, since annotations interfere with the actual source code, the CPP has often been accused of being a source of errors and increased maintenance effort. In our research, we have been curious about whether high-level patterns of CPP misuse (i.e., code smells) can be identified, how they evolve, and whether they really hinder maintenance. To support this research, we started a simple tool which over the years evolved into a powerful toolchain. This evolution was possible because our toolchain is not monolithic, but is composed of many small tools connected by scripts and communicating via files. Moreover, we reused existing tools whenever possible and developed our own solutions only as a last resort. In this paper, we report our experiences of building this toolchain. In particular, we present design decisions we made and lessons learned, both positive and negative ones. We hope that this stimulates discussion and (in the best case) attracts more researchers to use our tools. Beyond that, we also want to encourage others to put emphasis on building tools instead of considering them "yet another research prototype".
TL;DR: In this paper, the authors present an approach to automatically translate critical sections of high level Java bytecode to C code, so that more effective obfuscations can be resorted to, while a developer can still work with a single programming language, i.e., Java.
Abstract: Code obfuscation is a popular approach to make program comprehension and analysis harder, with the aim of mitigating threats related to malicious reverse engineering and code tampering. However, programming languages that compile to high level bytecode (e.g., Java) can be obfuscated only to a limited extent. In fact, high level bytecode still contains high level relevant information that an attacker might exploit. In order to enable more resilient obfuscations, part of these programs might be implemented with programming languages (e.g., C) that compile to low level machine-dependent code. In fact, machine code contains and leaks less high level information and it enables more resilient obfuscations. In this paper, we present an approach to automatically translate critical sections of high level Java bytecode to C code, so that more effective obfuscations can be resorted to. Moreover, a developer can still work with a single programming language, i.e., Java.
TL;DR: Results show that method behavior with respect to stereotype is highly stable and constant over time.
Abstract: A study of how method roles evolve during the lifetime of a software system is presented. Evolution is examined by analyzing when the stereotype of a method changes. Stereotypes provide a high-level categorization of a method's behavior and role, and also provide insight into how a method interacts with its environment and carries out tasks. The study covers 50 open-source systems and 6 closed-source systems. Results show that method behavior with respect to stereotype is highly stable and constant over time. Overall, out of all the history examined, only about 10% of changes to methods result in a change in their stereotype. Examples of methods that change stereotype are further examined. A select number of these types of changes are indicators of code smells.
TL;DR: This paper highlights several issues met when blindly chaining different kinds of obfuscation and optimization passes, emphasizing the need for a formal model to combine them, and proposes a non-intrusive formalism to leverage on sequential pass management techniques.
Abstract: Code obfuscation is the de facto standard to protect intellectual property when delivering code in an unmanaged environment. It relies on additive layers of code tangling techniques, white-box encryption calls and platform-specific or tool-specific countermeasures to make it harder for a reverse engineer to access critical pieces of data or to understand core algorithms. The literature provides plenty of different obfuscation techniques that can be used at compile time to transform data or control flow in order to provide some kind of protection against different reverse engineering scenarii. Scheduling code transformations to optimize a given metric is known as the pass scheduling problem, a problem known to be NP-hard, but solved in a practical way using hard-coded sequences that are generally satisfactory. Adding code obfuscation to the problem introduces two new dimensions. First, as a code obfuscator needs to find a balance between obfuscation and performance, pass scheduling becomes a multi-criteria optimization problem. Second, obfuscation passes transform their inputs in unconventional ways, which means some pass combinations may not be desirable or even valid. This paper highlights several issues met when blindly chaining different kinds of obfuscation and optimization passes, emphasizing the need for a formal model to combine them. It proposes a non-intrusive formalism to leverage on sequential pass management techniques. The model is validated on real-world scenarii gathered during the development of an industrial-strength obfuscator on top of the LLVM compiler infrastructure.
TL;DR: The rationale behind the CoRA tool is presented, followed by a tool overview and its implementation details, and an example use case shows how the tool is used to locate clones of a particular feature.
Abstract: As part of a module re-unification project of an industrial partner's code, spanning one system and two derivative systems, the feature-clone variants across these systems have to be extracted, to be later re-unified as singular code elements for re-use. To assist developers with this task, the CoRA (The Code Re-unification Application) tool was designed and implemented. The approach, and the subsequent design of the tool, were derived from reflection on manual feature-location/clone-detection efforts on the company's systems, in the first phase of an action research cycle where the approach/implementation will be iteratively trialled, and subsequently refined, in-situ. A pilot study is discussed that leads to the proposed tool. The tool combines a hybrid (textual-static) feature location technique and a textual clone detection technique for feature-clone identification. In this paper, the rationale behind the CoRA tool is presented, followed by a tool overview and its implementation details. Finally, an example use case shows how the tool is used to locate clones of a particular feature.
TL;DR: The idea to integrate syntax-based clone detection into workbenches for language engineering, which comes as a free byproduct of the grammar specification, is explored.
Abstract: Developers often practice re-use by copying and pasting code. Copied and pasted code is also known as clones. Clones may be found in all programming languages. Automated clone detection may help to detect clones in order to support software maintenance and language design. Syntax-based clone detectors find similar syntax subtrees and, hence, are guaranteed to yield only syntactic clones. They are also known to have high precision and good recall. Developing a syntax-based clone detector for each language from scratch may be an expensive task. In this paper, we explore the idea to integrate syntax-based clone detection into workbenches for language engineering. Such workbenches allow developers to create their own domain-specific language or to create parsers for existing languages. With the integration of clone detection into these workbenches, a clone detector comes as a free byproduct of the grammar specification. The effort is spent only once for the workbench and not multiple times for every language built with the workbench. We report our lessons learned in applying this idea for three language workbenches: the popular parser generator ANTLR and two language workbenches for domain-specific languages, namely, MPS, developed by JetBrains, and Xtext, which is based on the Eclipse Modeling Framework.
TL;DR: Two new Frama-C plug-ins, RECKA for automatic annotation of CUDA kernel arguments with the restrict keyword, and RPromF for scalar replacement in OpenACC and OpenMP 4.0/4.5 codes for GPU are presented.
Abstract: Pointer aliasing still hinders compiler optimizations. The ISO C99 standard added the restrict keyword, which allows the programmer to specify non-aliasing as an aid to the compiler's optimizer. The task of annotating pointers with the restrict keyword is still left to the programmer, and this task is, in general, tedious and prone to errors. Scalar replacement is an optimization widely used by compilers. In this paper, we present two new Frama-C plug-ins, RECKA for automatic annotation of CUDA kernel arguments with the restrict keyword, and RPromF for scalar replacement in OpenACC and OpenMP 4.0/4.5 codes for GPU. More specifically, RECKA works as follows: (i) an alias analysis is performed on CUDA kernels and their callers; (ii) if no alias is found, then the CUDA kernels are cloned, the clones are renamed and their arguments are annotated with the restrict qualifier; and (iii) instructions are added to kernel call sites to perform at runtime a less-than check analysis on the kernels' actual parameters and determine whether the clone or the original one must be called. RPromF includes five main steps: (i) OpenACC/OpenMP offloading regions are identified; (ii) functions containing these offloading codes and their callers are analyzed to check that there is no alias; (iii) if there is no alias then the offloading codes are cloned; (iv) the clone's instructions are analyzed to retrieve data reuse information and perform scalar replacement; and (v) instructions are added to be able to use the optimized clone whenever possible. We have evaluated the two plug-ins on the PolyBench benchmark suite. The results show that both scalar replacement and the usage of the restrict keyword are effective for improving the overall performance of OpenACC, OpenMP 4.0/4.5 and CUDA codes.
TL;DR: ATARI is presented, a new adaptive approach to association rule mining that considers a dynamic selection of the relevant transactions, which can be viewed as a further constrained version of targeted association rule mining, in which as few as a single transaction might be considered when determining change impact.
Abstract: As the complexity of a software system grows, it becomes increasingly difficult for developers to be aware of all the dependencies that exist between artifacts (e.g., files or methods) of the system. Change impact analysis helps to overcome this problem, as it recommends to a developer relevant source-code artifacts related to her current changes. Association rule mining has shown promise in determining change impact by uncovering relevant patterns in the system's change history. State-of-the-art change impact mining algorithms typically make use of a change history of tens of thousands of transactions. For efficiency, targeted association rule mining focuses on only those transactions potentially relevant to answering a particular query. However, even targeted algorithms must consider the complete set of relevant transactions in the history. This paper presents ATARI, a new adaptive approach to association rule mining that considers a dynamic selection of the relevant transactions. It can be viewed as a further constrained version of targeted association rule mining, in which as few as a single transaction might be considered when determining change impact. Our investigation of adaptive change impact mining empirically studies seven algorithm variants. We show that adaptive algorithms are viable, can be just as applicable as the state-of-the-art complete-history algorithms, and even outperform them for certain queries. However, more important than the direct comparison, our investigation lays necessary groundwork for the future study of adaptive techniques and their application to challenges such as the on-the-fly style of impact analysis that is needed at the GitHub-scale.
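The targeted mining step that ATARI builds on can be sketched as follows. This is a simplified illustration, not ATARI itself: it keeps only the transactions that share at least one file with the query and ranks the other files by how often they appear there. The adaptive selection of transactions is not modelled, and the function name, the confidence definition and the file names are assumptions.

```python
from collections import Counter

def targeted_rules(history, query):
    """Map each candidate file to (support, confidence) w.r.t. the query."""
    query = set(query)
    # Targeted mining: discard transactions unrelated to the query.
    relevant = [set(t) for t in history if query & set(t)]
    support = Counter()
    for transaction in relevant:
        for candidate in transaction - query:
            support[candidate] += 1
    return {f: (n, n / len(relevant)) for f, n in support.items()}

# Hypothetical change history: each inner list is one transaction (commit).
history = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["d", "e"]]
impact = targeted_rules(history, ["a"])
```

In this toy history, a developer changing "a" would be pointed to "b" and "c", each seen in two of the three relevant transactions, while "d" and "e" are never considered.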
TL;DR: Experiments on several real multithreaded data-processing applications show that POI succeeded in reducing, on average, about 37% of race detection overheads, which the load-distribution policy of Parallel FastTrack would impose.
Abstract: Multithreaded programs are prone to dataraces. Dataraces are known to be hard to detect and reproduce by manual effort, although they often have detrimental effects on program reliability. Automated techniques are thus demanded for detecting dataraces efficiently and precisely. Many datarace detectors have been proposed so far, among which dynamic ones are promising because of their precision. However, existing dynamic race detectors incur high race-checking overheads. Even a state-of-the-art dynamic race detector, called Parallel FastTrack, fails to efficiently detect races under certain conditions, despite its attempt to parallelize race detection for efficiency. In this paper, we propose an efficient and precise parallel race detector. For our proposal, we first experimentally reveal that the load-distribution policy of Parallel FastTrack tends to skew race-checking loads to a few detection threads. We then present a simple but effective technique, called POI, for balancing race-checking loads among detection threads. POI takes race-checking loads of each detection thread into account and reduces the load skew by making each detection thread manage almost the same number of memory addresses to be checked. Experiments on several real multithreaded data-processing applications show that POI succeeded in reducing, on average, about 37% of race detection overheads, which the load-distribution policy of Parallel FastTrack would impose.
TL;DR: This paper applies fine-grained slicing techniques to the models generated from the Rebel modeling language before passing them on to an SMT solver, allowing us to verify larger problem instances and with higher path bounds than with unsliced models.
Abstract: In this paper, we apply fine-grained slicing techniques to the models generated from the Rebel modeling language before passing them on to an SMT solver. We show that our slicing techniques have a significant positive effect on performance, allowing us to verify larger problem instances and with higher path bounds than with unsliced models. For small and shallow instances, however, the overhead of slicing dominates verification time, and slicing should not be resorted to.
TL;DR: The paper identifies open problems that have yet to receive significant attention from the scientific community, yet which have potential for profound real world impact, are ripe for exploration, and would make excellent topics for research projects.
Abstract: This paper describes some of the challenges and opportunities when deploying static and dynamic analysis at scale, drawing on the authors' experience with the Infer and Sapienz Technologies at Facebook, each of which started life as a research-led start-up that was subsequently deployed at scale, impacting billions of people worldwide. The paper identifies open problems that have yet to receive significant attention from the scientific community, yet which have potential for profound real world impact, formulating these as research questions that, we believe, are ripe for exploration and that would make excellent topics for research projects. Note: This paper accompanies the authors' joint keynote at the 18th IEEE International Working Conference on Source Code Analysis and Manipulation, September 23rd-24th, 2018 - Madrid, Spain.
TL;DR: This work encodes Java methods in code repositories into path constraints via symbolic analysis and leverages SMT solvers to find the methods whose path constraints can satisfy the given input/output examples for the Java language.
Abstract: As the quality and quantity of open source code increase, semantics-based code search has become an emerging need for software developers to retrieve and reuse existing source code. We present an approach of semantics-based code search using input/output examples for the Java language. Our approach encodes Java methods in code repositories into path constraints via symbolic analysis and leverages SMT solvers to find the methods whose path constraints can satisfy the given input/output examples. Our approach extends the applicability of the semantics-based search technology to more general Java code compared with existing methods. To evaluate our approach, we encoded 1228 methods from GitHub and applied semantics-based code search on 35 queries extracted from Stack Overflow. Correct method code for 29 queries was obtained during the search and the average search time was just about 48 seconds.
TL;DR: The engineering aspects of an open source automated refactoring tool called Optimize Streams are described; the tool assists developers in writing optimal stream software in a semantics-preserving fashion and is implemented as a plug-in to the popular Eclipse IDE, using both the WALA and SAFE frameworks.
Abstract: Streaming APIs are pervasive in mainstream Object-Oriented languages and platforms. For example, the Java 8 Stream API allows for functional-like, MapReduce-style operations in processing both finite, e.g., collections, and infinite data structures. However, using this API efficiently involves subtle considerations like determining when it is best for stream operations to run in parallel, when running operations in parallel can be less efficient, and when it is safe to run in parallel due to possible lambda expression side-effects. In this paper, we describe the engineering aspects of an open source automated refactoring tool called Optimize Streams that assists developers in writing optimal stream software in a semantics-preserving fashion. Based on a novel ordering and typestate analysis, the tool is implemented as a plug-in to the popular Eclipse IDE, using both the WALA and SAFE frameworks. The tool was evaluated on 11 Java projects consisting of ~642 thousand lines of code, where we found that 36.31% of candidate streams were refactorable, and an average speedup of 1.55 on a performance suite was observed. We also describe experiences gained from integrating three very different static analysis frameworks to provide developers with an easy-to-use interface for optimizing their stream code to its full potential.
TL;DR: A block-based programming language and environment focused on usability, learnability, and understandability is created, its programming environment is embedded in a state-of-the-art robot simulator, and concrete insights gained via longitudinal usage are discussed.
Abstract: Many robotic tasks in small manufacturing sites are quite simple. For example, a pick and place task requires only a few common commands. Unfortunately, the standard languages and programming environments for industrial robots are complex, making even these simple tasks nearly impossible for novices. To enable novices to program simple tasks we created a block-based programming language and environment focused on usability, learnability, and understandability and embedded its programming environment in a state-of-the-art robot simulator. By using this high-fidelity prototype over the course of a year in a case study, a user study, and for countless demonstrations we have gained many concrete insights. In this paper we discuss the details of the language, the design of its programming environment, and concrete insights gained via longitudinal usage.
TL;DR: This paper extracts periodic experience metrics capturing developers' previous activities on source files, investigates the explanatory effect of these metrics on defects, and calculates both periodic developer experience metrics and churn metrics at two granularity levels: file level and commit level.
Abstract: Defect prediction studies have proposed several data-driven approaches, and recently, this field has put more emphasis on whether the people factor is associated with software defects. Developer metrics can capture experience, code ownership, coding skills and techniques, and commit activities. These metrics have so far been measured at a specified snapshot of the codebase, although a developer's knowledge of a source module could change over time. In this paper, we propose to measure periodic developer experience with regard to contextual knowledge on files and directories. We extract periodic experience metrics capturing the previous activities of developers on source files and investigate the explanatory effect of these metrics on defects. We also use activity-based (churn) metrics to observe the performance of both metric types on defect prediction. We used two large-scale open source projects, Lucene and Jackrabbit, for model evaluation. We calculate periodic developer experience metrics and churn metrics at two granularity levels: file level and commit level. We build the models using five popular machine learning algorithms in the defect prediction literature. The models with the two best performing algorithms are assessed in terms of Precision, Recall, False Positive Rate, and F-measure. The set of metrics that best explains software defects is also identified using a correlation-based feature selection method. Results show that periodic developer experience metrics extracted at file level are good metrics for defect prediction when accompanied by churn metrics. When there is not enough data to extract the contextual knowledge of developers on source files, churn metrics play an important role in defect prediction.
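The four assessment measures named in the abstract above can be sketched from a confusion matrix as follows. This is a minimal illustration using the standard textbook definitions, not the authors' evaluation code; the function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute Precision, Recall, False Positive Rate, and F-measure.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives (e.g. files or commits
    predicted defective vs. actually defective).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # F-measure here is the harmonic mean of precision and recall (F1).
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fpr,
        "f_measure": f_measure,
    }


# Hypothetical counts, chosen only for illustration.
example = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```

A zero denominator is mapped to 0.0 here as a convenience; other conventions (such as reporting the value as undefined) are equally defensible.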
TL;DR: This paper systematically compares five widely adopted static algorithms - implemented by the npm call graph, IBM WALA, Google Closure Compiler, Approximate Call Graph, and Type Analyzer for JavaScript tools - for building JavaScript call graphs on 26 WebKit SunSpider benchmark programs and 6 real-world Node.js modules.
Abstract: The popularity and wide adoption of JavaScript both at the client and server side makes its code analysis more important than ever before. Most of the algorithms for vulnerability analysis, coding issue detection, or type inference rely on the call graph representation of the underlying program. Despite some obvious advantages of dynamic analysis, static algorithms should also be considered for call graph construction as they do not require extensive test beds for programs and their costly execution and tracing. In this paper, we systematically compare five widely adopted static algorithms - implemented by the npm call graph, IBM WALA, Google Closure Compiler, Approximate Call Graph, and Type Analyzer for JavaScript tools - for building JavaScript call graphs on 26 WebKit SunSpider benchmark programs and 6 real-world Node.js modules. We provide a performance analysis as well as a quantitative and qualitative evaluation of the results. We found that there was a relatively large intersection of the found call edges among the algorithms, which proved to be 100% precise. However, most of the tools found edges that were missed by all others. ACG had the highest precision followed immediately by TAJS, but ACG found significantly more call edges. As for the combination of tools, ACG and TAJS together covered 99% of the true edges found by all algorithms, while maintaining a precision as high as 98%. Only two of the tools were able to analyze up-to-date multi-file Node.js modules due to incomplete language feature support. They agreed on almost 60% of the call edges, but each of them found valid edges that the other missed.
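The edge-level comparison described in the abstract above (common edges, edges found by only one tool, and precision against validated edges) can be sketched with plain set operations. This is an assumed, simplified model in which a call edge is a `(caller, callee)` pair; the tool names, edges, and helper functions below are hypothetical and do not come from the paper's data.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def edge_precision(found: Set[Edge], true_edges: Set[Edge]) -> float:
    """Fraction of reported edges that are in the validated set."""
    return len(found & true_edges) / len(found) if found else 0.0


def edges_unique_to(tool: str, results: Dict[str, Set[Edge]]) -> Set[Edge]:
    """Edges reported by `tool` and by no other tool."""
    others: Set[Edge] = set()
    for name, edges in results.items():
        if name != tool:
            others |= edges
    return results[tool] - others


# Hypothetical results from two tools and a hand-validated edge set.
results = {
    "tool_a": {("f", "g"), ("g", "h")},
    "tool_b": {("f", "g"), ("f", "h")},
}
validated = {("f", "g"), ("g", "h")}

common = set.intersection(*results.values())      # edges every tool agrees on
combined = set.union(*results.values())           # edges found by any tool
precision_a = edge_precision(results["tool_a"], validated)
precision_b = edge_precision(results["tool_b"], validated)
only_a = edges_unique_to("tool_a", results)
```

Combining tools corresponds to taking the union of their edge sets, which can raise coverage of true edges at the cost of admitting each tool's false positives.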
TL;DR: In this paper, a machine learning-based approach was proposed to automatically extract sources and sinks from arbitrary Java libraries, exploiting several different features based on semantic, syntactic, intra-procedural dataflow and class-hierarchy traits embedded into the bytecode.
Abstract: In the last decade, data security has become a primary concern for an increasing number of companies around the world. Protecting the customer's privacy is now at the core of many businesses operating in any kind of market. Thus, the demand for new technologies to safeguard user data and prevent data breaches has increased accordingly. In this work, we investigate a machine learning-based approach to automatically extract sources and sinks from arbitrary Java libraries. Our method exploits several different features based on semantic, syntactic, intra-procedural dataflow and class-hierarchy traits embedded into the bytecode to distinguish sources and sinks. The performed experiments show that, under certain conditions and after some preprocessing, sources and sinks across different libraries share common characteristics that allow a machine learning model to distinguish them from the other library methods. The prototype model achieved remarkable results of 86% accuracy and 81% F-measure on our validation set of roughly 600 methods.
TL;DR: This paper proposes a novel model CroLSim, able to detect similar software applications across different programming languages with a mean average precision rate of 0.65 and an average confidence rate of 3.6, which outperforms all related existing approaches with a significant performance improvement.
Abstract: In today's open source era, developers look for similar software applications in source code repositories for a number of reasons, including exploring alternative implementations, reusing source code, or looking for a better application. However, while there are a great many studies for finding similar applications written in the same programming language, there is a marked lack of studies for finding similar software applications written in different languages. In this paper, we fill the gap by proposing a novel model, CroLSim, which is able to detect similar software applications across different programming languages. In our approach, we use the API documentation to find relationships among the API calls used by the different programming languages. We adopt a deep learning based word-vector learning method to identify semantic relationships among the API documentation, which we then use to detect cross-language similar software applications. For evaluating CroLSim, we formed a repository consisting of 8,956 Java, 7,658 C#, and 10,232 Python applications collected from GitHub. We observed that CroLSim can successfully detect similar software applications across different programming languages with a mean average precision rate of 0.65 and an average confidence rate of 3.6 (out of 5) with 75% high rated successful queries, which outperforms all related existing approaches with a significant performance improvement.
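For readers unfamiliar with the mean average precision figure reported in the abstract above, the following is a minimal sketch of one common definition: average precision over the ranks at which relevant results appear, averaged across queries. The abstract does not state which variant the authors used, so this is an assumption, and the relevance lists below are hypothetical.

```python
from typing import List, Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision for one ranked result list.

    `relevance` holds 1 for a relevant result and 0 otherwise,
    in rank order. Precision is sampled at each relevant rank and
    averaged over the relevant results retrieved.
    """
    hits = 0
    precisions: List[float] = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries: Sequence[Sequence[int]]) -> float:
    """Mean of per-query average precision."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


# Two hypothetical queries: relevant results at ranks 1 and 3, then at rank 2.
score = mean_average_precision([[1, 0, 1], [0, 1]])
```

Other variants divide by the total number of relevant items in the collection rather than the number retrieved, which gives lower scores when relevant items are missed.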
TL;DR: This paper describes a model and implementation for cross translation unit symbolic execution for C family languages; the solution proved to be scalable to large codebases, and the number of findings increased significantly for the evaluated projects.
Abstract: Static analysis is a great approach to find bugs and code smells. Some of the errors span across multiple translation units. Unfortunately, separate compilation makes cross translation unit analysis challenging for C family languages. In this paper, we describe a model and an implementation for cross translation unit symbolic execution for C family languages. We were able to extend the scope of the analysis without modifying any of the existing checkers. The analysis is implemented in the open source Clang compiler. We also measured the performance of the approach and the quality of the reports. The solution proved to be scalable to large codebases and the number of findings increased significantly for the evaluated projects. The implementation is already accepted into mainline Clang.