TL;DR: By analysing method implementations taken from a corpus of Java applications, an automatically generated, domain-neutral lexicon of verbs, similar to a natural language dictionary, that represents the common usages of many programmers is established.
Abstract: Method names make or break abstractions: good ones communicate the intention of the method, whereas bad ones cause confusion and frustration. The task of naming is subject to the whims and idiosyncrasies of the individual since programmers have little to guide them except their personal experience. By analysing method implementations taken from a corpus of Java applications, we establish the meaning of verbs in method names based on actual use. The result is an automatically generated, domain-neutral lexicon of verbs, similar to a natural language dictionary, that represents the common usages of many programmers.
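As a rough illustration of relating verbs to actual use, the sketch below groups methods by the leading word of their camelCase name and records how often methods with that verb return a value. The assumption that the first word is the verb, the `(name, returns_value)` input format, and the function names are all hypothetical; the abstract does not say which implementation attributes the authors analyse.

```python
import re
from collections import defaultdict


def leading_verb(method_name):
    """Return the first camelCase word of a method name, lower-cased.

    Assumption: the first word of the name is its verb.
    """
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", method_name)
    return words[0].lower() if words else ""


def verb_profiles(methods):
    """Map each verb to the fraction of its methods that return a value.

    `methods` is an iterable of (name, returns_value) pairs, a hypothetical
    stand-in for attributes extracted from real method implementations.
    """
    counts = defaultdict(lambda: [0, 0])  # verb -> [total, returning]
    for name, returns_value in methods:
        verb = leading_verb(name)
        if not verb:
            continue
        counts[verb][0] += 1
        counts[verb][1] += int(returns_value)
    return {verb: returning / total
            for verb, (total, returning) in counts.items()}


if __name__ == "__main__":
    corpus = [("getName", True), ("getAge", True),
              ("setName", False), ("isEmpty", True)]
    print(verb_profiles(corpus))
```

A real lexicon would use many more attributes per verb; a single attribute is used here only to keep the sketch short.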
TL;DR: This work presents an approach to tool-based analysis of the quality of systems that employ embedded SQL queries, and defines a suite of metric extractors for embedded queries.
Abstract: The access of information systems to underlying relational databases is commonly programmed using embedded SQL queries. Such embedded queries may take the form of string literals that are programmatically concatenated into queries to be submitted to the DBMS, or they may be written in a mixture of the syntax of SQL and a host programming language. The particular ways in which embedded queries are constructed and intertwined with the surrounding code can have significant impact on the understandability, testability, adaptability, and other quality aspects of the overall system. We present an approach to tool-based analysis of the quality of systems that employ embedded SQL queries. The basis of the approach is the identification and reconstruction of embedded queries. These queries are then submitted to a variety of analyses. For example, we chart the relationships of queries to the surrounding code and, via the control and data flow of that code, to each other. Also, we define a suite of metric extractors for embedded queries. Through a number of case studies, involving PL/SQL, Cobol, and Visual Basic, we show how the results of these analyses can be employed to make an assessment of various quality aspects related to the use of embedded queries.
TL;DR: This paper presents a methodology for expanding abbreviated identifiers into full words, evaluates it on a code base of just over 35 million lines of code, and uses it to improve the syntactic identification of violations of Deissenbock and Pizka's rules for concise and consistent identifier construction.
Abstract: Informative identifiers are made up of full (natural language) words and (meaningful) abbreviations. Readers of programs typically have little trouble understanding the purpose of identifiers composed of full words. In addition, those familiar with the code can (most often) determine the meaning of abbreviations used in identifiers. However, when faced with unfamiliar code, abbreviations often carry little useful information. Furthermore, tools that focus on the natural language used in the code have a hard time in the presence of abbreviations. One approach to providing meaning to programmers and tools is to translate (expand) abbreviations into full words. This paper presents a methodology for expanding identifiers and evaluates the process on a code base of just over 35 million lines of code. For example, using phrase extraction, fs_exists is expanded to file_status_exists, illustrating how the expansion process can facilitate comprehension. On average, 16 percent of the identifiers in a program are expanded. Finally, as an example application, the approach is used to improve the syntactic identification of violations of Deissenbock and Pizka's rules for concise and consistent identifier construction.
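The fs_exists example can be reproduced with a minimal sketch. It assumes a fixed, hand-written abbreviation table (the hypothetical `ABBREVIATIONS` below) rather than the phrase extraction the abstract mentions, so it only illustrates the expansion step itself.

```python
# Hypothetical abbreviation table; the paper derives expansions from the
# code itself (e.g. via phrase extraction), which is not modelled here.
ABBREVIATIONS = {
    "fs": ["file", "status"],
    "buf": ["buffer"],
    "cnt": ["count"],
}


def expand_identifier(identifier, abbreviations=ABBREVIATIONS):
    """Expand each underscore-separated part that has a known expansion."""
    expanded = []
    for part in identifier.split("_"):
        # Parts without an entry in the table are kept unchanged.
        expanded.extend(abbreviations.get(part.lower(), [part]))
    return "_".join(expanded)


if __name__ == "__main__":
    print(expand_identifier("fs_exists"))
```

Parts that are already full words, such as `exists`, pass through untouched, which matches the observation that only some identifiers are expanded.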
TL;DR: This paper proposes a data mining framework to help researchers cope with the large amount of data produced by clone detection tools, and proposes techniques to reduce, abstract and highlight the most interesting data these tools produce.
Abstract: Clones are code segments that have been created by copying-and-pasting from other code segments. Clones occur often in large software systems. It is reported that 5 to 50% of the source code of a large software system is cloned. A major challenge when studying code cloning in large software systems is handling the large amount of clone candidates produced by leading edge clone detection tools. For example, the CCFinder clone detection tool produces over 7 million pairs of clone candidates for the Linux kernel (which consists of over 4 MLOC). Moreover, the output of clone detection tools grows rapidly as a software system evolves. Researchers and developers need tools to help them study the large amount of clone data in order to better understand the clone phenomena in large systems. In this paper, we propose a data mining framework to help researchers cope with the large amount of data produced by clone detection tools. We propose techniques to reduce, abstract and highlight the most interesting data produced by clone detection tools. Our framework also introduces a visualization tool which allows users to query and explore clone data at various abstraction levels. We demonstrate our framework on a case study of the clone phenomena in the Linux kernel.
TL;DR: This paper presents a generic bytecode instrumentation framework that goes beyond the restrictions of the standard JDK mechanisms and enables the customized, dynamic instrumentation of all classes in pure Java. It addresses important issues, such as bootstrapping an instrumented JDK, as well as avoiding measurement perturbations due to dynamic instrumentation or execution of instrumentation code.
Abstract: Java bytecode instrumentation is a widely used technique, especially for profiling purposes. In order to ensure the instrumentation of all classes in the system, including dynamically generated or downloaded code, instrumentation has to be performed at runtime. The standard JDK offers some mechanisms for dynamic instrumentation, which however either require the use of native code or impose severe restrictions on the instrumentation of certain core classes of the JDK. These limitations prevent several instrumentation techniques that are important for efficient, calling context-sensitive profiling. In this paper we present a generic bytecode instrumentation framework that goes beyond these restrictions and enables the customized, dynamic instrumentation of all classes in pure Java. Our framework addresses important issues, such as bootstrapping an instrumented JDK, as well as avoiding measurement perturbations due to dynamic instrumentation or execution of instrumentation code. We validated and evaluated our framework using an instrumentation for exact profiling which generates complete calling context trees of various platform-independent dynamic metrics.
TL;DR: This paper proposes a systematic strategy for migrating crosscutting concerns in existing object-oriented systems to aspect-based solutions. The result of applying it to JHotDraw is made available as an open-source project, which is the largest aspect refactoring available to date.
Abstract: In this paper we propose a systematic strategy for migrating crosscutting concerns in existing object-oriented systems to aspect-based solutions. The proposed strategy consists of four steps: mining, exploration, documentation and refactoring of crosscutting concerns. We discuss in detail a new approach to aspect refactoring that is fully integrated with our strategy, and apply the whole strategy to an object-oriented system, namely the JHotDraw framework. The result of this migration is made available as an open-source project, which is the largest aspect refactoring available to date. We report on our experiences with conducting this case study and reflect on the success and challenges of the migration process, as well as on the feasibility of automatic aspect refactoring.
TL;DR: This work uses an automated program analysis, called Reach, to compute program inputs that cause evaluation of explicitly-marked target expressions, based on lazy narrowing, a symbolic evaluation strategy from functional-logic programming.
Abstract: We present an automated program analysis, called Reach, to compute program inputs that cause evaluation of explicitly-marked target expressions. Reach has a range of applications including property refutation, assertion breaking, program crashing, program covering, program understanding, and the development of customised data generators. Reach is based on lazy narrowing, a symbolic evaluation strategy from functional-logic programming. We use Reach to analyse a range of programs, and find it to be a useful tool with clear performance benefits over a method based on exhaustive input generation. We also explore different methods for bounding the search space, the selective use of breadth-first search to find the first solution quickly, and techniques to avoid evaluation that is unnecessary to reach a target.
TL;DR: The SQuAVisiT toolset allows a fully automatic extraction of metrics, call information, and code duplication from COBOL source code; this information can be easily converted and exported to a set of visualization tools.
Abstract: Software quality assessment of large COBOL industrial legacy systems, whether for maintenance or migration purposes, poses a serious challenge. We present the software quality assessment and visualisation toolset (SQuAVisiT), which assists users in performing the above task. First, it allows a fully automatic extraction of metrics, call information, and code duplication from COBOL source code. This information, stored in a database, can be easily converted and exported to a set of visualization tools. We incorporated several such third-party tools for the visualization of call relations, system structure, and metrics. These tools use novel visualization techniques such as bundled edges, matrix plots, and table lens. We illustrate the usage of our toolset with an industrial case study on a COBOL system comprising about 3000 modules and 1.7 million lines of code.
TL;DR: Using a pool of 48 million lines of code, experiments with the resulting syntactic rules for well-formed identifiers illustrate that violations of the syntactic pattern exist and that three-quarters of these violations are ‘real’ and could be identified using a concept mapping.
TL;DR: To protect against malicious modification of the client code, a portion of the client is replicated on the server; the results indicate that a barrier slice is significantly smaller than the corresponding backward slice while providing the same level of protection.
Abstract: Remote trusting aims at verifying the "healthy" execution of a program running on an untrusted client that communicates with a trusted server via a network connection. After giving a formal definition of the remote trusting problem and a test to determine whether an attack against a given remote trusting scheme is successful or not, we propose a protection against malicious modification of the client code, based on the replication of a portion of the client on the server. To minimize the size of the code that is replicated, we propose to use barrier slicing. We show the feasibility of our approach on a case study. Our results indicate that a barrier slice is significantly smaller than the corresponding backward slice while providing the same level of protection.
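The size difference between a backward slice and a barrier slice can be seen in a small sketch over a dependence graph. It assumes the graph is given as a plain mapping from each node to the nodes it depends on, and that barrier nodes are excluded from the slice and not traversed; the function name and the example graph are hypothetical, and the sketch ignores interprocedural details.

```python
from collections import deque


def backward_slice(deps, criterion, barriers=frozenset()):
    """Collect all nodes the criterion depends on via barrier-free paths.

    `deps` maps a node to the set of nodes it directly depends on.
    With an empty barrier set this is an ordinary backward slice.
    """
    visited = {criterion}
    queue = deque([criterion])
    while queue:
        node = queue.popleft()
        for pred in deps.get(node, ()):
            # Barrier nodes stop the traversal and are left out of the slice.
            if pred in barriers or pred in visited:
                continue
            visited.add(pred)
            queue.append(pred)
    return visited


if __name__ == "__main__":
    deps = {"out": {"c", "d"}, "c": {"b"}, "b": {"a"}, "d": {"a"}}
    print(sorted(backward_slice(deps, "out")))
    print(sorted(backward_slice(deps, "out", barriers={"b", "d"})))
```

Because the barrier variant only ever prunes the traversal, its result is always a subset of the plain backward slice for the same criterion.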
TL;DR: This paper presents a new algorithm that performs the "form template method" refactoring in a semi-automated way on Java programs, states the difficulties inherent to this transformation, and proposes solutions to handle them.
Abstract: This paper presents an implementation of the "form template method" refactoring. This transformation has not been automated yet, but has many similarities with other transformations such as clone detection and removal or method extraction. Forming a template method is a difficult process because it has to deal with code statements directly. Few abstractions and algorithms have been investigated yet, compared to transformations dealing with higher level aspects such as the classes, methods, fields and their relations. We present a new algorithm that performs this transformation in a semi-automated way on Java programs. We state the difficulties inherent to this transformation and propose solutions to handle them.
TL;DR: This work proposes a series of evaluation algorithms for collection attributes, and shows that the best algorithms work well on large practical problems, including the analysis of large Java programs.
Abstract: Collection attributes, as defined by Boyland, can be used as a mechanism for concisely specifying cross-reference-like properties such as callee sets, subclass sets, and sets of variable uses. We have implemented collection attributes in our declarative meta programming system JastAdd, and used them for a variety of applications including devirtualization analysis, metrics, and flow analysis. We propose a series of evaluation algorithms for collection attributes, and compare their performance and applicability. The key design criteria for our algorithms are 1) that they work well with demand evaluation, i.e., defined properties are computed only if they are actually needed for a particular source code analysis problem and a particular source program, and 2) that they work in the presence of circular (fixed-point) definitions that are common for many source code analysis problems, e.g., flow analysis. We show that the best algorithms work well on large practical problems, including the analysis of large Java programs.
TL;DR: The empirical study over multiple iterations of various software packages developed using highly iterative or agile methods concludes that the Bansiya and Davis total quality index does indeed reflect stability over the data sets examined.
Abstract: The purpose of our study is to analyze whether the Bansiya and Davis quality models also reflect the ongoing stability of a software design in software developed using a highly iterative or agile process. We performed an empirical study over multiple iterations of various software packages developed using highly iterative or agile methods. We examined several Bansiya and Davis quality factor models (reusability, flexibility, understandability, functionality, extendibility, and effectiveness) over this data set and we compared them to stability metrics. We conclude that the Bansiya and Davis total quality index does indeed reflect stability over the data sets examined.
TL;DR: This work presents statement-level cohesion metrics based on slices and chops that can show which parts of a module have a low cohesion and thus help the maintainer to identify the parts that should be restructured.
Abstract: Slice-based metrics for cohesion have been defined and examined for years. However, if a module with low cohesion has been identified, the metrics cannot help the maintainer to restructure the module to improve the cohesion. This work presents statement-level cohesion metrics based on slices and chops. When visualized, the statement-level cohesion metrics can show which parts of a module have a low cohesion and thus help the maintainer to identify the parts that should be restructured.
TL;DR: This work shows an approach for keeping control flow related information even in sparse program representations by representing control flow effects as operations on the data transferred, i.e., as dataflow information, thus yielding a certain degree of path-sensitivity.
Abstract: Points-to analysis is a static program analysis aiming at analyzing the reference structure of dynamically allocated objects at compile-time. It constitutes the basis for many analyses and optimizations in software engineering and compiler construction. Sparse program representations, such as Whole Program Points-to Graph (WPP2G) and Points-to SSA (P2SSA), represent only dataflow that is directly relevant for points-to analysis. They have proved to be practical in terms of analysis precision and efficiency. However, intra-procedural control flow information is removed from these representations, which sacrifices analysis precision to improve analysis performance. We show an approach for keeping control flow related information even in sparse program representations by representing control flow effects as operations on the data transferred, i.e., as dataflow information. These operations affect distinct paths of the program differently, thus yielding a certain degree of path-sensitivity. Our approach works with both WPP2G and P2SSA representations. We apply the approach to P2SSA-based and flow-sensitive points-to analysis and evaluate a context-insensitive and a context-sensitive variant. We assess our approach using abstract precision metrics. Moreover, we investigate the precision improvements and performance penalties when used as an input to three source-code-level analyses: dead code, cast safety, and null pointer analysis.
TL;DR: This article proposes AVal: a Java5 framework for the definition and checking of rules for attribute-oriented programming (@OP) in Java, and defines a set of meta-annotations to allow the validation of @OP programs, as well as the means to extend these meta-annotations by using a compile-time model of the program's source code.
TL;DR: A new data reachability algorithm is proposed and fine tuned to resolve library callbacks accurately and shows a significant reduction in the number of spurious callback edges.
TL;DR: This paper experimentally demonstrates how a spatial index can be used to substantially increase matching performance and introduces the novel idea of using information retrieval techniques for calculating the similarity of bags of program fragments.
Abstract: To encourage open source/libre software development, it is desirable to have tools that can help to identify open source license violations. This paper describes the implementation of a tool that matches open source programs embedded inside pirate programs. The problem of binary program matching can be approximated by analyzing the similarity of program fragments generated from low-level instructions. These fragments are syntax trees that can be compared by using a tree distance function. Tree distance functions are generally very costly. Sequentially calculating the similarities of fragments with them becomes prohibitively expensive. In this paper we experimentally demonstrate how a spatial index can be used to substantially increase matching performance. These techniques allowed us to do exhaustive experiments that confirmed previous results on the subject. The paper also introduces the novel idea of using information retrieval techniques for calculating the similarity of bags of program fragments. It is possible to identify programs even when they are heavily obfuscated with the innovative approach described here.
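One common information retrieval measure for comparing bags is cosine similarity over occurrence counts, sketched below. The abstract does not say which measure the tool uses, so this choice is an assumption, as is the representation of each program fragment as a hashable token (the strings in the example are hypothetical).

```python
import math
from collections import Counter


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two bags (multisets) of fragment tokens."""
    a, b = Counter(bag_a), Counter(bag_b)
    # Only tokens present in both bags contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    program = ["add(r,r)", "load(m)", "load(m)", "jmp"]
    suspect = ["load(m)", "add(r,r)", "xor(r,r)"]
    print(cosine_similarity(program, suspect))
```

Since the measure ignores the order of fragments, reordering code does not change the score, which is one reason bag-based comparison is attractive for obfuscated binaries.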
TL;DR: It is shown that one algorithm may produce incorrect slices and that precise slicing of concurrent programs is very expensive in terms of computation time.
Abstract: Program slicing is a program-reduction technique for extracting statements that may influence other statements. While there exist efficient algorithms to slice sequential programs precisely, there are only two algorithms for precise slicing of concurrent interprocedural programs with recursive procedures. We implemented both algorithms for Java, applied several new optimizations and examined their precision and runtime behavior. We compared these results with two further algorithms which trade precision for speed. We show that one algorithm may produce incorrect slices and that precise slicing of concurrent programs is very expensive in terms of computation time.
TL;DR: This work presents a novel semi-static approach, which combines static string analysis with dynamically gathered information about the execution environment, and proposes generalizations of string analysis to increase the number of sites that can be resolved purely statically, and to track the names of environment variables.
Abstract: Modern applications are becoming increasingly dynamic and flexible. In Java software, one important flexibility mechanism is dynamic class loading. Unfortunately, the vast majority of static analyses for Java handle this feature either unsoundly or overly conservatively. We present a set of techniques for static resolution of dynamic-class-loading sites in Java software. Previous work has used static string analysis to achieve this goal. However, a large number of such sites are impossible to resolve with purely static techniques. We present a novel semi-static approach, which combines static string analysis with dynamically gathered information about the execution environment. The key insight behind this approach is the observation that dynamic class loading often depends on characteristics of the execution environment that are encoded in various environment variables. In addition, we propose generalizations of string analysis to increase the number of sites that can be resolved purely statically, and to track the names of environment variables. We present an experimental evaluation on 10,238 classes from the standard Java libraries. Our results show that a state-of-the-art purely static approach resolves only 28% of non-trivial sites, while our approach resolves more than twice as many sites. This work is a step towards making static analysis tools better equipped to handle the dynamic features of Java.
TL;DR: The effectiveness of SUDS is demonstrated by showing that it is capable of finding bugs and that performance is improved when static analysis is used to eliminate unnecessary instrumentation.
Abstract: SUDS is a powerful infrastructure for creating dynamic bug detection tools. It contains phases for both static analysis and dynamic instrumentation, allowing users to create tools that take advantage of both paradigms. The results of static analysis phases can be used to improve the quality of dynamic bug detection tools created with SUDS and could be expanded to find defects statically. The instrumentation engine is designed in a manner that allows users to create their own correctness models quickly but is flexible enough to support construction of a wide range of different tools. The effectiveness of SUDS is demonstrated by showing that it is capable of finding bugs and that performance is improved when static analysis is used to eliminate unnecessary instrumentation.
TL;DR: This contribution introduces temporal path conditions, which extend ordinary path conditions by temporal operators in order to express temporal dependencies between conditions for a flow, and proves the following soundness property: if a temporal path condition for a path is satisfiable, then the ordinary Boolean path condition for the path is satisfiable.
Abstract: Program dependence graphs are a well-established device to represent possible information flow in a program. Path conditions in dependence graphs have been proposed to express more detailed circumstances of a particular flow; they provide precise necessary conditions for information flow along a path or chop in a dependence graph. Ordinary Boolean path conditions however cannot express temporal properties, e.g. that for a specific flow it is necessary that some condition holds, and later another specific condition holds. In this contribution, we introduce temporal path conditions, which extend ordinary path conditions by temporal operators in order to express temporal dependencies between conditions for a flow. We present motivating examples, generation and simplification rules, application of model checking to generate witnesses for a specific flow, and a case study. We prove the following soundness property: if a temporal path condition for a path is satisfiable, then the ordinary boolean path condition for the path is satisfiable. The converse does not hold, indicating that temporal path conditions are more precise.
TL;DR: This tool-demo proposal explains the usage and benefits of the Reuseware Composition Framework by defining an extension of the Java language.
Abstract: The Reuseware Composition Framework is a tool-supported framework that aids developers of new composition techniques with integrating them into programming languages. In this tool-demo proposal we explain the usage and benefits of the framework by defining an extension of the Java language.
TL;DR: A new tool is presented which increases the level of understanding and the accuracy of design quality assessment within enterprise systems by providing its users with information specific to this type of system.
Abstract: In the current demonstration we present a new tool which increases the level of understanding and the accuracy of design quality assessment within enterprise systems. This is achieved by providing its users with information specific to this type of system (e.g., the tables accessed from a class). To validate its usefulness, we perform experiments on a suite of enterprise systems; the results are briefly presented in the last part of the demo.