TL;DR: A compositional shape analysis, where each procedure is analyzed independently of its callers, based on a generalized form of abduction (inference of explanatory hypotheses) which is the basis of a new interprocedural analysis algorithm.
Abstract: This paper describes a compositional shape analysis, where each procedure is analyzed independently of its callers. The analysis uses an abstract domain based on a restricted fragment of separation logic, and assigns a collection of Hoare triples to each procedure; the triples provide an over-approximation of data structure usage. Compositionality brings its usual benefits -- increased potential to scale, ability to deal with unknown calling contexts, graceful way to deal with imprecision -- to shape analysis, for the first time.The analysis rests on a generalized form of abduction (inference of explanatory hypotheses) which we call bi-abduction. Bi-abduction displays abduction as a kind of inverse to the frame problem: it jointly infers anti-frames (missing portions of state) and frames (portions of state not touched by an operation), and is the basis of a new interprocedural analysis algorithm. We have implemented our analysis algorithm and we report case studies on smaller programs to evaluate the quality of discovered specifications, and larger programs (e.g., an entire Linux distribution) to test scalability and graceful imprecision.
TL;DR: A survey of some of the new research trends in symbolic execution, with particular emphasis on applications to test generation and program analysis, and an approach that handles complex programming constructs such as input recursive data structures, arrays, as well as multithreading.
Abstract: Symbolic execution is a well-known program analysis technique which represents program inputs with symbolic values instead of concrete, initialized, data and executes the program by manipulating program expressions involving the symbolic values. Symbolic execution has been proposed over three decades ago but recently it has found renewed interest in the research community, due in part to the progress in decision procedures, availability of powerful computers and new algorithmic developments. We provide here a survey of some of the new research trends in symbolic execution, with particular emphasis on applications to test generation and program analysis. We first describe an approach that handles complex programming constructs such as input recursive data structures, arrays, as well as multithreading. Furthermore, we describe recent hybrid techniques that combine concrete and symbolic execution to overcome some of the inherent limitations of symbolic execution, such as handling native code or availability of decision procedures for the application domain. We follow with a discussion of techniques that can be used to limit the (possibly infinite) number of symbolic configurations that need to be analyzed for the symbolic execution of looping programs. Finally, we give a short survey of interesting new applications, such as predictive testing, invariant inference, program repair, analysis of parallel numerical programs and differential symbolic execution.
TL;DR: It is argued that IFC must better exploit modern program analysis technology, and an approach based on program dependence graphs (PDG) is presented, which is more precise and needs less annotations than traditional approaches.
Abstract: Information flow control (IFC) checks whether a program can leak secret data to public ports, or whether critical computations can be influenced from outside. But many IFC analyses are imprecise, as they are flow-insensitive, context-insensitive, or object-insensitive; resulting in false alarms. We argue that IFC must better exploit modern program analysis technology, and present an approach based on program dependence graphs (PDG). PDGs have been developed over the last 20 years as a standard device to represent information flow in a program, and today can handle realistic programs. In particular, our dependence graph generator for full Java bytecode is used as the basis for an IFC implementation which is more precise and needs less annotations than traditional approaches. We explain PDGs for sequential and multi-threaded programs, and explain precision gains due to flow-, context-, and object-sensitivity. We then augment PDGs with a lattice of security levels and introduce the flow equations for IFC. We describe algorithms for flow computation in detail and prove their correctness. We then extend flow equations to handle declassification, and prove that our algorithm respects monotonicity of release. Finally, examples demonstrate that our implementation can check realistic sequential programs in full Java bytecode.
TL;DR: DOOP framework for points-to analysis of Java programs using Datalog for declarative specification and optimization.
Abstract: We present the DOOP framework for points-to analysis of Java programs. DOOP builds on the idea of specifying pointer analysis algorithms declaratively, using Datalog: a logic-based language for defining (recursive) relations. We carry the declarative approach further than past work by describing the full end-to-end analysis in Datalog and optimizing aggressively using a novel technique specifically targeting highly recursive Datalog programs.
TL;DR: This paper presents a new interprocedural, flow-sensitive pointer analysis algorithm that combines two ideas-semi-sparse analysis and a novel use of BDDs-that arise from a careful understanding of the unique challenges that face flow- sensitive pointer analysis.
Abstract: Pointer analysis is a prerequisite for many program analyses, and the effectiveness of these analyses depends on the precision of the pointer information they receive. Two major axes of pointer analysis precision are flow-sensitivity and context-sensitivity, and while there has been significant recent progress regarding scalable context-sensitive pointer analysis, relatively little progress has been made in improving the scalability of flow-sensitive pointer analysis.This paper presents a new interprocedural, flow-sensitive pointer analysis algorithm that combines two ideas-semi-sparse analysis and a novel use of BDDs-that arise from a careful understanding of the unique challenges that face flow-sensitive pointer analysis. We evaluate our algorithm on 12 C benchmarks ranging from 11K to 474K lines of code. Our fastest algorithm is on average 197x faster and uses 4.6x less memory than the state of the art, and it can analyze programs that are an order of magnitude larger than the previous state of the art.
TL;DR: This paper explores techniques for supporting dynamic updates to multi-threaded programs, focusing on the problem of applying an update in a timely fashion while still producing correct behavior, and can safely perform a dynamic update within milliseconds when more straightforward alternatives would delay some updates indefinitely.
Abstract: Many dynamic updating systems have been developed that enable a program to be patched while it runs, to fix bugs or add new features This paper explores techniques for supporting dynamic updates to multi-threaded programs, focusing on the problem of applying an update in a timely fashion while still producing correct behavior Past work has shown that this tension of safety versus timeliness can be balanced for single-threaded programs For multi-threaded programs, the task is more difficult because myriad thread interactions complicate understanding the possible program states to which a patch could be applied Our approach allows the programmer to specify a few program points (eg, one per thread) at which a patch may be applied, which simplifies reasoning about safety To improve timeliness, a combination of static analysis and run-time support automatically expands these few points to many more that produce behavior equivalent to the originals Experiments with thirteen realistic updates to three multi-threaded servers show that we can safely perform a dynamic update within milliseconds when more straightforward alternatives would delay some updates indefinitely
TL;DR: A privacy-protection framework is proposed that partitions a genomic computation, distributing the part on sensitive data to the data provider and the parts on the pubicData to the user of the data through program specialization.
Abstract: In this paper, we present a new approach to performing important classes of genomic computations (e.g., search for homologous genes) that makes a significant step towards privacy protection in this domain. Our approach leverages a key property of the human genome, namely that the vast majority of it is shared across humans (and hence public), and consequently relatively little of it is sensitive. Based on this observation, we propose a privacy-protection framework that partitions a genomic computation, distributing the part on sensitive data to the data provider and the part on the pubic data to the user of the data. Such a partition is achieved through program specialization that enables a biocomputing program to perform a concrete execution on public data and a symbolic execution on sensitive data. As a result, the program is simplified into an efficient query program that takes only sensitive genetic data as inputs. We prove the effectiveness of our techniques on a set of dynamic programming algorithms common in genomic computing. We develop a program transformation tool that automatically instruments a legacy program for specialization operations. We also demonstrate that our techniques can greatly facilitate secure multi-party computations on large biocomputing problems.
TL;DR: This work presents a decision procedure that solves systems of equations over regular language variables over systems of constraints, and finds satisfying assignments for the variables in the system.
Abstract: Reasoning about string variables, in particular program inputs, is an important aspect of many program analyses and testing frameworks Program inputs invariably arrive as strings, and are often manipulated using high-level string operations such as equality checks, regular expression matching, and string concatenation It is difficult to reason about these operations because they are not well-integrated into current constraint solversWe present a decision procedure that solves systems of equations over regular language variables Given such a system of constraints, our algorithm finds satisfying assignments for the variables in the system We define this problem formally and render a mechanized correctness proof of the core of the algorithm We evaluate its scalability and practical utility by applying it to the problem of automatically finding inputs that cause SQL injection vulnerabilities
TL;DR: The ROSE source-to-source outliner is presented, which addresses the problem of extracting tunable kernels out of whole programs, thereby helping to convert the challenging whole-program tuning problem into a set of more manageable kernel tuning tasks.
Abstract: Although automated empirical performance optimization and tuning is well-studied for kernels and domain-specific libraries, a current research grand challenge is how to extend these methodologies and tools to significantly larger sequential and parallel applications. In this context, we present the ROSE source-to-source outliner, which addresses the problem of extracting tunable kernels out of whole programs, thereby helping to convert the challenging whole-program tuning problem into a set of more manageable kernel tuning tasks. Our outliner aims to handle large scale C/C++, Fortran and OpenMP applications. A set of program analysis and transformation techniques are utilized to enhance the portability, scalability, and interoperability of source-to-source outlining. More importantly, the generated kernels preserve performance characteristics of tuning targets and can be easily handled by other tools. Preliminary evaluations have shown that the ROSE outliner serves as a key component within an end-to-end empirical optimization system and enables a wide range of sequential and parallel optimization opportunities.
TL;DR: The study reveals that large dependence clusters are surprisingly commonplace in C program source code, and most of the 45 programs studied have clusters of dependence that consume more than 10% of the whole program.
Abstract: A dependence cluster is a set of program statements, all of which are mutually inter-dependent. This article reports a large scale empirical study of dependence clusters in C program source code. The study reveals that large dependence clusters are surprisingly commonplace. Most of the 45 programs studied have clusters of dependence that consume more than 10p of the whole program. Some even have clusters consuming 80p or more. The widespread existence of clusters has implications for source code analyses such as program comprehension, software maintenance, software testing, reverse engineering, reuse, and parallelization.
TL;DR: In this paper, an annotation statement is used to express contracts at routine boundaries, allowing an analyzer to check the global correctness of the source code through modular (local) analysis, with performance that is linear in the number of routines.
Abstract: Program source code is annotated to support dataflow analysis or other program analysis, without requiring changes to compilers. Annotation statements are embedded inside comments or other non-code-generative portions of the source code. The annotations can be used to express contracts at routine boundaries, allowing an analyzer to check the global correctness of the source code through modular (local) analysis, with performance that is linear in the number of routines. In particular, annotated SQL source code may be analyzed to identify SQL injection vulnerabilities.
TL;DR: In this paper, the authors propose a typecheckable ownership domain annotations that enable a points-to static analysis to extract a sound global object graph that provides architectural abstraction by ownership hierarchy and by types where architecturally significant objects appear near the top of the hierarchy and data structures are further down.
Abstract: An object diagram makes explicit the object structures that are only implicit in a class diagram. An object diagram may be missing and must extracted from the code. Alternatively, an existing diagram may be inconsistent with the code, and must be analyzed for conformance with the implementation. One can generalize the global object diagram of a system into a runtime architecture which abstracts objects into components, represents how those components interact, and can decompose a component into a nested sub-architecture.A static object diagram represents all objects and inter-object relations possibly created, and is recovered by static analysis of a program. Existing analyses extract static object diagrams that are non-hierarchical, do not scale, and do not provide meaningful architectural abstraction. Indeed, architectural hierarchy is not readily observable in arbitrary code. Previous approaches used breaking language extensions to specify hierarchy and instances in code, or used dynamic analyses to extract dynamic object diagrams that show objects and relations for a few program runs.Typecheckable ownership domain annotations use existing language support for annotations and specify in code object encapsulation, logical containment and architectural tiers. These annotations enable a points-to static analysis to extract a sound global object graph that provides architectural abstraction by ownership hierarchy and by types, where architecturally significant objects appear near the top of the hierarchy and data structures are further down.Another analysis can abstract an object graph into a built runtime architecture. Then, a third analysis can compare the built architecture to a target, analyze and measure their structural conformance, establish traceability between the two and identify interesting differences.
TL;DR: This work uses program analysis and code generation to map the applications to a GPU and finds that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms.
Abstract: Modern GPUs offer much computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general purpose applications, several challenges still remain in programming the GPUs. Thus, it is clearly desirable to be able to program GPUs using a higher-level interface.In this paper, we offer a solution that targets a specific class of applications, which are the data mining and scientific data analysis applications. Our work is driven by the observation that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms. In our solution, the programmers simply need to specify the sequential reduction loop(s) with some additional information about the parameters. We use program analysis and code generation to map the applications to a GPU. Several additional optimizations are also performed by the system.We have evaluated our system using three popular data mining applications, k-means clustering, EM clustering, and Principal Component Analysis (PCA). The main observations from our experiments are as follows. The speedup that each of these applications achieve over a sequential CPU version ranges between 20 and 50. The automatically generated version did not have any noticeable overheads compared to hand written codes. Finally, the optimizations performed in the system resulted in significant performance improvements.
TL;DR: This paper introduces a novel approach in which transformations are used to improve testability of a program by generating a pseudo-oracle and shows that both random testing and genetic algorithms are capable of utilizing the pseudo- oracles to automatically find program failures.
Abstract: Testability transformations are source-to-source program transformations that are designed to improve the testability of a program. This paper introduces a novel approach in which transformations are used to improve testability of a program by generating a pseudo-oracle. A pseudo-oracle is an alternative version of a program under test whose output can be compared with the original. Differences in output between the two programs may indicate a fault in the original program. Two transformations are presented. The first can highlight numerical inaccuracies in programs and cumulative roundoff errors, whilst the second may detect the presence of race conditions in multi-threaded code. Once a pseudo-oracle is generated, techniques are applied from the field of search-based testing to automatically find differences in output between the two versions of the program. The results of an experimental study presented in the paper show that both random testing and genetic algorithms are capable of utilizing the pseudo-oracles to automatically find program failures. Using genetic algorithms it is possible to explicitly maximize the discrepancies between the original programs and their pseudo-oracles. This allows for the production of test cases where the observable failure is highly pronounced, enabling the tester to establish the seriousness of the underlying fault.
TL;DR: The design and application of DDT are presented, a new program analysis tool that automatically identifies data structures within an application that is highly accurate across several different implementations of standard data structures, enabling aggressive optimizations in many situations.
Abstract: Data structures define how values being computed are stored and accessed within programs. By recognizing what data structures are being used in an application, tools can make applications more robust by enforcing data structure consistency properties, and developers can better understand and more easily modify applications to suit the target architecture for a particular application. This paper presents the design and application of DDT, a new program analysis tool that automatically identifies data structures within an application. An application binary is instrumented to dynamically monitor how the data is stored and organized for a set of sample inputs. The instrumentation detects which functions interact with the stored data, and creates a signature for these functions using dynamic invariant detection. The invariants of these functions are then matched against a library of known data structures, providing a probable identification. That is, DDT uses program consistency properties to identify what data structures an application employs. The empirical evaluation shows that this technique is highly accurate across several different implementations of standard data structures, enabling aggressive optimizations in many situations.
TL;DR: This work describes the principal problems of adjoining the most frequently used communication idioms of MPI programs and proposes solutions to cover these idioms, and considers the consequences for the MPI implementation, theMPI user and MPI-aware program analysis.
Abstract: Automatic differentiation is the primary means of obtaining analytic derivatives from a numerical model given as a computer program. Therefore, it is an essential productivity tool in numerous computational science and engineering domains. Computing gradients with the adjoint (also called reverse) mode via source transformation is a particularly beneficial but also challenging use of automatic differentiation. To date only ad hoc solutions for adjoint differentiation of MPI programs have been available, forcing automatic differentiation tool users to reason about parallel communication dataflow and dependencies and manually develop adjoint communication code. Using the communication graph as a model we characterize the principal problems of adjoining the most frequently used communication idioms. We propose solutions to cover these idioms and consider the consequences for the MPI implementation, the MPI user and MPI-aware program analysis. The MIT general circulation model serves as a use case to illustrate the viability of our approach.
TL;DR: In this article, a semantic view of program executions is proposed to capture semantic interactions among abstractions at different levels of granularity in a scalable manner, which can identify precise and useful details about the underlying cause for a regression, even in programs that use reflection, multithreading or dynamic code generation.
Abstract: As computer systems continue to become more powerful and complex, so do programs. High-level abstractions introduced to deal with complexity in large programs, while simplifying human reasoning, can often obfuscate salient program properties gleaned from automated source-level analysis through subtle (often non-local) interactions. Consequently, understanding the effects of program changes and whether these changes violate intended protocols become difficult to infer. Refactorings, and feature additions, modifications, or removals can introduce hard-to-catch bugs that often go undetected until many revisions later.To address these issues, this paper presents a novel dynamic program analysis that builds a semantic view of program executions. These views reflect program abstractions and aspects; however, views are not simply projections of execution traces, but are linked to each other to capture semantic interactions among abstractions at different levels of granularity in a scalable manner.We describe our approach in the context of Java and demonstrate its utility to improve regression analysis. We first formalize a subset of Java and a grammar for traces generated at program execution. We then introduce several types of views used to analyze regression bugs along with a novel, scalable technique for semantic differencing of traces from different versions of the same program. Benchmark results on large open-source Java programs demonstrate that semantic-aware trace differencing can identify precise and useful details about the underlying cause for a regression, even in programs that use reflection, multithreading, or dynamic code generation, features that typically confound other analysis techniques.
TL;DR: The majority of vulnerabilities that affect web applications can be ascribed to the lack of proper validation of user's input, before it is used as argument of an output function.
Abstract: Increasingly, web applications handle sensitive data and interface with critical back-end components, but are often written by poorly experienced programmers with low security skills. The majority of vulnerabilities that affect web applications can be ascribed to the lack of proper validation of user's input, before it is used as argument of an output function. Several program analysis techniques were proposed to automatically spot these vulnerabilities. One particularly effective is dynamic taint analysis. Unfortunately, this approach introduces a significant run-time penalty. In this paper, we present a hybrid analysis framework that blends together the strengths of static and dynamic approaches for the detection of vulnerabilities in web applications: a static analysis, performed just once, is used to reduce the run-time overhead of the dynamic monitoring phase. We designed and implemented a tool, called Phan, that is able to statically analyze PHP bytecode searching for dangerous code statements; then, only these statements are monitored during the dynamic analysis phase.
TL;DR: This paper presents CPAchecker, a tool and framework that aims at easy integration of new verification components and evaluates the efficiency of the current version of the tool on software-verification benchmarks from the literature, and compares it with other state-of-the-art model checkers.
Abstract: Configurable software verification is a recent concept for expressing different program analysis and model checking approaches in one single formalism. This paper presents CPAchecker, a tool and framework that aims at easy integration of new verification components. Every abstract domain, together with the corresponding operations, is required to implement the interface of configurable program analysis (CPA). The main algorithm is configurable to perform a reachability analysis on arbitrary combinations of existing CPAs. The major design goal during the development was to provide a framework for developers that is flexible and easy to extend. We hope that researchers find it convenient and productive to implement new verification ideas and algorithms using this platform and that it advances the field by making it easier to perform practical experiments. The tool is implemented in Java and runs as command-line tool or as Eclipse plug-in. We evaluate the efficiency of our tool on benchmarks from the software model checker BLAST. The first released version of CPAchecker implements CPAs for predicate abstraction, octagon, and explicit-value domains. Binaries and the source code of CPAchecker are publicly available as free software.
TL;DR: An integrated algorithms system is proposed to implement the RBDO of the offshore towers, which are subjected to the extreme wave loading, which includes a reliability analysis and an optimization algorithm.
TL;DR: An operational semantics for Declarative Networking is proposed that addresses the presence of asynchronous communication, distribution, and imperative modification of the program state and keeps open a design space required at the current stage of the language development.
Abstract: Declarative Networking has been recently promoted as a high-level programming paradigm to more conveniently describe and implement systems that run in a distributed fashion over a computer network. It has already been used to implement various networked systems, e.g., network overlays, Byzantine fault tolerance protocols, and distributed hash tables. Declarative Networking relies upon a rule-based programming language that resembles Datalog and allows one to declaratively specify the flow of networking events. However, the presence of asynchronous communication, distribution, and imperative modification of the program state in Declarative Networking applications have been an obstacle for defining its semantics. Currently, the reference semantics is determined by the runtime environment only, which hinders further application development and makes any efforts to develop program analysis and verification tools impossible. In this paper, we propose an operational semantics for Declarative Networking that addresses these problems. The semantics is parameterized to keep open a design space required at the current stage of the language development. We also report on our first experience with an interpreter for Declarative Networking applications that implements the proposed semantics.
TL;DR: An efficient algorithm for computing the most specific templates of terms represented by labelled directed acyclic graphs is presented and the complexity of the anti-unification problem is estimated.
Abstract: A term t is called a template of terms t1 and t2 iff t1=tη1 and t2=tη2, for some substitutions η1 and η2. A template t of t1 and t2 is called the most specific iff for any template t′ of t1 and t2 there exists a substitution ξ such that t=t′ξ. The anti-unification problem is that of computing the most specific template of two given terms. This problem is dual to the well-known unification problem, which is the computing of the most general instance of terms. Unification is used extensively in automatic theorem proving and logic programming. We believe that anti-unification algorithms may have wide applications in program analysis. In this paper we present an efficient algorithm for computing the most specific templates of terms represented by labelled directed acyclic graphs and estimate the complexity of the anti-unification problem. We also describe techniques for invariant generation and software clone detection based on the concepts of the most specific templates and anti-unification.
TL;DR: A comparison of the sophisticated approaches of Vampir NG (VNG), the Debugging Wizard DeWiz and the correctness-checking tool MARMOT provides novel ideas for scalable parallel program analysis.
Abstract: Large-scale high-performance computing systems pose a tough obstacle for today's program analysis tools. Their demands in computational performance and memory capacity for processing program analysis data exceed the capabilities of standard workstations and traditional analysis tools. A comparison of the sophisticated approaches of Vampir NG (VNG), the Debugging Wizard DeWiz and the correctness-checking tool MARMOT provides novel ideas for scalable parallel program analysis. While VNG exploits the power of cluster architectures for near real-time performance analysis, DeWiz utilises distributed computing infrastructures for distinct analysis activities. MARMOT combines automatic runtime and partially distributed analysis.
TL;DR: This paper studies a counting version of SMT, i.e., how to compute the volume of the solution space, given a set of Boolean combinations of linear constraints, and proposes an improved algorithm and two ways of incorporating theory-level lemma learning technique into the algorithm.
Abstract: There are many works on the satisfiability problem for various logics and constraint languages, such as SAT and Satisfiability Modulo Theories (SMT). On the other hand, the counting version of decision problems is also quite important in automated reasoning. In this paper, we study a counting version of SMT, i.e., how to compute the volume of the solution space, given a set of Boolean combinations of linear constraints. The problem generalizes the model counting problem and the volume computation problem for convex polytopes. It has potential applications to program analysis and verification, as well as approximate reasoning, yet it has received little attention. We first give a straightforward method, and then propose an improved algorithm. We also describe two ways of incorporating theory-level lemma learning technique into the algorithm. They have been implemented, and some experimental results are given. Through an example program, we show that our tool can be used to compute how often a given program path is executed.
TL;DR: In this article, an analysis engine for static analysis using CEGAR loop functionality, using a combination of forward and backward validation-phase trace analyses, is described, which includes a number of features.
Abstract: An analysis engine is described for performing static analysis using CEGAR loop functionality, using a combination of forward and backward validation-phase trace analyses. The analysis engine includes a number of features. For example: (1) the analysis engine can operate on blocks of program statements of different adjustable sizes; (2) the analysis engine can identify a subtrace of the trace and perform analysis on that subtrace (rather than the full trace); (3) the analysis engine can form a pyramid of state conditions and extract predicates based on the pyramid and/or from auxiliary source(s); (4) the analysis engine can generate predicates using an increasingly-aggressive series of available discovery techniques; (5) the analysis engine can selectively concretize procedure calls associated with the trace on an as-needed basis and perform other refinements; and (6) the analysis engine can add additional verification targets in the course of its analysis, etc.
TL;DR: The aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modi cation patterns seen in the intra-project copy-pastes and in the plagiarism cases.
Abstract: Plagiarism detection and clone refactoring in software depend on one common concern: nding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modi cations are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Depen- dency Graph (PDG), we believe that the AST could e ciently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree nger- printing. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that e ciently indexes AST representations in a database, that quickly detects exact (w.r.t source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modi cation patterns seen in the intra-project copy-pastes and in the plagiarism cases.
TL;DR: In this paper, the authors describe a verification system to verify a program by symbolically enumerating path programs, verifying each path program to determine if the path program is correct or leads to a violation of a correctness property.
Abstract: Systems and methods are disclosed to verify a program by symbolically enumerating path programs; verifying each path program to determine if the path program is correct or leads to a violation of a correctness property; determining a conflict set from the path program if the path program is proved correct; using the conflict set to avoid enumerating other related path programs that are also correct.
TL;DR: The effort to leverage domain-specific knowledge for optimizing symbolic execution (SymExe)-based analyses is presented, and optimization techniques incorporated as a lightweight semi-decision procedure (LDP) that provides up to an order of magnitude faster analysis time when analyzing realistic programs and well-known algorithms.
Abstract: Automated theorem proving techniques such as Satisfiability Modulo Theory (SMT) solvers have seen significant advances in the past several years. These advancements, coupled with vast hardware improvements, have drastic impact on, for example, program verification techniques and tools. The general availability of robust general purpose solvers have reduced a significant engineering overhead when designing and developing program verifiers. However, most solver implementations are designed to be used as a black box, and due to their aim as general purpose solvers, they often miss optimization opportunities that can be done by leveraging domain-specific knowledge. This paper presents our effort to leverage domain-specific knowledge for optimizing symbolic execution (SymExe)-based analyses; we present optimization techniques incorporated as a lightweight semi-decision procedure (LDP) that provides up to an order of magnitude faster analysis time when analyzing realistic programs and well-known algorithms. LDP sits in the middle between a SymExe-based analysis tool and an existing SMT solver; it aims to reduce the number of solver calls by intercepting them and attempting to solve constraints using its lightweight deductive engine.
TL;DR: A novel technique is presented to create a system in which, for the cost of writing just one specification, an interpreter for the programming language of interest obtains automatically-generated, mutually-consistent implementations of all three symbolic-analysis primitives.
Abstract: The paper presents a novel technique to create implementations of the basic primitives used in symbolic program analysis: forward symbolic evaluation , weakest liberal precondition , and symbolic composition . We used the technique to create a system in which, for the cost of writing just one specification--an interpreter for the programming language of interest--one obtains automatically-generated, mutually-consistent implementations of all three symbolic-analysis primitives. This can be carried out even for languages with pointers and address arithmetic. Our implementation has been used to generate symbolic-analysis primitives for x86 and PowerPC.
TL;DR: Experimental results show that the proposed birthmark is more reliable than previous methods for detecting programs that are suspected to be copied and compared with previous approaches in terms of credibility and resilience.
Abstract: A software birthmark is an inherent characteristic of a program that can be used to identify that program. By comparing the birthmarks of two programs, it is possible to infer if one program is a copy of another. In this paper, we propose a static birthmark based on the control flow edges in Java programs. Control flow edges can represent possible behaviors in program execution. Thus, a set of the control flow edges of a program can be used as a birthmark for that program. The similarity between two programs can then be calculated by finding pairs of similar behaviors of the control flow edges in the two birthmarks. The proposed birthmark is evaluated and compared with previous approaches in terms of credibility and resilience. Experimental results show that the proposed birthmark is more reliable than previous methods for detecting programs that are suspected to be copied.