TL;DR: The goal was to simulate a practicing programmer in a program maintenance environment using the techniques of program design adapted to program understanding and documentation; that is, given a program, a specification and correctness proof were developed for the program.
Abstract: This paper reports on an experiment in trying to understand an unfamiliar program of some complexity and to record the authors' understanding of it. The goal was to simulate a practicing programmer in a program maintenance environment using the techniques of program design adapted to program understanding and documentation; that is, given a program, a specification and correctness proof were developed for the program. The approach points out the value of correctness proof ideas in guiding the discovery process. Toward this end, a variety of techniques were used: direct cognition for smaller parts, discovering and verifying loop invariants for larger program parts, and functions determined by additional analysis for larger program parts. An indeterminate bounded variable was introduced into the program documentation to summarize the effect of several program variables and simplify the proof of correctness.
TL;DR: This work introduces a new program synthesis methodology for Datalog specifications to produce highly efficient monolithic C++ analyzers and demonstrates its competitiveness with state-of-the-art handcrafted tools.
Abstract: Designing and crafting a static program analysis is challenging due to the complexity of the task at hand. Among the challenges are modelling the semantics of the input language, finding suitable abstractions for the analysis, and handwriting efficient code for the analysis in a traditional imperative language such as C++. Hence, the development of static program analysis tools is costly in terms of development time and resources for real world languages. To overcome, or at least alleviate the costs of developing a static program analysis, Datalog has been proposed as a domain specific language (DSL). With Datalog, a designer expresses a static program analysis in the form of a logical specification. While a domain specific language approach aids in the ease of development of program analyses, it is commonly accepted that such an approach has worse runtime performance than handcrafted static analysis tools. In this work, we introduce a new program synthesis methodology for Datalog specifications to produce highly efficient monolithic C++ analyzers. The synthesis technique requires the re-interpretation of the semi-naive evaluation as a scaffolding for translation using partial evaluation. To achieve high-performance, we employ staged-compilation techniques and specialize the underlying relational data structures for a given Datalog specification. Experimentation on benchmarks for large-scale program analysis validates the superior performance of our approach over available Datalog tools and demonstrates our competitiveness with state-of-the-art handcrafted tools.
TL;DR: DECAF is presented, a virtual machine based, multi-target, whole-system dynamic binary analysis framework built on top of QEMU, which provides Just-In-Time Virtual Machine Introspection combined with a novel TCG instruction-level tainting at bit granularity, backed by a plugin based, simple-to-use event driven programming interface.
Abstract: Dynamic binary analysis is a prevalent and indispensable technique in program analysis. While several dynamic binary analysis tools and frameworks have been proposed, all suffer from one or more of: prohibitive performance degradation, semantic gap between the analysis code and the program being analyzed, architecture/OS specificity, being user-mode only, lacking APIs, etc. We present DECAF, a virtual machine based, multi-target, whole-system dynamic binary analysis framework built on top of QEMU. DECAF provides Just-In-Time Virtual Machine Introspection combined with a novel TCG instruction-level tainting at bit granularity, backed by a plugin based, simple-to-use event driven programming interface. DECAF exercises fine control over the TCG instructions to accomplish on-the-fly optimizations. We present 3 platform-neutral plugins - Instruction Tracer, Keylogger Detector, and API Tracer, to demonstrate the ease of use and effectiveness of DECAF in writing cross-platform and system-wide analysis tools. Implementation of DECAF consists of 9550 lines of C++ code and 10270 lines of C code and we evaluate DECAF using CPU2006 SPEC benchmarks and show average overhead of 605% for system wide tainting and 12% for VMI.
TL;DR: The traditional software architecture for compilers is revised to provide these features without unnecessarily complicating the analyses themselves, and the user is allowed to selectively trade off time for precision and to customize the termination of these costly analyses in order to provide finer user control.
Abstract: Building efficient tools for understanding large software systems is difficult. Many existing program understanding tools build control flow and data flow representations of the program a priori, and therefore may require prohibitive space and time when analyzing large systems. Since much of these representations may be unused during an analysis, we construct representations on demand, not in advance. Furthermore, some representations, such as the abstract syntax tree, may be used infrequently during an analysis. We discard these representations and recompute them as needed, reducing the overall space required. Finally, we permit the user to selectively trade off time for precision and to customize the termination of these costly analyses in order to provide finer user control. We revised the traditional software architecture for compilers to provide these features without unnecessarily complicating the analyses themselves. These techniques have been successfully applied in the design of a program slicer for the Comprehensive Health Care System (CHCS), a million line hospital management system written in the MUMPS programming language.
TL;DR: This analysis relies on unification-based type inference in a non-standard type system, using rows to approximate both the flow of escaping exceptions and theflow of result values, and the resulting analysis is efficient and precise.
Abstract: This paper presents a program analysis to estimate uncaught exceptions in ML programs. This analysis relies on unification-based type inference in a non-standard type system, using rows to approximate both the flow of escaping exceptions (a la effect systems) and the flow of result values (a la control-flow analyses). The resulting analysis is efficient and precise; in particular, arguments carried by exceptions are accurately handled.