TL;DR: A novel semantic program embedding that is learned from program execution traces is proposed, showing that program states expressed as sequential tuples of live variable values not only captures program semantics more precisely, but also offer a more natural fit for Recurrent Neural Networks to model.
Abstract: Neural program embeddings have shown much promise recently for a variety of program analysis tasks, including program synthesis, program repair, fault localization, etc. However, most existing program embeddings are based on syntactic features of programs, such as raw token sequences or abstract syntax trees. Unlike images and text, a program has an unambiguous semantic meaning that can be difficult to capture by only considering its syntax(i.e. syntactically similar programs can exhibit vastly different run-time behavior), which makes syntax-based program embeddings fundamentally limited. This paper proposes a novel semantic program embedding that is learned from program execution traces. Our key insight is that program states expressed as sequential tuples of live variable values not only captures program semantics more precisely, but also offer a more natural fit for Recurrent Neural Networks to model. We evaluate different syntactic and semantic program embeddings on predicting the types of errors that students make in their submissions to an introductory programming class and two exercises on the CodeHunt education platform. Evaluation results show that our new semantic program embedding significantly outperforms the syntactic program embeddings based on token sequences and abstract syntax trees. In addition, we augment a search-based program repair system with the predictions obtained from our semantic embedding, and show that search efficiency is also significantly improved.
TL;DR: This paper presents componential set- based analysis, which is faster and handles larger programs without any loss of accuracy over set-based analysis.
Abstract: Set based analysis is a constraint-based whole program analysis that is applicable to functional and object-oriented programming language. Unfortunately, the analysis is useless for large programs, since it generates descriptions of data flow relationships that grow quadratically in the size of the program.This paper presents componential set-based analysis, which is faster and handles larger programs without any loss of accuracy over set-based analysis. The design of the analysis exploits a number of theoretical results concerning constraint systems, including a completeness result and a decision algorithm concerning the observable equivalance of constraint systems. Experimental results validate the practically of the analysis.
TL;DR: The method is implemented in LLVM and validated on a large set of applications, which are cryptographic libraries with 19,708 lines of C/C++ code in total, and ensures that the number of CPU cycles taken to execute any path is independent of the secret data.
Abstract: We propose a method, based on program analysis and transformation, for eliminating timing side channels in software code that implements security-critical applications. Our method takes as input the original program together with a list of secret variables (e.g., cryptographic keys, security tokens, or passwords) and returns the transformed program as output. The transformed program is guaranteed to be functionally equivalent to the original program and free of both instruction- and cache-timing side channels. Specifically, we ensure that the number of CPU cycles taken to execute any path is independent of the secret data, and the cache behavior of memory accesses, in terms of hits and misses, is independent of the secret data. We have implemented our method in LLVM and validated its effectiveness on a large set of applications, which are cryptographic libraries with 19,708 lines of C/C++ code in total. Our experiments show the method is both scalable for real applications and effective in eliminating timing side channels.
TL;DR: This paper addresses the issue of identifying buffer overrun vulnerabilities by statically analyzing C source code and demonstrates a light-weight analysis based on modeling C string manipulations as a linear program.
Abstract: This paper addresses the issue of identifying buffer overrun vulnerabilities by statically analyzing C source code. We demonstrate a light-weight analysis based on modeling C string manipulations as a linear program. We also present fast, scalable solvers based on linear programming, and demonstrate techniques to make the program analysis context sensitive. Based on these techniques, we built a prototype and used it to identify several vulnerabilities in popular security critical applications.
TL;DR: Algorithms for constructing PPDGs and applying them to fault diagnosis and preliminary evidence indicating that a PPDG-based fault localization technique compares favorably with existing techniques are presented.
Abstract: This paper presents an innovative model of a program's internal behavior over a set of test inputs, called the probabilistic program dependence graph (PPDG), which facilitates probabilistic analysis and reasoning about uncertain program behavior, particularly that associated with faults. The PPDG construction augments the structural dependences represented by a program dependence graph with estimates of statistical dependences between node states, which are computed from the test set. The PPDG is based on the established framework of probabilistic graphical models, which are used widely in a variety of applications. This paper presents algorithms for constructing PPDGs and applying them to fault diagnosis. The paper also presents preliminary evidence indicating that a PPDG-based fault localization technique compares favorably with existing techniques. The paper also presents evidence indicating that PPDGs can be useful for fault comprehension.