TL;DR: An intermediate program representation, called the program dependence graph (PDG), that makes explicit both the data and control dependences for each operation in a program, allowing transformations to be triggered by one another and applied only to affected dependences.
Abstract: In this paper we present an intermediate program representation, called the program dependence graph (PDG), that makes explicit both the data and control dependences for each operation in a program. Data dependences have been used to represent only the relevant data flow relationships of a program. Control dependences are introduced to analogously represent only the essential control flow relationships of a program. Control dependences are derived from the usual control flow graph. Many traditional optimizations operate more efficiently on the PDG. Since dependences in the PDG connect computationally related parts of the program, a single walk of these dependences is sufficient to perform many optimizations. The PDG allows transformations such as vectorization, that previously required special treatment of control dependence, to be performed in a manner that is uniform for both control and data dependences. Program transformations that require interaction of the two dependence types can also be easily handled with our representation. As an example, an incremental approach to modifying data dependences resulting from branch deletion or loop unrolling is introduced. The PDG supports incremental optimization, permitting transformations to be triggered by one another and applied only to affected dependences.
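The claim that a single walk of the dependences suffices for many optimizations can be illustrated with a small sketch. The five-statement program, the hand-written dependence tables, and the function name `live_statements` below are all hypothetical and are not taken from the paper. The sketch assumes the data and control dependences are already known, and uses one backward walk from the observable statements to find dead code.

```python
from collections import deque

# Hypothetical program (not from the paper):
#   s1: x = read()
#   s2: y = x + 1
#   s3: z = 5          # z is never used
#   s4: if x > 0:
#   s5:     print(y)
# Each table maps a statement to the statements it depends on.
data_deps = {"s1": set(), "s2": {"s1"}, "s3": set(), "s4": {"s1"}, "s5": {"s2"}}
control_deps = {"s1": set(), "s2": set(), "s3": set(), "s4": set(), "s5": {"s4"}}


def live_statements(roots, data_deps, control_deps):
    """Walk data and control dependences backward from the root statements."""
    live = set()
    work = deque(roots)
    while work:
        node = work.popleft()
        if node in live:
            continue
        live.add(node)
        # Both dependence kinds are followed in the same uniform way.
        work.extend(data_deps[node] | control_deps[node])
    return live


# Assumption: s5 is the only statement with an observable effect.
live = live_statements({"s5"}, data_deps, control_deps)
dead = set(data_deps) - live
print(sorted(live), sorted(dead))
```

Treating both dependence kinds as edges of one graph is what lets the walk stay this short; a real implementation would derive the tables from the control flow graph rather than write them by hand.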
TL;DR: In this article, the authors present new algorithms that efficiently compute static single assignment forms and control dependence graphs for arbitrary control flow graphs using the concept of dominance frontiers and give analytical and experimental evidence that these data structures are usually linear in the size of the original program.
Abstract: In optimizing compilers, data structure choices directly influence the power and efficiency of practical program optimization. A poor choice of data structure can inhibit optimization or slow compilation to the point that advanced optimization features become undesirable. Recently, static single assignment form and the control dependence graph have been proposed to represent data flow and control flow properties of programs. Each of these previously unrelated techniques lends efficiency and power to a useful class of program optimizations. Although both of these structures are attractive, the difficulty of their construction and their potential size have discouraged their use. We present new algorithms that efficiently compute these data structures for arbitrary control flow graphs. The algorithms use dominance frontiers, a new concept that may have other applications. We also give analytical and experimental evidence that all of these data structures are usually linear in the size of the original program. This paper thus presents strong evidence that these structures can be of practical use in optimization.
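To make the notion of a dominance frontier concrete, here is a minimal sketch that computes it directly from the standard definition: y is in the frontier of x when x dominates a predecessor of y but does not strictly dominate y. This is a naive set-based formulation, not the efficient algorithm the paper presents. The function names and the diamond-shaped example graph are assumptions for illustration, and every node is assumed to be reachable from the entry.

```python
def dominators(cfg, entry):
    """Iteratively compute dom[n], the set of nodes that dominate n."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds


def dominance_frontiers(cfg, entry):
    """Apply the definition: x dominates a predecessor of y,
    but x does not strictly dominate y."""
    dom, preds = dominators(cfg, entry)
    frontier = {n: set() for n in cfg}
    for y in cfg:
        for p in preds[y]:
            for x in dom[p]:
                if x == y or x not in dom[y]:
                    frontier[x].add(y)
    return frontier


# Hypothetical diamond-shaped control flow graph.
cfg = {"entry": ["a", "b"], "a": ["join"], "b": ["join"], "join": ["exit"], "exit": []}
df = dominance_frontiers(cfg, "entry")
print(df)
```

In the diamond, the two branch nodes have `join` in their frontiers, which is where static single assignment construction would place merge functions for variables assigned in either branch.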
TL;DR: An intermediate program representation, called a program dependence graph or PDG, which summarizes not only the data dependences of each operation but also summarizes the control dependence of the operations, which allows transformations such as vectorization to be performed in a manner which is uniform for both data and control dependence.
Abstract: In this paper we present an intermediate program representation, called a program dependence graph or PDG, which summarizes not only the data dependences of each operation but also summarizes the control dependences of the operations. Data dependences represent only the relevant data flow relationships of the program. Analogously, control dependences represent only the relevant control flow relationships of the program, in contrast to the usual control flow graph. The PDG allows transformations such as vectorization, which previously required special treatment of control dependence, to be performed in a manner which is uniform for both control and data dependences. Program transformations which require interaction of the two can also be easily handled by the representation. As an example, a new incremental approach to modifying data dependences resulting from branch deletion is introduced. Another value of our representation is that many traditional optimizations operate more efficiently on the PDG. Since dependences in the PDG connect computationally relevant parts of the program, a single walk of these dependences is sufficient to perform many optimizations.
TL;DR: This work proposes a novel neural network-based approach to compute the embedding, i.e., a numeric vector, based on the control flow graph of each binary function, then shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy.
Abstract: The problem of cross-platform binary code similarity detection aims at detecting whether two binary functions coming from different platforms are similar or not. It has many security applications, including plagiarism detection, malware detection, vulnerability search, etc. Existing approaches rely on approximate graph-matching algorithms, which are inevitably slow and sometimes inaccurate, and hard to adapt to a new task. To address these issues, in this work, we propose a novel neural network-based approach to compute the embedding, i.e., a numeric vector, based on the control flow graph of each binary function, then the similarity detection can be done efficiently by measuring the distance between the embeddings for two functions. We implement a prototype called Gemini. Our extensive evaluation shows that Gemini outperforms the state-of-the-art approaches by large margins with respect to similarity detection accuracy. Further, Gemini can speed up prior art's embedding generation time by 3 to 4 orders of magnitude and reduce the required training time from more than 1 week down to 30 minutes to 10 hours. Our real world case studies demonstrate that Gemini can identify significantly more vulnerable firmware images than the state-of-the-art, i.e., Genius. Our research showcases a successful application of deep learning on computer security problems.
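Once each function has an embedding, similarity detection reduces to comparing vectors. The sketch below shows only that comparison step and assumes cosine similarity as the measure. The vectors are hard-coded placeholders rather than outputs of Gemini, and the names `cosine_similarity`, `most_similar`, `f_arm`, and `g_mips` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, corpus):
    """Return the name of the corpus function whose embedding is closest."""
    return max(corpus, key=lambda name: cosine_similarity(query, corpus[name]))


# Placeholder embeddings; a real system would obtain these from a trained model.
corpus = {"f_arm": [1.0, 0.0, 1.0], "g_mips": [0.0, 1.0, 0.0]}
query = [0.9, 0.1, 1.1]
best = most_similar(query, corpus)
print(best)
```

Because each comparison is a handful of arithmetic operations, searching a large corpus is far cheaper than running a graph-matching algorithm per pair, which is the efficiency argument the abstract makes.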
TL;DR: Four algorithms, all conservative in the sense that all constants may not be found, but each constant found is constant over all possible executions of the program, are presented.
Abstract: Constant propagation is a well-known global flow analysis problem. The goal of constant propagation is to discover values that are constant on all possible executions of a program and to propagate these constant values as far forward through the program as possible. Expressions whose operands are all constants can be evaluated at compile time and the results propagated further. Using the algorithms presented in this paper can produce smaller and faster compiled programs. The same algorithms can be used for other kinds of analyses (e.g., type determination). We present four algorithms in this paper, all conservative in the sense that all constants may not be found, but each constant found is constant over all possible executions of the program. These algorithms are among the simplest, fastest, and most powerful global constant propagation algorithms known. We also present a new algorithm that performs a form of interprocedural data flow analysis in which aliasing information is gathered in conjunction with constant propagation. Several variants of this algorithm are considered.
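The conservative behavior described above can be seen in a small worklist sketch over a three-level lattice (undetermined, a known constant, not constant). This is a generic dense formulation written for illustration; it is not a reproduction of any of the paper's four algorithms and has no sparse or conditional features. The program encoding, the names `meet`, `eval_expr`, and `propagate`, and the example blocks are all assumptions.

```python
TOP, BOTTOM = "TOP", "BOTTOM"  # undetermined / known to be non-constant


def meet(a, b):
    """Combine the values a variable has on two incoming paths."""
    if a == TOP:
        return b
    if b == TOP:
        return a
    return a if a == b else BOTTOM


def eval_expr(expr, env):
    """Evaluate an operand (int or variable name) or an (op, a, b) tuple."""
    def value_of(operand):
        return operand if isinstance(operand, int) else env.get(operand, TOP)

    if not isinstance(expr, tuple):
        return value_of(expr)
    op, a, b = expr
    a, b = value_of(a), value_of(b)
    if BOTTOM in (a, b):
        return BOTTOM
    if TOP in (a, b):
        return TOP
    return a + b if op == "+" else a * b


def propagate(blocks, succs):
    """Return, for each block, the lattice value of every variable at block exit."""
    preds = {b: [] for b in blocks}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)
    out = {b: {} for b in blocks}
    work = list(blocks)
    while work:
        b = work.pop()
        env = {}
        for p in preds[b]:
            for var in set(env) | set(out[p]):
                env[var] = meet(env.get(var, TOP), out[p].get(var, TOP))
        for dst, expr in blocks[b]:
            env[dst] = eval_expr(expr, env)
        if env != out[b]:
            out[b] = env
            work.extend(succs[b])
    return out


# Hypothetical diamond: both branches assign c = 5, but d differs.
blocks = {
    "entry": [("a", 2), ("b", 3)],
    "left": [("c", ("+", "a", "b")), ("d", 1)],
    "right": [("c", 5), ("d", 2)],
    "join": [("e", ("*", "c", "a"))],
}
succs = {"entry": ["left", "right"], "left": ["join"], "right": ["join"], "join": []}
result = propagate(blocks, succs)
print(result["join"])
```

At the join, `c` is reported as constant because both paths agree, while `d` falls to `BOTTOM` because they disagree; the analysis never reports a value as constant unless it holds on every path it considered.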