TL;DR: A binary obfuscation scheme that relies on opaque constants, which are primitives that allow us to load a constant into a register such that an analysis tool cannot determine its value, demonstrates that static analysis techniques alone might no longer be sufficient to identify malware.
Abstract: Malicious code is an increasingly important problem that threatens the security of computer systems. The traditional line of defense against malware is composed of malware detectors such as virus and spyware scanners. Unfortunately, both researchers and malware authors have demonstrated that these scanners, which use pattern matching to identify malware, can be easily evaded by simple code transformations. To address this shortcoming, more powerful malware detectors have been proposed. These tools rely on semantic signatures and employ static analysis techniques such as model checking and theorem proving to perform detection. While it has been shown that these systems are highly effective in identifying current malware, it is less clear how successful they would be against adversaries that take into account the novel detection mechanisms. The goal of this paper is to explore the limits of static analysis for the detection of malicious code. To this end, we present a binary obfuscation scheme that relies on the idea of opaque constants, which are primitives that allow us to load a constant into a register such that an analysis tool cannot determine its value. Based on opaque constants, we build obfuscation transformations that obscure program control flow, disguise access to local and global variables, and interrupt tracking of values held in processor registers. Using our proposed obfuscation approach, we were able to show that advanced semantics-based malware detectors can be evaded. Moreover, our opaque constant primitive can be applied in a way such that is provably hard to analyze for any static code analyzer. This demonstrates that static analysis techniques alone might no longer be sufficient to identify malware.
TL;DR: An induction principle is presented which treats the control and data state sets on the same ground and it is observed that certain correctness conditions can be expressed without enumeration of the set of all possible control states.
Abstract: Two formal models for parallel computation are presented: an abstract conceptual model and a parallel-program model. The former model does not distinguish between control and data states. The latter model includes the capability for the representation of an infinite set of control states by allowing there to be arbitrarily many instruction pointers (or processes) executing the program. An induction principle is presented which treats the control and data state sets on the same ground. Through the use of “place variables,” it is observed that certain correctness conditions can be expressed without enumeration of the set of all possible control states. Examples are presented in which the induction principle is used to demonstrate proofs of mutual exclusion. It is shown that assertions-oriented proof methods are special cases of the induction principle. A special case of the assertions method, which is called parallel place assertions, is shown to be incomplete. A formalization of “deadlock” is then presented. The concept of a “norm” is introduced, which yields an extension, to the deadlock problem, of Floyd's technique for proving termination. Also discussed is an extension of the program model which allows each process to have its own local variables and permits shared global variables. Correctness of certain forms of implementation is also discussed. An Appendix is included which relates this work to previous work on the satisfiability of certain logical formulas.
TL;DR: A side effect occurs when a procedure or function changes the value of a global variable as mentioned in this paper, which is usually unplanned and therefore undesirable, and it is one reason why the use of global variables is deplored in programming.
Abstract: A side effect occurs when a procedure or function changes the value of a global variable. This is one reason why the use of global variables is deplored in programming; they allow the possibility of side effects, which are usually unplanned and, therefore, undesirable. But this is not always so; sometimes, as in database systems, the database itself is global to all procedures, and modifying it is just what many of the procedures in the system are supposed to do. Another example would be a procedure to generate random numbers where the ith random number, Xi, is generated as f(Xi-1) for some function f. A global variable X would contain the value of Xi-1 and a side effect would replace this with Xi. A third example of benign side effects is in functions, such as input and output functions, that both produce results and return a value giving information about them. For example, the scanf function in the CI/O library reads values into its arguments (a side effect) and returns a count of the number of items read.
TL;DR: A consequence of this spectrum of requirements is empirically shown by comparing the C call graphs extracted from three software systems by five extraction tools, which shows considerable variation.
Abstract: Informally, a call graph represents calls between entities in a given program. The call graphs that compilers compute to determine the applicability of an optimization must typically be conservative: a call may be omitted only if it can never occur in any execution of the program. Numerous software engineering tools also extract call graphs with the expectation that they will help software engineers increase their understanding of a program. The requirements placed on software engineering tools that compute call graphs are typically more relaxed than for compilers. For example, some false negatives—calls that can in fact take place in some execution of the program, but which are omitted from the call graph—may be acceptable, depending on the understanding task at hand. In this article, we empirically show a consequence of this spectrum of requirements by comparing the C call graphs extracted from three software systems (mapmaker, mosaic, and gcc) by nine tools (cflow, cawk, CIA, Field, GCT, Imagix, LSME, Mawk, and Rigiparse). A quantitative analysis of the call graphs extracted for each system shows considerable variation, a result that is counterintuitive to many experienced software engineers. A qualitative analysis of these results reveals a number of reasons for this variation: differing treatments of macros, function pointers, input formats, etc. The fundamental problem is not that variances among the graphs extracted by different tools exist, but that software engineers have little sense of the dimensions of approximation in any particular call graph. In this article, we describe and discuss the study, sketch a design space for static call graph extractors, and discuss the impact of our study on practitioners, tool developers, and researchers. Although this article considers only one kind of information, call graphs, many of the observations also apply to static extractors of other kinds of information, such as inheritance structures, file dependences, and references to global variables.
TL;DR: Cilk++ as mentioned in this paper is a runtime system for multicore processors with a runtime compiler and a racedetection tool, which allows nonlocal variables to be mitigated without lock contention or substantial code restructuring.
Abstract: The availability of multicore processors across a wide range of computing platforms has created a strong demand for software frameworks that can harness these resources. This paper overviews the Cilk++ programming environment, which incorporates a compiler, a runtime system, and a race-detection tool. The Cilk++ runtime system guarantees to load-balance computations effectively. To cope with legacy codes containing global variables, Cilk++ provides a "hyperobject" library which allows races on nonlocal variables to be mitigated without lock contention or substantial code restructuring.