TL;DR: Interval-schemata are shown to be analogous to Holland's symbol-schemata, to provide a key to understanding the implicit parallelism of real-valued GAs, and to support the intuition that real-coded GAs should have an advantage over binary coded GAs in exploiting local continuities in function optimization.
Abstract: In this paper we introduce interval-schemata as a tool for analyzing real-coded genetic algorithms (GAs). We show how interval-schemata are analogous to Holland's symbol-schemata and provide a key to understanding the implicit parallelism of real-valued GAs. We also show how they support the intuition that real-coded GAs should have an advantage over binary coded GAs in exploiting local continuities in function optimization. On the basis of our analysis we predict some failure modes for real-coded GAs using several different crossover operators and present some experimental results that support these predictions. We also introduce a crossover operator for real-coded GAs that is able to avoid some of these failure modes.
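As a minimal illustration of the interval-schema idea, the sketch below treats an interval-schema for one real-valued gene as a closed interval and tests whether a gene belongs to it. It also includes a blend-style crossover that samples each child gene from the interval spanned by the two parents, widened on both sides. The abstract does not name or specify its crossover operator, so `blend_crossover`, `in_schema`, and the parameter `alpha` are assumptions made for illustration only.

```python
import random


def in_schema(value, schema):
    """Return True if a real-valued gene lies inside the interval-schema (lo, hi)."""
    lo, hi = schema
    return lo <= value <= hi


def blend_crossover(parent_a, parent_b, alpha=0.5, rng=random):
    """Assumed operator: sample each child gene uniformly from the parents'
    interval, widened by alpha * (interval width) on both sides."""
    child = []
    for a, b in zip(parent_a, parent_b):
        lo, hi = min(a, b), max(a, b)
        spread = alpha * (hi - lo)
        child.append(rng.uniform(lo - spread, hi + spread))
    return child
```

With `alpha=0` the child always falls inside every interval-schema that contains both parents; a positive `alpha` lets the search step slightly outside that interval.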
TL;DR: The results of simulations of 18 different test programs under 375 different models of available parallelism analysis are presented, showing how simulations based on instruction traces can model techniques at the limits of feasibility and even beyond.
Abstract: Growing interest in ambitious multiple-issue machines and heavily pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. This paper presents the results of simulations of 18 different test programs under 375 different models of available parallelism analysis. This paper replaces Technical Note TN-15, an earlier version of the same material.
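A trace-driven measurement of this kind can be sketched as follows. The sketch assumes an idealized model that is not taken from the paper: perfect register renaming (only true data dependences constrain scheduling), perfect branch prediction, unit latency, and unlimited issue width. The trace format and the name `available_parallelism` are hypothetical.

```python
def available_parallelism(trace):
    """trace: list of (dest_register, [source_registers]) in program order.

    Each instruction executes one cycle after the latest producer of any of
    its source operands. Returns instructions executed per cycle.
    """
    executed_at = {}  # register -> cycle in which its latest writer executes
    last_cycle = 0
    for dest, sources in trace:
        # Registers never written in the trace are treated as ready at cycle 0.
        cycle = 1 + max((executed_at.get(s, 0) for s in sources), default=0)
        executed_at[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if trace else 0.0
```

A fully dependent chain yields a parallelism of 1.0, while a trace of mutually independent instructions yields a value equal to its length. More restrictive models would add constraints, for example a finite window or imperfect prediction.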
TL;DR: It is argued that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations, and that managing parallelism at the user level is essential to high-performance parallel computing.
Abstract: Threads are the vehicle for concurrency in many approaches to parallel programming. Threads separate the notion of a sequential execution stream from the other aspects of traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective of this separation is to make the expression and control of parallelism sufficiently cheap that the programmer or compiler can exploit even fine-grained parallelism with acceptable overhead. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory. This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; we thus argue that managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the lack of system integration exhibited by user-level threads is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; we thus argue that kernel threads or processes, as currently conceived, are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.
TL;DR: An overview of current thinking on dataflow architecture is provided by describing example Id programs, their compilation to dataflow graphs, and their execution on the TTDA, a multiprocessor architecture.
Abstract: The MIT Tagged-Token Dataflow Project has an unconventional, but integrated approach to general-purpose high-performance parallel computing. Rather than extending conventional sequential languages, Id, a high-level language with fine-grained parallelism and determinacy implicit in its operational semantics, is used. Id programs are compiled to dynamic dataflow graphs, which constitute a parallel machine language. Dataflow graphs are directly executed on the MIT tagged-token dataflow architecture (TTDA), a multiprocessor architecture. An overview of current thinking on dataflow architecture is provided by describing example Id programs, their compilation to dataflow graphs, and their execution on the TTDA. Related work and the status of the project are described.
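The basic firing rule behind dataflow execution can be illustrated with a small sketch: a node fires as soon as tokens are present on all of its operands, and all enabled nodes may fire in the same step. This is a simplified static interpreter written for illustration, not the TTDA or Id; it omits token tags entirely, and the graph encoding and the name `run_dataflow` are hypothetical.

```python
import operator


def run_dataflow(graph, inputs):
    """graph: {node: (function, [operand names])}; inputs: {name: value}.

    Repeatedly fires every node whose operand tokens have all arrived.
    Returns the final token values and the nodes fired in each step.
    """
    tokens = dict(inputs)
    pending = dict(graph)
    steps = []
    while pending:
        enabled = [n for n, (_, args) in pending.items()
                   if all(a in tokens for a in args)]
        if not enabled:
            raise ValueError("graph has operands that never receive a token")
        # Enabled nodes are independent of one another, so their results are
        # computed from the current tokens before any new token is added.
        results = {n: pending[n][0](*(tokens[a] for a in pending[n][1]))
                   for n in enabled}
        tokens.update(results)
        for n in enabled:
            del pending[n]
        steps.append(sorted(enabled))
    return tokens, steps
```

For the expression (a + b) * (c - d), the addition and subtraction fire together in the first step and the multiplication fires in the second, without any explicit ordering in the program.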
TL;DR: This paper develops a simple and robust compiler for detecting superword level parallelism (SLP) that targets basic blocks rather than loop nests and is able to exploit parallelism both across loop iterations and within basic blocks.
Abstract: Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.
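The idea of finding SIMD-style parallelism inside a basic block can be sketched roughly as below. This is a much-simplified illustration and not the paper's algorithm: it only groups consecutive statements that apply the same operation to adjacent elements of the same arrays, and it assumes the grouped statements are independent of one another (no dependence check is performed). The statement encoding and the name `pack_isomorphic` are hypothetical.

```python
def pack_isomorphic(block, width=4):
    """block: list of (op, dst, a, b), where dst, a, b are (array, index) pairs.

    Groups consecutive statements with the same op and arrays whose indices
    each increase by one, with at most `width` statements per group.
    Assumes statements in a group do not depend on each other.
    """
    def follows(prev, cur):
        # Same operation, same arrays, and every index advanced by exactly 1.
        if prev[0] != cur[0]:
            return False
        return all(p[0] == c[0] and c[1] == p[1] + 1
                   for p, c in zip(prev[1:], cur[1:]))

    groups = []
    for stmt in block:
        if groups and len(groups[-1]) < width and follows(groups[-1][-1], stmt):
            groups[-1].append(stmt)
        else:
            groups.append([stmt])
    return groups
```

Each multi-statement group corresponds to one candidate SIMD operation of that many lanes; single-statement groups would remain scalar.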