TL;DR: The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest and it is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fullypermutable loop nests and wavefronting the fully permutable nests.
Abstract: An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. The efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five. >
TL;DR: This paper presents a new technique bridging the gap between fine-and coarse-grain loop parallelization, allowing the exploitation of parallelism inside and across loop iterations, and shows that, given a loop and a set of dependencies between its statements, the execution schedule is time optimal.
Abstract: Parallelizing compilers promise to exploit the parallelism available in a given program, particularly parallelism that is too low-level or irregular to be expressed by hand in an algorithm. However, existing parallelization techniques do not handle loops in a satisfactory manner. Fine-grain (instruction level) parallelization, or compaction, captures irregular parallelism inside a loop body but does not exploit parallelism across loop iterations. Coarser methods, such as doacross [9], sacrifice irregular forms of parallelism in favor of pipelining iterations (software pipelining). Both of these approaches often yield suboptimal speedups even under the best conditions-when resources are plentiful and processors are synchronous. In this paper we present a new technique bridging the gap between fine-and coarse-grain loop parallelization, allowing the exploitation of parallelism inside and across loop iterations. Furthermore, we show that, given a loop and a set of dependencies between its statements, the execution schedule obtained by our transformation is time optimal: no transformation of the loop based on the given data-dependencies can yield a shorter running time for that loop.
TL;DR: The thesis proposes a parallel version of the Rete match algorithm which exploits parallelism at a very fine grain, and suggests that a suitable architecture to exploit the fine-grained parallelism is a shared-memory multiprocessor, with 32-64 high performance processors.
Abstract: Production system programs, on the surface, appear to be capable of using large amounts of parallelism--it is possible to match each production in a program to the data memory in parallel. The thesis shows that in practice, however, the speed-up obtainable from parallelism is quite limited, around 15-fold as compared to initial expectations of 100-fold to 1000-fold. The main reasons for the limited speed-up from parallelism are: (1) there are only a small number of productions that require significant processing as a result of a change to working memory; and (2) there is a large variation in the processing requirement of these productions. To obtain a large fraction of the limited speed-up that is available, the thesis proposes a parallel version of the Rete match algorithm which exploits parallelism at a very fine grain. It further suggests that a suitable architecture to exploit the fine-grained parallelism is a shared-memory multiprocessor, with 32-64 high performance processors. For scheduling the fine grained tasks (consisting of about 50-100 instructions), a hardware task scheduler is proposed.
The thesis presents a large set of simulation results for production systems exploiting different sources of parallelism. The thesis points out the features of existing programs that limit the speed-up obtainable from parallelism and suggests solutions for some of the bottlenecks. The simulation results show that using the suggested multiprocessor architecture (with individual processors performing at 2 MIPS), it is possible to obtain execution speeds of 5000-27000 working memory element changes per second. This corresponds to a speed-up of 5-fold over the best known sequential implementation using a 2 MIPS processor. This performance is also higher than that obtained by other proposed parallel implementations of production systems.
TL;DR: This paper introduces a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work- first and help-first scheduling policies, and provides insights on scenarios in which the help- first policy yields better results than the work-first policy and vice versa.
Abstract: Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Thread Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we address the problem of efficient and scalable implementation of X10's async-finish task parallelism, which is more general than Cilk's spawn-sync parallelism. We introduce a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies. Performance results on two different multicore SMP platforms show significant improvements due to our new work-stealing algorithm compared to the existing work-sharing scheduler for X10, and also provide insights on scenarios in which the help-first policy yields better results than the work-first policy and vice versa.
TL;DR: This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperatibve language such as C or FORTRAN) from a serial execution trace of the program and uses the methodology to study parallelism in the SPEC benchmarks.
Abstract: A quantitative analysis of program execution is essential to the computer architecture design process. With the current trend in architecture of enhancing the performance of uniprocessors by exploiting fine-grain parallelism, first-order metrics of program execution, such as operation frequencies, are not sufficient; characterizing the exact nature of dependencies between operations is essential.This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperatibve language such as C or FORTRAN) from a serial execution trace of the program. It then uses the methodology to study parallelism in the SPEC benchmarks. We see that the prallelism can be bursty in nature (periods of lots of parallelism followed by periods of little parallelism), but the average parallelism is quite high, ranging from 13 to 23,302 operations per cycle. Exposing this parallelism requires renaming of both registers and memory, though renaming registers alone exposes much of this parallelism. We also see that fairly large windows of dynamic instructions would be required to expose this parallelism from a sequential instruction stream.