Scalable parallelism

Papers published on a yearly basis

Papers

Journal Article•10.1109/71.97902•

A loop transformation theory and an algorithm to maximize parallelism

Michael Wolf¹, Monica S. Lam¹•Institutions (1)

Stanford University¹

01 Oct 1991-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest, and it is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests.

Abstract: An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. The efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five.

727 citations
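
The legality test stated in the abstract above (a transformation is legal if every dependence vector stays lexicographically positive) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the matrix representation of a transformation, and the two example dependence vectors are assumptions chosen for illustration.

```python
def is_lex_positive(v):
    """Return True if the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive


def transform(T, d):
    """Apply the integer matrix T (a list of rows) to the dependence vector d."""
    return tuple(sum(t * x for t, x in zip(row, d)) for row in T)


def is_legal(T, deps):
    """A transformation is legal if every transformed dependence is lex-positive."""
    return all(is_lex_positive(transform(T, d)) for d in deps)


# Hypothetical dependence vectors of a two-deep loop nest over (i, j).
deps = [(1, -1), (0, 1)]

interchange = [[0, 1], [1, 0]]   # swap the i and j loops
skew = [[1, 0], [1, 1]]          # (i, j) -> (i, i + j)

print(is_legal(interchange, deps))   # (1, -1) becomes (-1, 1), so this is illegal
print(is_legal(skew, deps))          # dependences become (1, 0) and (0, 1)
```

After skewing, every component of every dependence vector is non-negative, which is the condition under which the two loops can be permuted freely; in this example, interchanging the loops is illegal on the original nest but legal once the skew has been applied.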

Journal Article•10.1145/960116.54021•

Optimal loop parallelization

Alex Aiken¹, Alexandru Nicolau¹•Institutions (1)

Cornell University¹

1 Jun 1988

TL;DR: This paper presents a new technique bridging the gap between fine- and coarse-grain loop parallelization, allowing the exploitation of parallelism inside and across loop iterations, and shows that, given a loop and a set of dependencies between its statements, the execution schedule obtained by the transformation is time optimal.

Abstract: Parallelizing compilers promise to exploit the parallelism available in a given program, particularly parallelism that is too low-level or irregular to be expressed by hand in an algorithm. However, existing parallelization techniques do not handle loops in a satisfactory manner. Fine-grain (instruction level) parallelization, or compaction, captures irregular parallelism inside a loop body but does not exploit parallelism across loop iterations. Coarser methods, such as doacross [9], sacrifice irregular forms of parallelism in favor of pipelining iterations (software pipelining). Both of these approaches often yield suboptimal speedups even under the best conditions, when resources are plentiful and processors are synchronous. In this paper we present a new technique bridging the gap between fine- and coarse-grain loop parallelization, allowing the exploitation of parallelism inside and across loop iterations. Furthermore, we show that, given a loop and a set of dependencies between its statements, the execution schedule obtained by our transformation is time optimal: no transformation of the loop based on the given data-dependencies can yield a shorter running time for that loop.

266 citations

Book•

Parallelism in production systems

Anoop Gupta

1 Jan 1987

TL;DR: The thesis proposes a parallel version of the Rete match algorithm which exploits parallelism at a very fine grain, and suggests that a suitable architecture to exploit the fine-grained parallelism is a shared-memory multiprocessor, with 32-64 high performance processors.

Abstract: Production system programs, on the surface, appear to be capable of using large amounts of parallelism--it is possible to match each production in a program to the data memory in parallel. The thesis shows that in practice, however, the speed-up obtainable from parallelism is quite limited, around 15-fold as compared to initial expectations of 100-fold to 1000-fold. The main reasons for the limited speed-up from parallelism are: (1) there are only a small number of productions that require significant processing as a result of a change to working memory; and (2) there is a large variation in the processing requirement of these productions. To obtain a large fraction of the limited speed-up that is available, the thesis proposes a parallel version of the Rete match algorithm which exploits parallelism at a very fine grain. It further suggests that a suitable architecture to exploit the fine-grained parallelism is a shared-memory multiprocessor, with 32-64 high performance processors. For scheduling the fine grained tasks (consisting of about 50-100 instructions), a hardware task scheduler is proposed. The thesis presents a large set of simulation results for production systems exploiting different sources of parallelism. The thesis points out the features of existing programs that limit the speed-up obtainable from parallelism and suggests solutions for some of the bottlenecks. The simulation results show that using the suggested multiprocessor architecture (with individual processors performing at 2 MIPS), it is possible to obtain execution speeds of 5000-27000 working memory element changes per second. This corresponds to a speed-up of 5-fold over the best known sequential implementation using a 2 MIPS processor. This performance is also higher than that obtained by other proposed parallel implementations of production systems.

224 citations

Proceedings Article•10.1109/IPDPS.2009.5161079•

Work-first and help-first scheduling policies for async-finish task parallelism

Yi Guo¹, Rajkishore Barik¹, Raghavan Raman¹, Vivek Sarkar¹•Institutions (1)

Rice University¹

23 May 2009

TL;DR: This paper introduces a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies, and provides insights on scenarios in which the help-first policy yields better results than the work-first policy and vice versa.

Abstract: Multiple programming models are emerging to address an increased need for dynamic task parallelism in applications for multicore processors and shared-address-space parallel computing. Examples include OpenMP 3.0, Java Concurrency Utilities, Microsoft Task Parallel Library, Intel Thread Building Blocks, Cilk, X10, Chapel, and Fortress. Scheduling algorithms based on work stealing, as embodied in Cilk's implementation of dynamic spawn-sync parallelism, are gaining in popularity but also have inherent limitations. In this paper, we address the problem of efficient and scalable implementation of X10's async-finish task parallelism, which is more general than Cilk's spawn-sync parallelism. We introduce a new work-stealing scheduler with compiler support for async-finish task parallelism that can accommodate both work-first and help-first scheduling policies. Performance results on two different multicore SMP platforms show significant improvements due to our new work-stealing algorithm compared to the existing work-sharing scheduler for X10, and also provide insights on scenarios in which the help-first policy yields better results than the work-first policy and vice versa.

194 citations

Proceedings Article•10.1145/139669.140395•

Dynamic dependency analysis of ordinary programs

Todd Austin¹, Gurindar S. Sohi¹•Institutions (1)

University of Wisconsin-Madison¹

1 Apr 1992

TL;DR: This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperative language such as C or FORTRAN) from a serial execution trace of the program and uses the methodology to study parallelism in the SPEC benchmarks.

Abstract: A quantitative analysis of program execution is essential to the computer architecture design process. With the current trend in architecture of enhancing the performance of uniprocessors by exploiting fine-grain parallelism, first-order metrics of program execution, such as operation frequencies, are not sufficient; characterizing the exact nature of dependencies between operations is essential. This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperative language such as C or FORTRAN) from a serial execution trace of the program. It then uses the methodology to study parallelism in the SPEC benchmarks. We see that the parallelism can be bursty in nature (periods of lots of parallelism followed by periods of little parallelism), but the average parallelism is quite high, ranging from 13 to 23,302 operations per cycle. Exposing this parallelism requires renaming of both registers and memory, though renaming registers alone exposes much of this parallelism. We also see that fairly large windows of dynamic instructions would be required to expose this parallelism from a sequential instruction stream.

194 citations
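
The idea in the abstract above, deriving available parallelism from a serial trace and observing how renaming affects it, can be sketched as follows. This is a simplified illustration, not the paper's methodology: the trace format (a destination plus a list of sources per operation), the unit latency of every operation, the unbounded instruction window, and the function name are all assumptions.

```python
def average_parallelism(trace, rename=True):
    """Estimate operations per cycle from a serial trace.

    trace  : list of (dest, sources) pairs in serial execution order.
    rename : if True, only true (read-after-write) dependencies constrain
             the schedule; if False, write-after-read and write-after-write
             dependencies on reused storage are honored as well.
    Assumes every operation takes one cycle and the window is unbounded.
    """
    last_write = {}  # location -> cycle of the most recent write
    last_read = {}   # location -> latest cycle of a read since that write
    depth = 0        # critical-path length in cycles

    for dest, sources in trace:
        # An operation runs one cycle after the latest producer of its inputs.
        level = 1 + max((last_write.get(s, 0) for s in sources), default=0)
        if not rename:
            # Without renaming, the write must also follow earlier uses of dest.
            level = max(level,
                        last_write.get(dest, 0) + 1,
                        last_read.get(dest, 0) + 1)
        for s in sources:
            last_read[s] = max(last_read.get(s, 0), level)
        last_write[dest] = level
        last_read[dest] = 0
        depth = max(depth, level)

    return len(trace) / depth if depth else 0.0


# Hypothetical four-operation trace in which register r1 is reused.
trace = [
    ("r1", ["a"]),    # r1 = load a
    ("r2", ["r1"]),   # r2 = r1 + 1
    ("r1", ["b"]),    # r1 = load b (reuses r1)
    ("r3", ["r1"]),   # r3 = r1 + 1
]

print(average_parallelism(trace, rename=True))
print(average_parallelism(trace, rename=False))
```

In this toy trace the two load/add pairs are independent, so with renaming they overlap and the critical path is two cycles; without renaming, the reuse of r1 serializes all four operations.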

Year	Papers
2021	1
2020	3
2019	4
2018	2
2017	10
2016	8
