A compiler optimization algorithm for shared-memory multiprocessors

doi:10.1109/71.706049

Journal Article

Kathryn S. McKinley

- 01 Aug 1998

- IEEE Transactions on Parallel and Distri...

- Vol. 9, Iss: 8, pp 769-787

40

TL;DR: Presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from the authors' compiler optimizations.

Abstract: This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were initially hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
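
To make the phrase "a simple cache model" more concrete, here is a minimal sketch of how such a model might rank loop orders for locality. It is an illustration only, not the algorithm from the paper: the function names (`ref_cost`, `loop_cost`, `best_inner_loop`), the three-way cost classification, and the parameters (`trip`, `line_elems`) are all assumptions made for this example. It assumes Fortran-style column-major arrays, one loop variable per subscript, and the same trip count for every loop.

```python
from math import ceil

def ref_cost(subscripts, inner, trip, line_elems):
    """Estimate cache lines touched by one array reference while the
    loop `inner` runs `trip` iterations.

    `subscripts` names the loop variable used in each array dimension,
    fastest-varying (first, column-major) dimension first.
    """
    if inner not in subscripts:
        return 1                        # loop-invariant: the same line is reused
    if subscripts[0] == inner and inner not in subscripts[1:]:
        return ceil(trip / line_elems)  # unit stride: one new line per line_elems iterations
    return trip                         # non-unit stride: assume a new line every iteration

def loop_cost(refs, inner, trip, line_elems):
    """Total estimated cache lines touched when `inner` is the innermost loop."""
    return sum(ref_cost(r, inner, trip, line_elems) for r in refs)

def best_inner_loop(refs, loops, trip, line_elems):
    """Pick the loop whose placement innermost gives the lowest estimated cost."""
    return min(loops, key=lambda loop: loop_cost(refs, loop, trip, line_elems))

# References in C(i,j) = C(i,j) + A(i,k) * B(k,j), written as subscript tuples.
MATMUL_REFS = [("i", "j"), ("i", "k"), ("k", "j")]

if __name__ == "__main__":
    for loop in ("i", "j", "k"):
        print(loop, loop_cost(MATMUL_REFS, loop, trip=100, line_elems=4))
    print("best innermost loop:", best_inner_loop(MATMUL_REFS, ("i", "j", "k"), 100, 4))
```

Under these assumptions the model prefers `i` as the innermost loop for the matrix-multiply references, because two of the three references then have unit stride and the third is loop-invariant. A real compiler would also have to check with dependence analysis that the chosen loop order is legal, and would weigh locality against parallelism and its granularity, as the abstract describes.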

Citations

Proceedings Article•10.1145/782814.782836

Estimating cache misses and locality using stack distances

Calin Caşcaval, +1 more

- 23 Jun 2003

TL;DR: This paper presents a method to estimate the number of cache misses at compile time, using a machine-independent model based on stack algorithms, which provides a very good approximation for set-associative caches and programs with non-constant dependence distances.

194

Patent

Loop optimization with mapping code on an architecture

Koen Danckaert, +1 more

- 31 Jan 2000

TL;DR: Presents a loop transformation step that improves the data locality and regularity of the algorithm described by the code; the step works globally and is feasible for realistic code sizes.

57

Journal Article•10.1016/J.SIMPAT.2006.11.014

Performance modeling of communication and computation in hybrid MPI and OpenMP applications

Laksono Adhianto, +1 more

- 01 Apr 2007

- Simulation Modelling Practice and Theory

TL;DR: The construction of a model that is based upon a small number of parameters, but is able to capture the complexity of the runtime system is proposed, and how this tool can be applied to a sample code is shown.

56

Proceedings Article•10.1109/ICPADS.2006.81

Performance modeling of communication and computation in hybrid MPI and OpenMP applications

Laksono Adhianto, +1 more

- 12 Jul 2006

TL;DR: This paper proposes the construction of a model that is based upon a small number of parameters, but is able to capture the complexity of the runtime system, and describes the underlying framework, the performance model, and shows how it can be applied to a sample code.

52

Journal Article•10.1016/J.JPDC.2005.09.003

Optimizing locality and scalability of embedded Runge-Kutta solvers using block-based pipelining

Matthias Korch, +1 more

- 01 Mar 2006

- Journal of Parallel and Distributed Comp...

TL;DR: This paper considers embedded Runge-Kutta methods for the solution of ordinary differential equations and explores how the potential parallelism in the stage vector computation of such equations can be exploited in a pipelining approach leading to a better locality behavior and a higher scalability.

43

References

Book

LINPACK Users' Guide

Jack Dongarra, +3 more

- 01 Jan 1987

TL;DR: Covers general matrices, band matrices, positive definite matrices, positive definite band matrices, symmetric indefinite matrices, triangular matrices, tridiagonal matrices, the Cholesky decomposition, the QR decomposition, and the singular value decomposition.

1.7K

Proceedings Article•10.1145/567532.567555

Dependence graphs and compiler optimizations

David J. Kuck, +4 more

- 26 Jan 1981

TL;DR: This paper defines dependence graphs and discusses two kinds of transformations: simple rewriting transformations that remove dependence arcs, and abstraction transformations that deal more globally with a dependence graph.

752

Journal Article•10.1109/71.97902

A loop transformation theory and an algorithm to maximize parallelism

Michael Wolf, +1 more

- 01 Oct 1991

- IEEE Transactions on Parallel and Distri...

TL;DR: The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest, and it is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests.

727

Journal Article•10.1145/233561.233564

Improving data locality with loop transformations

Kathryn S. McKinley, +2 more

- 01 Jul 1996

- ACM Transactions on Programming Language...

TL;DR: This article presents compiler optimizations to improve data locality based on a simple yet accurate cost model and finds that performance improvements were difficult to achieve, although the optimizations improved several programs.

590

Book Chapter•10.1007/978-3-642-48417-9_2

Direct Search Methods on Parallel Machines

John E. Dennis, +1 more

- 01 Nov 1991

- SIAM Journal on Optimization

TL;DR: Direct search methods are methods designed to solve unconstrained minimization problems of the form $\min_{x \in \mathbb{R}^n} f(x)$, distinguished by the fact that they neither use nor require explicit derivative information; the search for a local minimizer is driven solely by function information.

352

Related Papers (5)

Improving data locality with loop transformations

A data locality optimizing algorithm

Exploiting task and data parallelism on a multicomputer

A matching approach to utilizing fine-grained parallelism

Parallelization of benchmarks for scalable shared-memory multiprocessors