Automatic and interactive parallelization

Open Access Dissertation

- 01 Jan 1992

45

TL;DR: This dissertation provides automatic compilation techniques that tailor parallel algorithms to shared-memory multiprocessors with local caches and a common bus, and develops novel, general algorithms for transforming loops that contain arbitrary conditional control flow, yielding an automatic parallelizer that is applicable to complete programs.

Abstract: The goal of this dissertation is to give programmers the ability to achieve high performance by focusing on developing parallel algorithms, rather than on architecture-specific details. The advantages of this approach also include program portability and legibility. To achieve high performance, we provide automatic compilation techniques that tailor parallel algorithms to shared-memory multiprocessors with local caches and a common bus. In particular, the compiler maps complete applications onto the specifics of a machine, exploiting both parallelism and memory. To optimize complete applications, we develop novel, general algorithms to transform loops that contain arbitrary conditional control flow. In addition, we provide new interprocedural transformations which enable optimization across procedure boundaries. These techniques provide the basis for a robust automatic parallelizing algorithm that is applicable to complete programs. The algorithm for automatic parallel code generation takes into consideration the interaction of parallelism and data locality, as well as the overhead of parallelism. The algorithm is based on a simple cost model that accurately predicts cache line reuse from multiple accesses to the same memory location and from consecutive accesses. The optimizer uses this model to improve data locality. It also uses the model to discover and introduce effective parallelism that complements the benefits of data locality. The optimizer further improves the effectiveness of parallelism by seeking to increase its granularity. Parallelism is introduced only when granularity is sufficient to overcome its associated costs. The algorithm for parallel code generation is shown to be efficient and several of its component algorithms are proven optimal. The efficacy of the optimizer is illustrated with experimental results. In most cases, it is very effective and either achieves or improves the performance of hand-crafted parallel programs. 
When performance is not satisfactory, we provide an interactive parallel programming tool which combines compiler analysis and algorithms with human expertise.
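
The abstract describes the cost model only in outline: it predicts cache line reuse from repeated accesses to the same location and from consecutive accesses, and parallelism is introduced only when granularity outweighs its cost. The following is a minimal sketch of that idea, not the dissertation's actual algorithm. All names (`ArrayRef`, `ref_cost`, `loop_cost`, `best_inner_loop`, `worth_parallelizing`), the three-way stride classification, and the fork/join overhead formula are assumptions made for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ArrayRef:
    """One array reference in a loop nest (hypothetical representation).

    `stride` maps a loop name to the number of elements the reference
    advances per iteration of that loop; a missing loop means stride 0.
    """
    name: str
    stride: dict


def ref_cost(ref, loop, trip, line_elems):
    """Estimated cache lines touched by `ref` if `loop` runs innermost."""
    s = abs(ref.stride.get(loop, 0))
    if s == 0:
        return 1                             # same location every iteration
    if s < line_elems:
        return ceil(trip * s / line_elems)   # consecutive accesses share lines
    return trip                              # a new line on every iteration


def loop_cost(refs, loop, trips, line_elems):
    """Estimated cache lines touched by the whole nest with `loop` innermost."""
    inner = sum(ref_cost(r, loop, trips[loop], line_elems) for r in refs)
    outer = 1
    for name, trip in trips.items():
        if name != loop:
            outer *= trip
    return inner * outer


def best_inner_loop(refs, trips, line_elems):
    """Pick the loop whose innermost placement touches the fewest lines."""
    return min(trips, key=lambda l: loop_cost(refs, l, trips, line_elems))


def worth_parallelizing(trip, work_per_iter, overhead, procs):
    """Assumed granularity test: parallel time must beat sequential time."""
    sequential = trip * work_per_iter
    parallel = ceil(trip / procs) * work_per_iter + overhead
    return parallel < sequential


# Column-major matrix multiply, C(i,j) = C(i,j) + A(i,k) * B(k,j), N = 100.
N = 100
refs = [
    ArrayRef("C", {"i": 1, "j": N}),
    ArrayRef("A", {"i": 1, "k": N}),
    ArrayRef("B", {"k": 1, "j": N}),
]
trips = {"i": N, "j": N, "k": N}
costs = {l: loop_cost(refs, l, trips, line_elems=4) for l in trips}
```

Under these assumptions, placing the `i` loop innermost gives the lowest estimate for the matrix-multiply nest, because two of the three references then walk consecutive elements and the third is loop-invariant. The granularity test rejects short loops whose total work cannot amortize the assumed fixed overhead.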

Citations

•Journal Article•10.1145/233561.233564

Improving data locality with loop transformations

Kathryn S. McKinley, +2 more

- 01 Jul 1996

- ACM Transactions on Programming Language...

TL;DR: This article presents compiler optimizations that improve data locality based on a simple yet accurate cost model, and finds that performance improvements were difficult to achieve, although the optimizations improved several programs.

590

Proceedings Article•10.1145/782814.782836

Estimating cache misses and locality using stack distances

Călin Caşcaval, +1 more

- 23 Jun 2003

TL;DR: This paper presents a method to estimate the number of cache misses, at compile time, using a machine independent model based on stack algorithms, which provides a very good approximation for set-associative caches and programs with non-constant dependence distances.

194

Proceedings Article•10.1145/155332.155336

Experiences using the ParaScope Editor: an interactive parallel programming tool

Mary Hall, +7 more

- 01 Jul 1993

TL;DR: The ParaScope Editor is a new kind of program construction tool; one that not only manages text, but also presents the user with insights into the semantic structure of the program being constructed.

81

Optimization within a unified transformation framework

Wayne Kelly, +1 more

- 01 Jan 1996

48

Journal Article•10.1109/71.706049

A compiler optimization algorithm for shared-memory multiprocessors

Kathryn S. McKinley

- 01 Aug 1998

- IEEE Transactions on Parallel and Distri...

TL;DR: Presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors; the results suggest that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrate that user-parallelized codes often benefit from the authors' compiler optimizations.

40

References

•Book

The Design and Analysis of Computer Algorithms

Alfred V. Aho, +1 more

- 01 Jan 1974

TL;DR: This text introduces the basic data structures and programming techniques often used in efficient algorithms, and covers use of lists, push-down stacks, queues, trees, and graphs.

10.6K

•Journal Article•10.1145/24039.24041

The program dependence graph and its use in optimization

Jeanne Ferrante, +2 more

- 01 Jul 1987

- ACM Transactions on Programming Language...

TL;DR: An intermediate program representation, called the program dependence graph (PDG), that makes explicit both the data and control dependences for each operation in a program, allowing transformations to be triggered by one another and applied only to affected dependences.

2.8K

•Book Chapter•10.1007/3-540-12925-1_33

The Program Dependence Graph and its Use in Optimization

Jeanne Ferrante, +2 more

- 17 Apr 1984

TL;DR: An intermediate program representation, called a program dependence graph (PDG), that summarizes both the data dependences and the control dependences of each operation, so that transformations such as vectorization can be performed uniformly for both kinds of dependence.

1.8K

Proceedings Article•10.1145/106972.106981

The cache performance and optimizations of blocked algorithms

Monica D. Lam, +2 more

- 01 Apr 1991

TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.

1K

An extended set of Fortran Basic Linear Algebra Subprograms: model implementation and test programs

Jack Dongarra, +3 more

- 01 Jan 1987

TL;DR: In this article, a model implementation and test software for Level 2 Basic Linear Algebra Subprograms (Level 2 BLAS) is described, targeted at matrix-vector operations with the aim of providing more efficient, but portable, implementations of algorithms on high-performance computers.

942

Related Papers (5)

A data locality optimizing algorithm

Interprocedural transformations for parallel code generation

Dependence graphs and compiler optimizations

The cache performance and optimizations of blocked algorithms

On Estimating and Enhancing Cache Effectiveness