What is the idea of looking independently at each bottleneck to bound the execution time?

The idea of looking independently at each bottleneck to bound the execution time is developed in the roofline model (Williams et al. 2009) and known to work well on highly parallel architectures and deep memory hierarchies, where parallelism or a dominating factor hides secondary bottlenecks.
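A minimal sketch of this idea, assuming a simplified model in which each bottleneck is described by an amount of work and a peak throughput. The function name, the bottleneck names, and all numbers are hypothetical and are not taken from the paper or from the roofline article.

```python
def bottleneck_lower_bound(work, throughput):
    """Return a lower bound on execution time in seconds.

    Each bottleneck is considered on its own, as if it were the only
    limit; the slowest of them bounds the total time from below.
    """
    return max(work[name] / throughput[name] for name in work)


# Hypothetical figures, for illustration only.
work = {"flops": 2.0e9, "dram_bytes": 4.0e8}           # amount to process
throughput = {"flops": 4.0e12, "dram_bytes": 2.0e11}   # amount per second

bound = bottleneck_lower_bound(work, throughput)
print(f"lower bound: {bound:.4f} s")  # memory traffic dominates here
```

Taking the maximum rather than the sum is what keeps the result a lower bound: it assumes the bottlenecks overlap perfectly, which is the optimistic case.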

How many implementations of each function can be provided?

On 256 × 256 matrices, the authors beat Cublas by a wide margin (4.2× speedup on average) thanks to Telamon’s ability to adapt the generated code to the size of the input, while libraries can only provide a few implementations of each function.

How does the author find the promising branches?

De Mesmay et al. (2009) propose detecting the most promising branches to explore using a Monte-Carlo approach: a branch is evaluated by randomly selecting a few candidate implementations among its descendants and evaluating them on the GPU.
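The Monte-Carlo evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the cited authors' implementation: `evaluate_branch` is a hypothetical name, `measure` stands in for running a candidate on the GPU, and scoring a branch by its best sampled time is an assumption (the excerpt does not say how samples are aggregated).

```python
import random


def evaluate_branch(descendants, measure, num_samples, rng):
    """Score a branch from a few randomly chosen descendants.

    descendants: list of fully specified candidates under the branch.
    measure:     callable returning the run time of one candidate
                 (a stand-in for an actual GPU run).
    Returns the best sampled time (assumed aggregation rule).
    """
    k = min(num_samples, len(descendants))
    sampled = rng.sample(descendants, k)
    return min(measure(candidate) for candidate in sampled)


# Toy data: made-up run times for three candidates.
toy_times = {"a": 3.0, "b": 1.5, "c": 2.0}
score = evaluate_branch(list(toy_times), toy_times.get, 2, random.Random(0))
```

Passing the random generator explicitly keeps the sampling reproducible, which helps when comparing branch scores across runs.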

What is the function that prunes a part of the search space?

The bound follows the execution time and makes it possible to prune part of the search space: all candidates whose lower bound is higher than the best execution time found so far are pruned.
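The pruning rule can be written in a few lines. This is a sketch only: `prune` and the bound values are hypothetical, and `lower_bound` stands in for the paper's performance model.

```python
def prune(candidates, lower_bound, best_time):
    """Drop every candidate whose lower bound is higher than best_time."""
    return [c for c in candidates if lower_bound(c) <= best_time]


# Made-up lower bounds (seconds) for three candidates.
bounds = {"A": 0.8, "B": 1.2, "C": 1.0}
survivors = prune(list(bounds), bounds.get, best_time=1.0)
print(survivors)  # ['A', 'C']
```

Candidate "B" is discarded because even its most optimistic estimate is slower than a time that has already been achieved.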

What is the used function in deep learning?

It is implemented by hardware vendors for all GPUs and is among the most-used functions in compute-intensive applications such as deep learning.

How is the performance model made possible?

This is made possible through a performance model that provides a lower bound on the execution time of the codes that may be generated from a partially specified implementation.

Open Access•Proceedings Article•10.1145/3033019.3033023

Optimization space pruning without regrets

Ulysse Beaugnon, +4 more

- 05 Feb 2017

- pp 34-44

TL;DR: A novel approach to automatically discover the best performing code from a given set of possible implementations, involving a branch and bound algorithm with two distinctive features: an analytic performance model of a lower bound on the execution time, and the ability to estimate such bounds on a partially-specified implementation.

Abstract: Many computationally-intensive algorithms benefit from the wide parallelism offered by Graphical Processing Units (GPUs). However, the search for a close-to-optimal implementation remains extremely tedious due to the specialization and complexity of GPU architectures. We present a novel approach to automatically discover the best performing code from a given set of possible implementations. It involves a branch and bound algorithm with two distinctive features: (1) an analytic performance model of a lower bound on the execution time, and (2) the ability to estimate such bounds on a partially-specified implementation. The unique features of this performance model allow to aggressively prune the optimization space without eliminating the best performing implementation. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures. We implemented our algorithm in a tool called Telamon and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space. The information provided by the performance model also helps to identify ways to enrich the search space to consider better candidates, or to highlight architectural bottlenecks.
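To make the two features named in the abstract concrete, here is a toy branch and bound in which the lower bound is evaluated on partially specified implementations. It is a sketch under strong assumptions and not Telamon's algorithm: an implementation is a tuple of decisions, its run time is the sum of the chosen costs, and all names and numbers are hypothetical.

```python
def branch_and_bound(options, lower_bound, run_time):
    """Depth-first search that prunes partial implementations by bound."""
    best_time, best_impl = float("inf"), None
    stack = [()]  # start from the fully unspecified implementation
    while stack:
        partial = stack.pop()
        if lower_bound(partial) > best_time:
            continue  # no completion of this partial choice can win
        if len(partial) == len(options):
            measured = run_time(partial)
            if measured < best_time:
                best_time, best_impl = measured, partial
            continue
        for choice in options[len(partial)]:
            stack.append(partial + (choice,))
    return best_impl, best_time


# Toy space: three decisions, each option is simply its cost.
options = [(3, 1), (2, 5), (4, 2)]


def run_time(impl):
    return sum(impl)


def lower_bound(partial):
    # Costs already fixed, plus the cheapest option for each open decision.
    remaining = options[len(partial):]
    return sum(partial) + sum(min(opts) for opts in remaining)


best_impl, best_time = branch_and_bound(options, lower_bound, run_time)
print(best_impl, best_time)  # (1, 2, 2) 5
```

Because the bound never overestimates the time of any completion, pruning cannot discard the best implementation, which is the "without regrets" property the title refers to.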

Most frequently asked questions

1. What are the contributions in "Optimization space pruning without regrets" ?

The authors present a novel approach to automatically discover the best performing code from a given set of possible implementations. While the space considered in this paper focuses on GPUs, the approach is generic enough to be applied to other architectures. The authors implemented their algorithm in a tool called Telamon and demonstrate its effectiveness on a huge, architecture-specific and input-sensitive optimization space. The information provided by the performance model also helps to identify ways to enrich the search space to consider better candidates, or to highlight architectural bottlenecks.

2. What future works have the authors mentioned in the paper "Optimization space pruning without regrets" ?

The intermediate representation would need to be extended to capture that domain’s computational structure and to adapt the performance model to consider the bottlenecks of the target platform. One interesting possibility would be to generate domain-specific intermediate representations and the associated exploration scheme from a high-level description of the optimization requirements.

3. What is the way to compute the bounds at the level of a thread?

When the authors compute the bounds at the level of a thread or block, they assume it runs in isolation, as contention between threads or between blocks is taken into account by the bounds at a coarser level of parallelism.

4. What is the way to solve the problem of parallelism?

It requires a deep knowledge of the architecture and many trials and errors for the programmer to explore implementation alternatives at each level of parallelism, from thread-local optimizations to the mapping of computations across the entire processor.

Table 5. Performance of the code generated by Telamon

Figure 2. Nodes pruned at each level of the search tree

Table 4. Pair-wise ordering of instructions and dimensions

Table 1. Implementation choices for iteration dimensions

Table 2. Implementation choices for memory instructions

Citations

Proceedings Article•10.1145/3313808.3313819

Dynamic application reconfiguration on heterogeneous hardware

Juan Fumero, +5 more

- 14 Apr 2019

TL;DR: Through TornadoVM, a virtual machine capable of reconfiguring applications, at runtime, for hardware acceleration based on the currently available hardware resources, this paper introduces a new level of compilation in which applications can benefit from heterogeneous hardware.

•Journal Article•10.1145/3469030

Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation

Tobias Gysi, +8 more

- 03 Sep 2021

- ACM Transactions on Architecture and Cod...

TL;DR: In this paper, the authors propose a multi-level rewriting approach for the weather and climate domain, which instantiates a hierarchy of dialects, lowers programs level-by-level, and performs code transformations at the most suitable level.

•Journal Article•10.1145/3570641

Autotuning Convolutions Is Easier Than You Think

Nicolas Tollenaere, +7 more

- 08 Nov 2022

- ACM Transactions on Architecture and Cod...

TL;DR: In this article, the authors define a structured configuration space that enables much faster convergence to high-performance code versions, using only random sampling of candidates, and analyze the impact on convergence speed and performance distribution, on two Intel x86 processors and one ARM AArch64 processor.

•Journal Article•10.1145/3109482

Iterative Schedule Optimization for Parallelization in the Polyhedron Model

Stefan Ganser, +4 more

- 22 Aug 2017

- ACM Transactions on Architecture and Cod...

TL;DR: The approach of iterative optimization outperforms existing optimization techniques in that it finds loop transformations that yield significantly higher performance and is evaluated against existing iterative and model-driven optimization strategies.

•Proceedings Article•10.1145/3567955.3567961

TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators

Martin Haas, +3 more

- 19 Dec 2022

TL;DR: In this article, a heuristic-based search with domain-specific knowledge is combined with a solver to solve the memory allocation problem in the context of machine learning systems, where the task of the memory allocator is to choose buffer locations in device memory such that the total amount of used memory never exceeds the total memory available on-device.

References

•Journal Article•10.1145/1498765.1498785

Roofline: an insightful visual performance model for multicore architectures

Samuel Williams, +2 more

- 01 Apr 2009

- Communications of The ACM

TL;DR: The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.

2.6K

Proceedings Article•10.1145/1401132.1401152

Scalable parallel programming with CUDA

John R. Nickolls, +3 more

- 11 Aug 2008

TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.

2.3K

•Journal Article•10.1109/MCSE.2010.69

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

John E. Stone, +2 more

- 01 May 2010

- Computing in Science and Engineering

TL;DR: The OpenCL standard offers a common API for program execution on systems composed of different types of computational devices such as multicore CPUs, GPUs, or other accelerators.

1.7K

GPU Computing

John D. Owens, +5 more

- 01 May 2008

TL;DR: The background, hardware, and programming model for GPU computing is described, the state of the art in tools and techniques are summarized, and four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications are presented.

1.7K

•Journal Article•10.1145/1365490.1365500

Scalable Parallel Programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for?

John R. Nickolls, +3 more

- 01 Mar 2008

- ACM Queue

TL;DR: In this article, the authors present a framework to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism on manycore GPUs with widely varying numbers of cores.

1.4K

Related Papers (5)

Think globally, search locally

FANG: Fast and Efficient Successor-State Generation for Heuristic Optimization on GPUs

Utility distribution matters: enabling fast belief propagation for multi-agent optimization with dense local utility function

Accelerating Exact and Approximate Inference for (Distributed) Discrete Optimization with GPUs

Data intensive parallel tree algorithm patterns based on GPUs