Dynamic optimization of data-flow task-parallel applications for large-scale NUMA systems

Open Access • Dissertation

- 25 Jun 2015

2

TL;DR: This thesis explores the main factors affecting the performance and data locality of task-parallel programs and proposes a set of transparent, portable and fully automatic on-line mechanisms that map tasks to cores and data to memory controllers in order to improve data locality and performance.

Abstract: Within the last decade, microprocessor development reached a point at which higher clock rates and more complex micro-architectures became less energy-efficient, such that power consumption and energy density were pushed beyond reasonable limits. As a consequence, the industry has shifted to more energy-efficient multi-core designs, integrating multiple processing units (cores) on a single chip. The number of cores is expected to grow exponentially and future systems are expected to integrate thousands of processing units. In order to provide sufficient memory bandwidth in these systems, main memory is physically distributed over multiple memory controllers with non-uniform access to memory (NUMA). Past research has identified programming models based on fine-grained, dependent tasks as a key technique to unleash the parallel processing power of massively parallel general-purpose computing architectures. However, the execution of task-parallel programs on architectures with non-uniform memory access and the dynamic optimizations to mitigate NUMA effects have received little attention. In this thesis, we explore the main factors affecting the performance and data locality of task-parallel programs and propose a set of transparent, portable and fully automatic on-line mapping mechanisms for tasks to cores and data to memory controllers in order to improve data locality and performance. Placement decisions are based on information about point-to-point data dependences, readily available in the run-time systems of modern task-parallel programming frameworks. The experimental evaluation of these techniques is conducted on our implementation in the run-time of the OpenStream language and a set of high-performance scientific benchmarks. Finally, we designed and implemented Aftermath, a tool for performance analysis and debugging of task-parallel applications and run-times.
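The abstract states that placement decisions are derived from point-to-point data dependences but does not spell out the heuristics. As a rough illustration only, and not the thesis's actual algorithm, the sketch below picks the NUMA node for a task by summing the bytes of its input dependences per node and choosing the node that already holds the most data. The function name `pick_node` and the `(node_id, size_in_bytes)` representation of a dependence are assumptions made for this example.

```python
from collections import defaultdict


def pick_node(input_deps, default_node=0):
    """Return the NUMA node holding the most input bytes for one task.

    input_deps: iterable of (node_id, size_in_bytes) pairs, one pair per
    point-to-point data dependence of the task. This representation is
    hypothetical and chosen only for illustration.
    """
    bytes_per_node = defaultdict(int)
    for node, size in input_deps:
        bytes_per_node[node] += size

    # A task without inputs carries no locality information.
    if not bytes_per_node:
        return default_node

    # Prefer the node with the most bytes; break ties toward the lowest
    # node id so that the decision is deterministic.
    return min(bytes_per_node, key=lambda n: (-bytes_per_node[n], n))


# Example: 5 KiB of input on node 0 and 8 KiB on node 1.
chosen = pick_node([(0, 4096), (1, 8192), (0, 1024)])
```

A real run-time would also have to weigh load balance and the placement of output buffers, which this sketch deliberately ignores.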

Citations

Automatic Detection of Performance Anomalies in Task-Parallel Programs

Andi Drebes, +4 more

- 29 May 2014

TL;DR: Aftermath is an interactive tool for trace-based performance analysis and debugging of task-parallel programs and run-time systems; it provides visualization, statistics and analysis tools adapted for performance debugging at task granularity.

4

Bounded Stream Scheduling in Polyhedral OpenStream

Nuno Miguel Nobre, +3 more

- 22 Jan 2020

TL;DR: This work exploits an approach that certifies that augmenting the intrinsic dataflow dependencies of the program with stream bounding constraints does not deadlock the program: it cannot show the existence of a deadlock but can give a certificate for the absence of it.

1

References

•Book

LAPACK Users' Guide

Ed Anderson

- 01 Feb 1995

TL;DR: The third edition of LAPACK provided a guide to troubleshooting and installation of Routines, as well as providing examples of how to convert from LINPACK or EISPACK to BLAS.

3.2K

•Proceedings Article•10.1145/1468075.1468121

Sorting networks and their applications

Kenneth E. Batcher

- 30 Apr 1968

TL;DR: To achieve high throughput rates today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.

2.7K

•Journal Article•10.1177/109434209100500306

The NAS Parallel Benchmarks

David H. Bailey, +12 more

- 01 Sep 1991

TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.

2.4K

•Journal Article•10.1006/JPDC.1996.0107

Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe, +5 more

- 25 Aug 1996

- Journal of Parallel and Distributed Comp...

TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

1.7K

•Proceedings Article•10.1145/277650.277725

The implementation of the Cilk-5 multithreaded language

Matteo Frigo, +2 more

- 01 May 1998

TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.

1.4K

Related Papers (5)

A library for portable and composable data locality optimizations for NUMA systems

A NUMA Aware Scheduler for a Parallel Sparse Direct Solver

Empirical Memory-Access Cost Models in Multicore NUMA Architectures

(Mis)understanding the NUMA memory system performance of multithreaded workloads

Locality Aware Task Scheduling in Parallel Data Stream Processing