Open Access Dissertation
Dynamic optimization of data-flow task-parallel applications for large-scale NUMA systems
Andi Drebes
- 25 Jun 2015
TL;DR: This thesis explores the main factors affecting the performance and data locality of task-parallel programs and proposes a set of transparent, portable and fully automatic on-line mechanisms that map tasks to cores and data to memory controllers in order to improve data locality and performance.
Abstract: Within the last decade, microprocessor development reached a point at which higher clock rates and more complex micro-architectures became less energy-efficient, such that power consumption and energy density were pushed beyond reasonable limits. As a consequence, the industry has shifted to more energy-efficient multi-core designs, integrating multiple processing units (cores) on a single chip. The number of cores is expected to grow exponentially and future systems are expected to integrate thousands of processing units. In order to provide sufficient memory bandwidth in these systems, main memory is physically distributed over multiple memory controllers with non-uniform access to memory (NUMA). Past research has identified programming models based on fine-grained, dependent tasks as a key technique to unleash the parallel processing power of massively parallel general-purpose computing architectures. However, the execution of task-parallel programs on architectures with non-uniform memory access and the dynamic optimizations to mitigate NUMA effects have received little attention. In this thesis, we explore the main factors affecting the performance and data locality of task-parallel programs and propose a set of transparent, portable and fully automatic on-line mechanisms that map tasks to cores and data to memory controllers in order to improve data locality and performance. Placement decisions are based on information about point-to-point data dependences, readily available in the run-time systems of modern task-parallel programming frameworks. The experimental evaluation of these techniques is conducted on our implementation in the run-time of the OpenStream language and a set of high-performance scientific benchmarks. Finally, we designed and implemented Aftermath, a tool for performance analysis and debugging of task-parallel applications and run-times.
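The idea of basing placement decisions on point-to-point data dependences can be illustrated with a minimal sketch. This is not the thesis's algorithm or the OpenStream run-time API: the function `pick_node`, its input format (one `(node_id, num_bytes)` pair per incoming dependence) and the tie-breaking rule are all assumptions made for illustration. The sketch simply places a task on the NUMA node that already holds most of its input data.

```python
def pick_node(input_deps, num_nodes):
    """Return the NUMA node that holds the most input bytes for a task.

    input_deps: list of (node_id, num_bytes) pairs, one per incoming
                data dependence (hypothetical format, for illustration).
    num_nodes:  number of NUMA nodes in the machine.
    Ties are broken in favour of the lowest node id.
    """
    bytes_per_node = [0] * num_nodes
    for node_id, num_bytes in input_deps:
        bytes_per_node[node_id] += num_bytes

    # Maximise the byte count; on equal counts prefer the smaller node id.
    return max(range(num_nodes), key=lambda n: (bytes_per_node[n], -n))


if __name__ == "__main__":
    # A task reading 64 bytes from node 0, 320 bytes from node 1
    # and 128 bytes from node 2 on a four-node machine.
    deps = [(0, 64), (1, 256), (1, 64), (2, 128)]
    print(pick_node(deps, 4))
```

A real run-time would also have to weigh load balance and the cost of remote accesses against locality; the sketch deliberately ignores both.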
Citations
Automatic Detection of Performance Anomalies in Task-Parallel Programs
Andi Drebes, Karine Heydemann, Antoniu Pop, Albert Cohen, Nathalie Drach
- 29 May 2014
TL;DR: Aftermath is an interactive tool for trace-based performance analysis and debugging of task-parallel programs and run-time systems. It provides visualization, statistics and analysis tools adapted for performance debugging at task granularity.
Bounded Stream Scheduling in Polyhedral OpenStream
Nuno Miguel Nobre, Andi Drebes, Graham Riley, Antoniu Pop
- 22 Jan 2020
TL;DR: This work exploits an approach that certifies that augmenting the intrinsic dataflow dependencies of the program with stream bounding constraints does not deadlock the program: it cannot show the existence of a deadlock but can give a certificate for its absence.
References
Book
LAPACK Users' Guide
Ed Anderson
- 01 Feb 1995
TL;DR: The third edition of the LAPACK Users' Guide covers installation and troubleshooting of the routines and gives examples of how to convert from LINPACK or EISPACK to LAPACK.
Sorting networks and their applications
Kenneth E. Batcher
- 30 Apr 1968
TL;DR: To achieve high throughput rates, today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
The NAS Parallel Benchmarks
David H. Bailey, Eric Barszcz, John T. Barton, D. S. Browning, Russell Carter, Leonardo Dagum, Rod Fatoohi, Paul O. Frederickson, T. A. Lasinski, Robert Schreiber, Horst D. Simon, V. Venkatakrishnan, Sisira Weeratunga
- 01 Sep 1991
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.
Cilk: An Efficient Multithreaded Runtime System
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
The implementation of the Cilk-5 multithreaded language
Matteo Frigo, Charles E. Leiserson, Keith H. Randall
- 01 May 1998
TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.