Book Chapter, DOI: 10.1007/978-3-642-32820-6_49
Hierarchical partitioning algorithm for scientific computing on highly heterogeneous CPU + GPU clusters
David Clarke,Aleksandar Ilic,Alexey Lastovetsky,Leonel Sousa +3 more
- 27 Aug 2012
- pp 489-501
TL;DR: Large-scale experiments on a heterogeneous multi-cluster site incorporating multicore CPUs and GPU nodes show that the presented algorithm outperforms current state-of-the-art approaches and successfully load balances very large problems.
Abstract: Many modern high-performance clusters exhibit hierarchical heterogeneity: computing nodes differ from one another, and specialized accelerators such as GPUs add heterogeneity within a node. To achieve high performance of scientific applications on these platforms, it is necessary to perform load balancing. In this paper we present a hierarchical matrix partitioning algorithm based on realistic performance models at each level of the hierarchy. To minimise the total execution time of the application, it iteratively partitions a matrix between nodes and partitions these sub-matrices between the devices in a node. This self-adaptive algorithm dynamically builds the performance models at run-time and employs an algorithm to minimise the total volume of communication. It allows scientific applications to perform load-balanced matrix operations with nested parallelism on hierarchical heterogeneous platforms. To show the effectiveness of the algorithm, we applied it to a fundamental operation in scientific parallel computing, matrix multiplication. Large-scale experiments on a heterogeneous multi-cluster site incorporating multicore CPUs and GPU nodes show that the presented algorithm outperforms current state-of-the-art approaches and successfully load balances very large problems.
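The two-level idea in the abstract (split work between nodes, then split each node's share between its devices, each time using a performance model) can be illustrated with a small Python sketch. This is not the paper's algorithm: it assumes the time functions are already known and non-decreasing, whereas the paper builds its models at run-time, and it ignores communication volume entirely. All names (`max_units_within`, `partition`, `node_time_fn`, `hierarchical_partition`) and the example speeds are hypothetical.

```python
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple

# Hypothetical performance model: workload units -> execution time in seconds.
# Assumed non-decreasing in the workload.
TimeFn = Callable[[int], float]


def max_units_within(time_fn: TimeFn, deadline: float, limit: int) -> int:
    """Largest x in [0, limit] with time_fn(x) <= deadline (0 if none fits)."""
    lo, hi = 0, limit
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if time_fn(mid) <= deadline:
            lo = mid
        else:
            hi = mid - 1
    return lo


def partition(total: int, time_fns: Sequence[TimeFn]) -> List[int]:
    """Split `total` units so that the slowest worker finishes as early as possible."""
    # Bisect on a common deadline: `hi` is always feasible, `lo` is not (or is 0).
    lo, hi = 0.0, min(f(total) for f in time_fns)
    for _ in range(60):
        mid = (lo + hi) / 2
        if sum(max_units_within(f, mid, total) for f in time_fns) >= total:
            hi = mid
        else:
            lo = mid
    parts = [max_units_within(f, hi, total) for f in time_fns]
    # Capacities may overshoot `total`; take the surplus from the slowest worker.
    surplus = sum(parts) - total
    while surplus > 0:
        busy = [k for k in range(len(parts)) if parts[k] > 0]
        slowest = max(busy, key=lambda k: time_fns[k](parts[k]))
        parts[slowest] -= 1
        surplus -= 1
    return parts


def node_time_fn(device_fns: Sequence[TimeFn]) -> TimeFn:
    """Model a node as one worker: its time for x units under the best device split."""
    @lru_cache(maxsize=None)
    def node_time(x: int) -> float:
        parts = partition(x, device_fns)
        return max(f(p) for f, p in zip(device_fns, parts))
    return node_time


def hierarchical_partition(
    total: int, nodes: Sequence[Sequence[TimeFn]]
) -> Tuple[List[int], List[List[int]]]:
    """Partition between nodes first, then between the devices of each node."""
    node_parts = partition(total, [node_time_fn(devs) for devs in nodes])
    device_parts = [partition(n, devs) for n, devs in zip(node_parts, nodes)]
    return node_parts, device_parts


if __name__ == "__main__":
    # Assumed speeds: node 0 has a CPU (10 units/s) and a GPU (30 units/s),
    # node 1 has a single CPU (20 units/s).
    cluster = [[lambda x: x / 10.0, lambda x: x / 30.0], [lambda x: x / 20.0]]
    print(hierarchical_partition(1200, cluster))
```

Treating each node as a single worker whose time function is itself the result of an inner partitioning keeps the two levels decoupled: the same `partition` routine serves both, and only the time functions change.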
Citations
A Survey of CPU-GPU Heterogeneous Computing Techniques
Sparsh Mittal,Jeffrey S. Vetter +1 more
TL;DR: This article surveys Heterogeneous Computing Techniques (HCTs) such as workload partitioning that enable utilizing both CPUs and GPUs to improve performance and/or energy efficiency and reviews both discrete and fused CPU-GPU systems.
542
Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms
Olivier Beaumont,Brett A. Becker,Ashley DeFlumere,Lionel Eyraud-Dubois,Thomas Lambert,Alexey Lastovetsky +5 more
TL;DR: This paper presents recent approaches that relax the restriction that all partitions be rectangles and uses the first exact approach to analyse how close to the known optimal solutions the NRRP algorithm is for small numbers of partitions.
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
TL;DR: This work proposes a number of optimizations of the dominant kernel of the Krylov solver, aimed at acceleration of the overall execution of the applications on modern GPU-accelerated heterogeneous platforms.
To waffinity and beyond: a scalable architecture for incremental parallelization of file system code
Matthew Curtis-Maury,Vinay Devadas,Vania Fang,Aditya Kulkarni +3 more
- 02 Nov 2016
TL;DR: The evolution of the multiprocessor software architecture employed by the NetApp® Data ONTAP® WAFL® file system is described as a case study in incrementally scaling a production storage system. Results demonstrate the success of the proposed MP models in delivering scalable performance while balancing time-to-market requirements.
A Hierarchical Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous Multi-Accelerator NUMA Nodes
TL;DR: This paper proposes a hierarchical two-level data partitioning algorithm minimizing the parallel execution time of data-parallel applications on clusters of identical nodes where each node has $h$ identical nodes and proposes an extension of the algorithm for clusters of non-identical nodes.
References
Scheduling multithreaded computations by work stealing
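TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time to execute a fully strict computation on $P$ processors using this scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time and $T_\infty$ is the minimum execution time with an infinite number of processors.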
GPU clusters for high-performance computing
Volodymyr Kindratenko,Jeremy Enos,Guochun Shi,Michael Showerman,Galen Wesley Arnold,John E. Stone,James C. Phillips,Wen-mei W. Hwu +7 more
- 16 Oct 2009
TL;DR: This paper presents efforts to address some of the challenges with building and running GPU clusters in HPC environments and touches upon such issues as balanced cluster architecture, resource sharing in a cluster environment, programming models, and applications for GPU clusters.
Performance of the decoupled ACRI-1 architecture: the perfect club
Nigel Topham,Kenneth McDougall +1 more
- 03 May 1995
TL;DR: The applicability of access and control decoupling to real-world codes is investigated, and bounds for the performance of these codes are derived. Whilst some codes exhibit performance roughly equivalent to that on vector computers, others exhibit considerably higher performance potential in a decoupled system.
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
TL;DR: StarPU is a runtime system that provides a high-level, unified execution model, giving numerical kernel designers a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.