Proceedings Article, DOI: 10.1145/1838574.1838586
Optimizing a parallel runtime system for multicore clusters: a case study
Chao Mei, Gengbin Zheng, Filippo Gioachin, Laxmikant V. Kale
- 02 Aug 2010
- pp 12
TL;DR: This paper studies several multicore performance issues on clusters using Intel, AMD and IBM processors in the context of the Charm++ runtime system, and presents the optimization techniques that overcome these performance issues.
Abstract: Clusters of multicore nodes have become the most popular option for new HPC systems due to their scalability and performance/cost ratio. The complexity of programming multicore systems underscores the need for powerful and efficient runtime systems that manage resources such as threads and communication sub-systems on behalf of the applications. In this paper, we study several multicore performance issues on clusters using Intel, AMD and IBM processors in the context of the Charm++ runtime system. We then present the optimization techniques that overcome these performance issues. The techniques presented are general enough to apply to other runtime systems as well. We demonstrate the benefits of these optimizations through both synthetic benchmarks and production quality applications including NAMD and ChaNGa on several popular multicore platforms. We demonstrate performance improvement of NAMD and ChaNGa by about 20% and 10%, respectively.
Citations
A Hierarchical Approach for Load Balancing on Parallel Multi-core Systems
Laércio Lima Pilla, Christiane Pousa Ribeiro, Daniel Cordeiro, Chao Mei, Abhinav Bhatele, Philippe O. A. Navaux, François Broquedis, Jean-François Méhaut, Laxmikant V. Kale
- 10 Sep 2012
TL;DR: NucoLB is introduced, a topology-aware load balancer that focuses on redistributing work while reducing communication costs among and within compute nodes and takes the asymmetric memory access costs present on NUMA multi-core compute nodes, the interconnection network overheads, and the application communication patterns into account in its balancing decisions.
Development of Load Balancer and Parallel Database Management Module
R. F. Gibadullin, I. S. Vershinin, R. Sh. Minyazev
- 15 May 2018
TL;DR: The article discusses the main development aspects of a cluster load balancer that increases the speed of query processing for a database running under the PostgreSQL DBMS on the Windows operating system.
Controlling the Memory Subscription of Distributed Applications with a Task-Based Runtime System
Marc Sergent, David Goudin, Samuel Thibault, Olivier Aumage
- 23 May 2016
TL;DR: It is shown that the task paradigm makes it possible to control the memory footprint of the application by throttling the task submission flow rate, striking a compromise between the performance benefits of anticipative task submission and the resulting memory consumption.
Realization of replication mechanism in PostgreSQL DBMS
R. F. Gibadullin, I. S. Vershinin, R. Sh. Minyazev
- 16 May 2017
TL;DR: The article considers the essence of data replication, replication strategy, and the basic methods and techniques of replication, and analyzes how well the replication method suits a high-performance parallel DBMS on a cluster platform.
Improving Parallel System Performance with a NUMA-aware Load Balancer
Laércio Lima Pilla, Christiane Pousa Ribeiro, Daniel Cordeiro, Abhinav Bhatele, Philippe O. A. Navaux, Jean-François Méhaut, Laxmikant V. Kale
- 15 Jul 2011
TL;DR: This work proposes a NUMA-aware load balancer that combines the information about the NUMA topology with the statistics captured by the Charm++ runtime system and shows improvements over existing load balancing strategies both in benchmark performance and in the time for load balancing.
References
OpenMP: an industry standard API for shared-memory programming
TL;DR: At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism, and it leaves the base language unspecified.
Cilk: An Efficient Multithreaded Runtime System
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, Yuli Zhou
TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.
CHARM++: a portable concurrent object oriented system based on C++
Laxmikant V. Kale, Sanjeev Krishnan
- 01 Oct 1993
TL;DR: Charm++ is an explicitly parallel language consisting of C++ with a few extensions that provides a clear separation between sequential and parallel objects and helps one write programs that are latency-tolerant.
Simple, fast, and practical non-blocking and blocking concurrent queue algorithms
Maged M. Michael, Michael L. Scott
- 01 May 1996
TL;DR: Experiments on a 12-node SGI Challenge multiprocessor indicate that the new non-blocking queue consistently outperforms the best known alternatives; it is the clear algorithm of choice for machines that provide a universal atomic primitive (e.g., compare_and_swap or load_linked/store_conditional).
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Becky Verastegui
- 16 Nov 2007
TL;DR: An extraordinary technical program is in store for you, and a record number of Birds-of-a-Feather submissions will provide you with highlights on a wide variety of technology and software topics, as SC07 continues in the SC conference tradition.