Top 5 papers presented at Parallel Computing in 2022

Journal Article•10.1016/J.PARCO.2021.102856

OpenMP application experiences: Porting to accelerated nodes

Seonmyeong Bak¹, Colleen Bertoni², Swen Boehm³, Reuben D. Budiardja³, Barbara Chapman⁴, Johannes Doerfert², Markus Eisenbach³, Hal Finkel², Oscar Hernandez³, Joseph Huber³, Shintaro Iwasaki², Vivek S. Kale⁴, Paul R. C. Kent³, JaeHyuk Kwack², Meifeng Lin⁴, Piotr Luszczek⁵, Ye Luo², Buu Pham⁶, Swaroop Pophale³, Kiran Ravikumar¹, Vivek Sarkar¹, Thomas R. W. Scogland⁷, Shilei Tian⁸, Pui-Kuen Yeung¹•Institutions (8)

Georgia Institute of Technology¹, Argonne National Laboratory², Oak Ridge National Laboratory³, Brookhaven National Laboratory⁴, University of Tennessee⁵, Iowa State University⁶, Lawrence Livermore National Laboratory⁷, Stony Brook University⁸

1 Mar 2022

TL;DR: In this paper, the authors report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy.

Abstract: As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand the OpenMP implementation’s behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model to use for applications to be run on emerging and future platforms with accelerated nodes.

29 citations

Proceedings Article•10.1016/J.PARCO.2021.102861

Benchmarking the performance of irregular computations in AutoDock-GPU molecular docking

Leonardo Solis-Vasquez¹, Andreas F. Tillack², Diogo Santos-Martins², Andreas Koch¹, Scott LeGrand³, Stefano Forli²•Institutions (3)

Technische Universität Darmstadt¹, Scripps Research Institute², Nvidia³

1 Mar 2022

TL;DR: In this article, the authors evaluate the impact of early termination on the execution time of AutoDock-GPU and show that it is possible to achieve average runtime reductions of 50% by using these techniques.

Abstract: Irregular applications can be found in different scientific fields. In computer-aided drug design, molecular docking simulations play an important role in finding promising drug candidates. AutoDock is a software application widely used for predicting molecular interactions at close distances. It is characterized by irregular computations and long execution runtimes. In recent years, a hardware-accelerated version of AutoDock, called AutoDock-GPU, has been under active development. This work benchmarks the recent code and algorithmic enhancements incorporated into AutoDock-GPU. Particularly, we analyze the impact on execution runtime of techniques based on early termination. These enable AutoDock-GPU to explore the molecular space as necessary, while safely avoiding redundant computations. Our results indicate that it is possible to achieve average runtime reductions of 50% by using these techniques. Furthermore, a comprehensive literature review is also provided, where our work is compared to relevant approaches leveraging hardware acceleration for molecular docking.
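
The general idea of early termination in an iterative search can be sketched as follows. This is a minimal illustration only: the abstract does not state the stopping criterion AutoDock-GPU uses, so the function name, the `patience` and `tolerance` parameters, and the sample scores are all hypothetical.

```python
def search_with_early_termination(scores, patience, tolerance):
    """Consume candidate scores (lower is better) and stop once the best
    score has not improved by more than `tolerance` for `patience`
    consecutive evaluations. Hypothetical criterion, for illustration."""
    best = float("inf")
    stale = 0        # consecutive evaluations without meaningful improvement
    evaluated = 0    # how many candidates were actually scored
    for s in scores:
        evaluated += 1
        improved = (best - s) > tolerance
        best = min(best, s)
        stale = 0 if improved else stale + 1
        if stale >= patience:
            break    # remaining candidates are skipped
    return best, evaluated


# Made-up scores that plateau after the third evaluation.
scores = [5.0, 3.0, 2.5, 2.5, 2.49, 2.5, 2.5, 2.5, 2.5, 2.5]
best, evaluated = search_with_early_termination(scores, patience=3, tolerance=0.05)
```

In this toy run the loop stops after 6 of the 10 candidates, because three evaluations in a row fail to improve the best score by more than the tolerance.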

23 citations

Journal Article•10.1016/J.PARCO.2021.102860

Towards leveraging collective performance with the support of MPI 4.0 features in MPC

Stephane Bouhrour, Thibaut Pepin, Julien Jaeger¹•Institutions (1)

Université Paris-Saclay¹

1 Mar 2022

TL;DR: In this paper, the authors describe the implementation of hardware topology splitting and communicator splitting according to the underlying MPI topology in the MPC framework and demonstrate how these two features can be combined to produce efficient topology-aware persistent collective implementations.

Abstract: Persistent collective communications and communicator splitting according to the underlying hardware topology have recently been voted into the MPI standard. Persistent semantics comprise an initialization phase, called only once for a specific collective operation, followed by recurring invocations. This opens the door to many optimizations that require heavy setup costs to improve collective performance. Communicator topological splitting offers a standard way to design topological algorithms through the use of sub-communicators mapped to hardware hierarchical levels. Setting up these communicators might be too costly to be efficient for a single collective call. However, the persistent semantics allow these communicators to be created once at initialization and reused across the multiple collective invocations, yielding an efficient algorithm. In this paper, we describe the implementation of these two new MPI features in the MPC framework. We first present a naive and an optimized version of persistent collectives without topology knowledge. Then, after detailing the implementation of hardware topology splitting and the hierarchical levels supported in MPC, we showcase how these two features can be combined to produce efficient topology-aware persistent collective implementations. Experimental results show that the topology-aware algorithms built with these basic blocks offer good performance, independent of the binding of MPI processes.
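
The "set up once, invoke many times" pattern described in the abstract can be modelled without MPI. The sketch below is a toy simulation in plain Python, not MPC's implementation: the class `PersistentAllreduce`, its `start` method, and the rank-to-node mapping are hypothetical names chosen for illustration. In MPI 4.0 itself, the corresponding building blocks are `MPI_Comm_split_type` for topological splitting and `MPI_Allreduce_init` with `MPI_Start` for persistent collectives.

```python
from collections import defaultdict


class PersistentAllreduce:
    """Toy model of a topology-aware persistent allreduce (sum)."""

    def __init__(self, rank_to_node):
        # Setup phase, paid once: group ranks by node, standing in for
        # the sub-communicators a topological split would create.
        self.setup_calls = 1
        groups = defaultdict(list)
        for rank, node in enumerate(rank_to_node):
            groups[node].append(rank)
        self.groups = dict(groups)
        self.num_ranks = len(rank_to_node)

    def start(self, values):
        # Step 1: reduce within each node (local, cheap links).
        partial = {
            node: sum(values[r] for r in ranks)
            for node, ranks in self.groups.items()
        }
        # Step 2: combine one partial result per node (inter-node links).
        total = sum(partial.values())
        # Step 3: every rank ends up with the same result.
        return [total] * self.num_ranks


# Four ranks on two nodes; the grouping is built once and reused.
op = PersistentAllreduce(rank_to_node=[0, 0, 1, 1])
first = op.start([1, 2, 3, 4])
second = op.start([10, 20, 30, 40])
```

Both invocations reuse the grouping built in the constructor, which is the property that lets an expensive topology-aware setup pay off over repeated calls.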

2 citations

Proceedings Article•10.1016/J.PARCO.2021.102863

Reducing queuing impact in streaming applications with irregular dataflow

Stephen Timcheck¹, Jeremy Buhler¹•Institutions (1)

Washington University in St. Louis¹

1 Mar 2022

TL;DR: In this paper, the authors study irregular dataflow applications, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori.

Abstract: Throughput-oriented streaming applications on massive data sets are a prime candidate for parallelization on wide-SIMD platforms, especially when inputs are independent of one another. Many such applications are represented as a pipeline of compute nodes connected by directed edges. Here, we study applications with irregular dataflow, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori. We consider how to implement such applications on wide-SIMD architectures, such as GPUs, where different nodes of the pipeline execute cooperatively on a single processor. To promote greater SIMD parallelism, irregular application pipelines can utilize queues to gather and compact multiple data items between nodes. However, the decision to introduce a queue between two nodes must trade off benefits to occupancy against costs associated with managing the queue and scheduling the nodes at its endpoints. Moreover, once queues are introduced to an application, their relative sizes impact the frequency with which the application switches between nodes, incurring scheduling and context-switching overhead. This work examines two optimization problems associated with queues. First, given a pipeline with queues between each two nodes and a fixed total budget for queue space, we consider how to choose the relative sizes of inter-node queues to minimize the frequency of switching between nodes. Second, we consider which pairs of successive nodes in a pipeline should have queues between them to maximize overall application throughput. We give an empirically useful approximation to the first problem that allows for an analytical solution and formulate a performance model for the second that directs implementation toward higher-performing strategies. We implemented our analyses and resulting optimizations in applications built using Mercator, a framework we designed to support irregular streaming applications on NVIDIA GPUs. We demonstrate that these optimizations yield meaningful performance improvements for several benchmark Mercator applications.
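
The occupancy benefit of a queue between two nodes can be illustrated with a small arithmetic model. This is a simplified sketch, not the paper's performance model or Mercator code: the function names, the SIMD width of 4, and the sample output counts are assumptions, and the model deliberately ignores the queue-management and scheduling costs that the abstract says must be weighed against the benefit.

```python
def occupancy_without_queue(output_counts, width):
    """Outputs of each input are processed on their own, so every input
    with at least one output costs at least one SIMD batch."""
    batches = sum(-(-n // width) for n in output_counts if n > 0)  # ceil division
    items = sum(output_counts)
    return items / (batches * width)


def occupancy_with_queue(output_counts, width):
    """Outputs are gathered and compacted in a queue first, so batches
    are filled densely regardless of which input produced each item."""
    items = sum(output_counts)
    batches = -(-items // width)  # ceil division
    return items / (batches * width)


# Data-dependent number of outputs per input (made-up values; the sketch
# assumes at least one input produces output).
counts = [1, 0, 3, 2, 0, 2]
width = 4
no_q = occupancy_without_queue(counts, width)
with_q = occupancy_with_queue(counts, width)
```

With these numbers, the 8 items occupy 4 half-empty batches without a queue (occupancy 0.5) and 2 full batches with one (occupancy 1.0).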

1 citation

Journal Article•10.1016/J.PARCO.2021.102859

MPI detach — Towards automatic asynchronous local completion

Joachim Protze¹, Marc-André Hermanns¹, Matthias S. Müller¹, Van Man Nguyen²,³, Julien Jaeger⁴, Emmanuelle Saillard³, Patrick Carribault⁴, Denis Barthou²,³•Institutions (4)

RWTH Aachen University¹, University of Bordeaux², French Institute for Research in Computer Science and Automation³, Université Paris-Saclay⁴

1 Mar 2022

TL;DR: In this article, a slim extension of the MPI interface is presented that allows seamless integration of non-blocking communication with available concepts of asynchronous execution in OpenMP and C++, so that task dependency graphs for asynchronous execution can span the full distributed-memory application.

Abstract: When aiming for large-scale parallel computing, waiting times due to network latency, synchronization, and load imbalance are the primary opponents of high parallel efficiency. A common approach to hide latency with computation is the use of non-blocking communication. In the presence of a consistent load imbalance, synchronization cost is just the visible symptom of the load imbalance. Tasking approaches as in OpenMP, TBB, OmpSs, or C++20 coroutines promise to expose a higher degree of concurrency, which can be distributed on available execution units and significantly improve load balance. Available MPI non-blocking functionality does not integrate seamlessly into such tasking parallelization. In this work, we present a slim extension of the MPI interface to allow seamless integration of non-blocking communication with available concepts of asynchronous execution in OpenMP and C++. Our concept allows task dependency graphs for asynchronous execution to span the full distributed-memory application. We furthermore investigate the compile-time analysis necessary to transform an application using blocking MPI communication into an application integrating OpenMP tasks with our proposed MPI interface extension.
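
The difference between waiting on a request and attaching a completion callback to it can be sketched with Python's standard `concurrent.futures` module. This is an analogy only, not the MPI extension proposed in the paper: `fake_nonblocking_recv` and `on_complete` are hypothetical stand-ins, and a thread pool plays the role of the communication layer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_nonblocking_recv():
    # Stand-in for a non-blocking receive that eventually delivers data.
    return [1, 2, 3]


results = []
done = threading.Event()


def on_complete(future):
    # Runs when the "communication" finishes; no task had to block on it.
    results.append(sum(future.result()))
    done.set()


with ThreadPoolExecutor(max_workers=1) as pool:
    request = pool.submit(fake_nonblocking_recv)
    # Hand completion over to a callback instead of waiting on the request.
    request.add_done_callback(on_complete)
    # Independent work that overlaps with the pending communication.
    overlapped = sum(range(5))
    # Only the final consumer synchronizes, and only for this demo.
    done.wait(timeout=5)
```

The work that depends on the received data lives in the callback, so the code that issued the request is free to continue with unrelated computation.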
