Scispace (Formerly Typeset)
Showing papers presented at "Parallel Computing in 2022"
Journal Article•10.1016/J.PARCO.2021.102856•
OpenMP application experiences: Porting to accelerated nodes

Seonmyeong Bak1, Colleen Bertoni2, Swen Boehm3, Reuben D. Budiardja3, Barbara Chapman4, Johannes Doerfert2, Markus Eisenbach3, Hal Finkel2, Oscar Hernandez3, Joseph Huber3, Shintaro Iwasaki2, Vivek S. Kale4, Paul R. C. Kent3, JaeHyuk Kwack2, Meifeng Lin4, Piotr Luszczek5, Ye Luo2, Buu Pham6, Swaroop Pophale3, Kiran Ravikumar1, Vivek Sarkar1, Thomas R. W. Scogland7, Shilei Tian8, Pui-Kuen Yeung1 •
Georgia Institute of Technology1, Argonne National Laboratory2, Oak Ridge National Laboratory3, Brookhaven National Laboratory4, University of Tennessee5, Iowa State University6, Lawrence Livermore National Laboratory7, Stony Brook University8
1 Mar 2022
TL;DR: In this paper, the authors report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy.
Abstract: As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand the OpenMP implementation’s behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model to use for applications to be run on emerging and future platforms with accelerated nodes.

29 citations

Proceedings Article•10.1016/J.PARCO.2021.102861•
Benchmarking the performance of irregular computations in AutoDock-GPU molecular docking

Leonardo Solis-Vasquez1, Andreas F. Tillack2, Diogo Santos-Martins2, Andreas Koch1, Scott LeGrand3, Stefano Forli2 •
Technische Universität Darmstadt1, Scripps Research Institute2, Nvidia3
1 Mar 2022
TL;DR: In this article, the authors evaluate the impact of early termination on the execution time of AutoDock-GPU and show that it is possible to achieve average runtime reductions of 50% by using these techniques.
Abstract: Irregular applications can be found in different scientific fields. In computer-aided drug design, molecular docking simulations play an important role in finding promising drug candidates. AutoDock is a software application widely used for predicting molecular interactions at close distances. It is characterized by irregular computations and long execution runtimes. In recent years, a hardware-accelerated version of AutoDock, called AutoDock-GPU, has been under active development. This work benchmarks the recent code and algorithmic enhancements incorporated into AutoDock-GPU. In particular, we analyze the impact on execution runtime of techniques based on early termination. These enable AutoDock-GPU to explore the molecular space as necessary, while safely avoiding redundant computations. Our results indicate that it is possible to achieve average runtime reductions of 50% by using these techniques. Furthermore, a comprehensive literature review is provided, in which our work is compared to relevant approaches leveraging hardware acceleration for molecular docking.
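The idea of early termination mentioned in the abstract can be illustrated with a generic sketch. This is not the criterion used by AutoDock-GPU, which the abstract does not specify; the function `search`, its parameters `patience` and `tol`, and the toy scoring function are all assumptions made for illustration. The search stops once a fixed number of consecutive evaluations fail to improve the best score, instead of always spending the full evaluation budget.

```python
import random


def search(score, start, max_evals, patience=None, tol=1e-6, seed=0):
    """Random local search that minimizes `score` (hypothetical helper).

    If `patience` is set, stop as soon as `patience` consecutive
    evaluations fail to improve the best score by more than `tol`.
    Returns (best_x, best_score, evaluations_used).
    """
    rng = random.Random(seed)
    best_x, best = start, score(start)
    evals, stale = 1, 0
    while evals < max_evals:
        cand = best_x + rng.gauss(0.0, 0.5)  # perturb the current best
        s = score(cand)
        evals += 1
        if s < best - tol:
            best_x, best, stale = cand, s, 0  # progress: reset the counter
        else:
            stale += 1
        if patience is not None and stale >= patience:
            break  # early termination: further work is likely redundant
    return best_x, best, evals


def toy_score(x):
    """Stand-in for an expensive scoring function; minimum at x = 3."""
    return (x - 3.0) ** 2
```

Calling `search(toy_score, 0.0, 20000)` always uses the whole budget, while `search(toy_score, 0.0, 20000, patience=200)` stops once the score has stagnated. How much runtime is saved in a real docking code depends on the actual criterion and workload.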

23 citations

Journal Article•10.1016/J.PARCO.2021.102860•
Towards leveraging collective performance with the support of MPI 4.0 features in MPC

Stephane Bouhrour, Thibaut Pepin, Julien Jaeger1•
Université Paris-Saclay1
1 Mar 2022
TL;DR: In this paper, the authors describe the implementation of hardware topology splitting and communicator splitting according to the underlying MPI topology in the MPC framework and demonstrate how these two features can be combined to produce efficient topology-aware persistent collective implementations.
Abstract: Persistent collective communications and communicator splitting according to the underlying hardware topology have recently been voted into the MPI standard. Persistent semantics comprise an initialization phase, called only once for a specific collective operation, followed by recurring invocations. This opens the door to many optimizations that require heavy setup costs to improve collective performance. Communicator topological splitting offers a standard way to design topological algorithms through the use of sub-communicators mapped to hardware hierarchical levels. Setting up these communicators might be too costly to pay off within a single collective call. However, persistent semantics allow these communicators to be created once at initialization and reused across the multiple collective invocations, yielding an efficient algorithm. In this paper, we describe the implementation of these two new MPI features in the MPC framework. We first present a naive and an optimized version of persistent collectives without topology knowledge. Then, after detailing the implementation of hardware topology splitting and the hierarchical levels supported in MPC, we showcase how these two features can be combined to produce efficient topology-aware persistent collective implementations. Experimental results show that the topology-aware algorithms built with these basic blocks offer good performance, independent of the binding of MPI processes.
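The amortization argument in the abstract (pay a setup cost once, then benefit on every later invocation) can be made concrete with a small worked calculation. The cost model and the function name `break_even_calls` are assumptions for illustration only; the numbers in the usage note are hypothetical and are not measurements from the paper.

```python
import math


def break_even_calls(setup_cost, flat_cost, topo_cost):
    """Smallest n with setup_cost + n * topo_cost <= n * flat_cost.

    setup_cost: one-time cost of building the sub-communicators
    flat_cost:  per-call cost of a topology-unaware collective
    topo_cost:  per-call cost of the topology-aware collective
    Returns None when the topology-aware variant never pays off.
    """
    saving = flat_cost - topo_cost  # time saved on each invocation
    if saving <= 0:
        return None
    return math.ceil(setup_cost / saving)
```

With a hypothetical setup cost of 100 time units and per-call costs of 10 (flat) versus 6 (topology-aware), `break_even_calls(100.0, 10.0, 6.0)` returns 25: a single call cannot recover the setup cost, but a persistent operation invoked more than 25 times can.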

2 citations

Proceedings Article•10.1016/J.PARCO.2021.102863•
Reducing queuing impact in streaming applications with irregular dataflow

Stephen Timcheck1, Jeremy Buhler1•
Washington University in St. Louis1
1 Mar 2022
TL;DR: In this paper, the authors study irregular dataflow applications, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori.
Abstract: Throughput-oriented streaming applications on massive data sets are a prime candidate for parallelization on wide-SIMD platforms, especially when inputs are independent of one another. Many such applications are represented as a pipeline of compute nodes connected by directed edges. Here, we study applications with irregular dataflow, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori. We consider how to implement such applications on wide-SIMD architectures, such as GPUs, where different nodes of the pipeline execute cooperatively on a single processor. To promote greater SIMD parallelism, irregular application pipelines can utilize queues to gather and compact multiple data items between nodes. However, the decision to introduce a queue between two nodes must trade off benefits to occupancy against costs associated with managing the queue and scheduling the nodes at its endpoints. Moreover, once queues are introduced to an application, their relative sizes impact the frequency with which the application switches between nodes, incurring scheduling and context-switching overhead. This work examines two optimization problems associated with queues. First, given a pipeline with queues between each two nodes and a fixed total budget for queue space, we consider how to choose the relative sizes of inter-node queues to minimize the frequency of switching between nodes. Second, we consider which pairs of successive nodes in a pipeline should have queues between them to maximize overall application throughput. We give an empirically useful approximation to the first problem that allows for an analytical solution and formulate a performance model for the second that directs implementation toward higher-performing strategies. We implemented our analyses and resulting optimizations in applications built using Mercator, a framework we designed to support irregular streaming applications on NVIDIA GPUs. We demonstrate that these optimizations yield meaningful performance improvements for several benchmark Mercator applications.
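The occupancy benefit of a queue between two nodes can be seen in a toy model. This is a simplified sketch and not Mercator's scheduler or the paper's performance model: the function names are hypothetical, the queue is assumed to be unbounded, and queue management and scheduling costs (which the abstract says must be traded off against occupancy) are ignored. Each entry of `outputs_per_firing` is the data-dependent number of items one upstream firing produces, and `width` is the SIMD width of the downstream node.

```python
def occupancy_without_queue(outputs_per_firing, width):
    """Downstream node runs immediately on whatever each upstream firing produced."""
    lanes_used = 0
    lanes_issued = 0
    for n in outputs_per_firing:
        steps = -(-n // width)  # ceil(n / width) SIMD steps for n items
        lanes_used += n
        lanes_issued += steps * width
    return lanes_used / lanes_issued if lanes_issued else 0.0


def occupancy_with_queue(outputs_per_firing, width):
    """Downstream node runs only on full batches gathered in a queue,
    plus one final flush of any partial batch at end of stream."""
    queued = 0
    lanes_used = 0
    lanes_issued = 0
    for n in outputs_per_firing:
        queued += n
        full_steps = queued // width  # batches that fill every lane
        lanes_used += full_steps * width
        lanes_issued += full_steps * width
        queued -= full_steps * width
    if queued:  # leftover items occupy only part of one SIMD step
        lanes_used += queued
        lanes_issued += width
    return lanes_used / lanes_issued if lanes_issued else 0.0
```

For the irregular output stream `[1, 0, 3, 2, 0, 1, 1]` and `width = 4`, the unqueued variant keeps 8 of 20 issued lanes busy (occupancy 0.4), while the queued variant compacts the same 8 items into two full SIMD steps (occupancy 1.0).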

1 citation

Journal Article•10.1016/J.PARCO.2021.102859•
MPI detach — Towards automatic asynchronous local completion

Joachim Protze1, Marc-André Hermanns1, Matthias S. Müller1, Van Man Nguyen2,3, Julien Jaeger4, Emmanuelle Saillard3, Patrick Carribault4, Denis Barthou2,3 •
RWTH Aachen University1, University of Bordeaux2, French Institute for Research in Computer Science and Automation3, Université Paris-Saclay4
1 Mar 2022
TL;DR: In this article, a slim extension of the MPI interface is presented to allow seamless integration of non-blocking communication with available concepts of asynchronous execution in OpenMP and C++, which makes it possible to span task dependency graphs for asynchronous execution over the full distributed memory application.
Abstract: When aiming for large-scale parallel computing, waiting times due to network latency, synchronization, and load imbalance are the primary opponents of high parallel efficiency. A common approach to hide latency with computation is the use of non-blocking communication. In the presence of a consistent load imbalance, synchronization cost is just the visible symptom of the load imbalance. Tasking approaches as in OpenMP, TBB, OmpSs, or C++20 coroutines promise to expose a higher degree of concurrency, which can be distributed on available execution units and significantly improve load balance. Available MPI non-blocking functionality does not integrate seamlessly into such tasking parallelization. In this work, we present a slim extension of the MPI interface to allow seamless integration of non-blocking communication with available concepts of asynchronous execution in OpenMP and C++. Our concept makes it possible to span task dependency graphs for asynchronous execution over the full distributed memory application. We furthermore investigate the compile-time analysis necessary to transform an application using blocking MPI communication into an application integrating OpenMP tasks with our proposed MPI interface extension.
