TL;DR: In this paper, the authors report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy.
Abstract: As recent enhancements to the OpenMP specification become available in its implementations, there is a need to share the results of experimentation in order to better understand the OpenMP implementation’s behavior in practice, to identify pitfalls, and to learn how the implementations can be effectively deployed in scientific codes. We report on experiences gained and practices adopted when using OpenMP to port a variety of ECP applications, mini-apps and libraries based on different computational motifs to accelerator-based leadership-class high-performance supercomputer systems at the United States Department of Energy. Additionally, we identify important challenges and open problems related to the deployment of OpenMP. Through our report of experiences, we find that OpenMP implementations are successful on current supercomputing platforms and that OpenMP is a promising programming model to use for applications to be run on emerging and future platforms with accelerated nodes.
TL;DR: In this article, the authors evaluate the impact of early termination on the execution time of AutoDock-GPU and show that it is possible to achieve average runtime reductions of 50% by using these techniques.
Abstract: Irregular applications can be found in different scientific fields. In computer-aided drug design, molecular docking simulations play an important role in finding promising drug candidates. AutoDock is a software application widely used for predicting molecular interactions at close distances. It is characterized by irregular computations and long execution runtimes. In recent years, a hardware-accelerated version of AutoDock, called AutoDock-GPU, has been under active development. This work benchmarks the recent code and algorithmic enhancements incorporated into AutoDock-GPU. In particular, we analyze the impact on execution runtime of techniques based on early termination. These enable AutoDock-GPU to explore the molecular space as necessary, while safely avoiding redundant computations. Our results indicate that it is possible to achieve average runtime reductions of 50% by using these techniques. Furthermore, a comprehensive literature review is also provided, where our work is compared to relevant approaches leveraging hardware acceleration for molecular docking.
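The idea of early termination can be illustrated with a minimal sketch. It is a generic stagnation-based stopping rule applied to a toy random search, not AutoDock-GPU's actual heuristic; the function names, the `window` and `tol` parameters, and the quadratic `energy` function are all assumptions made for illustration.

```python
import random
import statistics


def search_with_early_stop(score, start, max_iters=2000, window=50, tol=1e-3, seed=0):
    """Toy random local search that stops once the best score stagnates.

    Hypothetical illustration only: the stopping rule (spread of the recent
    best scores below `tol`) is an assumption, not AutoDock-GPU's criterion.
    """
    rng = random.Random(seed)
    best_x, best_s = start, score(start)
    history = []  # best score recorded after each iteration
    for it in range(1, max_iters + 1):
        cand = best_x + rng.gauss(0.0, 0.1)  # small random perturbation
        s = score(cand)
        if s < best_s:  # keep the candidate only if it lowers the score
            best_x, best_s = cand, s
        history.append(best_s)
        # Early termination: the last `window` best scores barely changed.
        if len(history) >= window and statistics.pstdev(history[-window:]) < tol:
            return best_x, best_s, it
    return best_x, best_s, max_iters


def energy(x):
    """Assumed toy objective with its minimum at x = 1."""
    return (x - 1.0) ** 2


x, e, iters = search_with_early_stop(energy, start=0.0)
print(f"stopped after {iters} iterations, energy {e:.6f}")
```

The iteration budget `max_iters` remains a hard upper bound; the stagnation test only lets the search finish sooner when further iterations are unlikely to improve the result.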
TL;DR: In this paper, the authors describe the implementation of persistent collectives and of communicator splitting according to the underlying hardware topology in the MPC framework, and demonstrate how these two features can be combined to produce efficient topology-aware persistent collective implementations.
Abstract: Persistent collective communications and communicator splitting according to the underlying hardware topology have recently been voted into the MPI standard. Persistent semantics provide an initialization phase that is called only once for a specific collective operation, followed by recurring invocations. This opens the door to many optimizations whose heavy setup costs can be amortized to improve collective performance. Topological communicator splitting offers a standard way to design topology-aware algorithms through sub-communicators mapped to hardware hierarchy levels. Setting up these communicators may be too costly to pay off for a single collective call. However, persistent semantics allow these communicators to be created once at initialization and reused across the subsequent collective invocations, yielding an efficient algorithm. In this paper, we describe the implementation of these two new MPI features in the MPC framework. We first present a naive and an optimized version of persistent collectives without topology knowledge. Then, after detailing the implementation of hardware topology splitting and the hierarchical levels supported in MPC, we show how these two features can be combined to produce efficient topology-aware persistent collective implementations. Experimental results show that the topology-aware algorithms built from these basic blocks offer good performance, independent of the binding of MPI processes.
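The combination of a one-time setup phase with a two-level, topology-aware algorithm can be sketched in a small single-process simulation. This is not MPC's implementation and uses no MPI calls; the class name `PersistentHierarchicalSum`, the `node_of_rank` mapping, and the `start` method are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class PersistentHierarchicalSum:
    """Simulated two-level allreduce (sum) with a one-time setup phase.

    Hypothetical sketch: ranks are list indices and "communication" is
    plain Python arithmetic; no real MPI library is involved.
    """

    def __init__(self, node_of_rank):
        # Setup, performed once: group ranks by the node they run on,
        # analogous to splitting a communicator along a hardware level.
        groups = defaultdict(list)
        for rank, node in enumerate(node_of_rank):
            groups[node].append(rank)
        self.groups = list(groups.values())
        self.size = len(node_of_rank)

    def start(self, values):
        """One invocation of the collective, reusing the groups from setup."""
        if len(values) != self.size:
            raise ValueError("expected exactly one value per rank")
        # Step 1: intra-node reduction onto each node's leader.
        partial = [sum(values[r] for r in group) for group in self.groups]
        # Step 2: reduction among the node leaders.
        total = sum(partial)
        # Step 3: every rank receives the global result.
        return [total] * self.size


# Six ranks spread over three nodes (assumed placement).
op = PersistentHierarchicalSum(node_of_rank=[0, 0, 1, 1, 1, 2])
first = op.start([1, 2, 3, 4, 5, 6])
second = op.start([10, 0, 0, 0, 0, 0])  # the grouping is not rebuilt here
print(first, second)
```

The point of the sketch is the cost split: the grouping is computed in the constructor, so each later `start` call pays only for the reduction steps.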
TL;DR: In this paper, the authors study irregular dataflow applications, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori.
Abstract: Throughput-oriented streaming applications on massive data sets are a prime candidate for parallelization on wide-SIMD platforms, especially when inputs are independent of one another. Many such applications are represented as a pipeline of compute nodes connected by directed edges. Here, we study applications with irregular dataflow, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori. We consider how to implement such applications on wide-SIMD architectures, such as GPUs, where different nodes of the pipeline execute cooperatively on a single processor. To promote greater SIMD parallelism, irregular application pipelines can utilize queues to gather and compact multiple data items between nodes. However, the decision to introduce a queue between two nodes must trade off benefits to occupancy against costs associated with managing the queue and scheduling the nodes at its endpoints. Moreover, once queues are introduced to an application, their relative sizes impact the frequency with which the application switches between nodes, incurring scheduling and context-switching overhead. This work examines two optimization problems associated with queues. First, given a pipeline with a queue between every pair of successive nodes and a fixed total budget for queue space, we consider how to choose the relative sizes of inter-node queues to minimize the frequency of switching between nodes. Second, we consider which pairs of successive nodes in a pipeline should have queues between them to maximize overall application throughput. We give an empirically useful approximation to the first problem that allows for an analytical solution and formulate a performance model for the second that directs implementation toward higher-performing strategies. We implemented our analyses and resulting optimizations in applications built using Mercator, a framework we designed to support irregular streaming applications on NVIDIA GPUs. We demonstrate that these optimizations yield meaningful performance improvements for several benchmark Mercator applications.
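The occupancy benefit of gathering items in a queue can be made concrete with a small worked example. The model below is an assumption made for illustration, not the paper's performance model: it counts SIMD operations only, assumes an unbounded queue, and ignores the queue-management and scheduling costs that the abstract identifies as the other side of the trade-off. The function names and the sample output counts are hypothetical.

```python
import math


def occupancy_without_queue(outputs_per_input, width):
    """Each input's outputs are processed on their own, padded to full vectors."""
    ops = sum(math.ceil(k / width) for k in outputs_per_input if k > 0)
    items = sum(outputs_per_input)
    return items / (ops * width) if ops else 0.0


def occupancy_with_queue(outputs_per_input, width):
    """All outputs are first compacted in a queue, then processed in full vectors."""
    items = sum(outputs_per_input)
    ops = math.ceil(items / width)
    return items / (ops * width) if ops else 0.0


# Data-dependent number of outputs per input (assumed sample) and SIMD width.
counts = [3, 0, 9, 1, 5, 2, 0, 12]
WIDTH = 8

direct = occupancy_without_queue(counts, WIDTH)  # 32 items in 8 vector ops
queued = occupancy_with_queue(counts, WIDTH)     # 32 items in 4 vector ops
print(direct, queued)  # prints 0.5 1.0
```

In this sample the queue halves the number of vector operations; whether that gain survives in practice depends on the overheads the simple model leaves out.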
TL;DR: In this article, a slim extension of the MPI interface is presented that allows seamless integration of non-blocking communication with available concepts of asynchronous execution in OpenMP and C++, so that task dependency graphs for asynchronous execution can span the full distributed-memory application.
Abstract: When aiming for large-scale parallel computing, waiting time due to network latency, synchronization, and load imbalance is the primary opponent of high parallel efficiency. A common approach to hide latency with computation is the use of non-blocking communication. In the presence of a consistent load imbalance, synchronization cost is just the visible symptom of the load imbalance. Tasking approaches as in OpenMP, TBB, OmpSs, or C++20 coroutines promise to expose a higher degree of concurrency, which can be distributed over the available execution units and significantly improve load balance. Available MPI non-blocking functionality does not integrate seamlessly into such tasking parallelization. In this work, we present a slim extension of the MPI interface to allow seamless integration of non-blocking communication with available concepts of asynchronous execution in OpenMP and C++. Our concept allows task dependency graphs for asynchronous execution to span the full distributed-memory application. We furthermore investigate the compile-time analysis necessary to transform an application using blocking MPI communication into an application integrating OpenMP tasks with our proposed MPI interface extension.
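The general pattern of treating a pending communication as a node in a task dependency graph can be sketched by analogy. The example uses Python's `asyncio` purely as a stand-in for a tasking runtime; it is not the proposed MPI extension, and `fake_irecv`, its delay, and its payload are hypothetical names and values that belong neither to MPI nor to OpenMP.

```python
import asyncio


async def fake_irecv(delay, payload):
    """Hypothetical stand-in for a non-blocking receive.

    Completes after `delay` seconds and then delivers `payload`.
    """
    await asyncio.sleep(delay)
    return payload


async def main():
    log = []
    # Post the "receive"; it makes progress while other tasks run.
    recv = asyncio.create_task(fake_irecv(0.05, [1, 2, 3]))

    async def independent_work():
        # No dependency on the message, so it can run immediately.
        log.append("local compute")
        return 10

    async def dependent_work():
        # Dependency edge: this task resumes only once the message arrived.
        data = await recv
        log.append("consume message")
        return sum(data)

    local, remote = await asyncio.gather(independent_work(), dependent_work())
    return local + remote, log


result, log = asyncio.run(main())
print(result, log)
```

The dependent task does not block the runtime while it waits, so the independent computation overlaps with the pending communication, which is the effect that latency hiding aims for.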