TL;DR: This dissertation develops topology aware mapping algorithms for parallel applications with regular and irregular communication patterns, and proposes algorithms and techniques for automatic mapping of parallel applications so that application developers no longer have to perform the mapping by hand.
Abstract: Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect topologies and large network diameters. Computation is cheap and communication on the network is becoming the bottleneck for scaling of parallel applications. Network contention, specifically, is becoming an increasingly important factor affecting overall performance. The broad goal of this dissertation is performance optimization of parallel applications through reduction of network contention.
Most parallel applications have a certain communication topology. Mapping the tasks of a parallel application to the physical processors of a machine, based on their communication graph, can potentially lead to performance improvements. The research problem under consideration is the mapping of an application's communication graph onto the interconnect topology of a machine while trying to localize communication.
The farther messages travel on the network, the greater the chance that they share resources with other messages. On the networks commonly used today, this sharing can create contention. Evaluative studies in this dissertation show that on IBM Blue Gene and Cray XT machines, message latencies can be severely affected under contention. Realizing this fact, application developers have started paying attention to the mapping of tasks to physical processors to minimize contention. Placement of communicating tasks on nearby physical processors can minimize the distance traveled by messages and reduce the chances of contention.
Performance improvements through topology aware placement for applications such as NAMD and OpenAtom are used to motivate this work. Building on these ideas, the dissertation proposes algorithms and techniques for automatic mapping of parallel applications to relieve the application developers of this burden. The effect of contention on message latencies is studied in depth to guide the design of mapping algorithms. The hop-bytes metric is proposed for the evaluation of mapping algorithms as a better metric than the previously used maximum dilation metric. The main focus of this dissertation is on developing topology aware mapping algorithms for parallel applications with regular and irregular communication patterns. The automatic mapping framework is a suite of such algorithms with capabilities to choose the best mapping for a problem with a given communication graph. The dissertation also briefly discusses completely distributed mapping techniques which will be imperative for machines of the future.
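To make the two metrics concrete, the following is a minimal sketch rather than the dissertation's implementation. It assumes hop-bytes is the sum over all messages of message size times hops traveled, that maximum dilation is the largest hop count of any single message, and that the machine is a torus with wraparound links. The communication graph, torus size, and both mappings are hypothetical values chosen for illustration.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm_graph, mapping, dims):
    """Sum over all messages of (bytes sent) * (hops traveled)."""
    return sum(nbytes * torus_hops(mapping[src], mapping[dst], dims)
               for (src, dst), nbytes in comm_graph.items())


def max_dilation(comm_graph, mapping, dims):
    """Largest number of hops traveled by any single message."""
    return max(torus_hops(mapping[src], mapping[dst], dims)
               for (src, dst) in comm_graph)


# Hypothetical example: four tasks communicating in a ring on a 4x1x1 torus.
# Two edges carry heavy traffic (1000 bytes), two carry light traffic (10 bytes).
dims = (4, 1, 1)
comm = {(0, 1): 1000, (1, 2): 10, (2, 3): 1000, (3, 0): 10}

# Mapping that stretches the two heavy edges to 2 hops each.
heavy_far = {0: (0, 0, 0), 1: (2, 0, 0), 2: (1, 0, 0), 3: (3, 0, 0)}
# Mapping that stretches the two light edges to 2 hops each.
light_far = {0: (0, 0, 0), 1: (1, 0, 0), 2: (3, 0, 0), 3: (2, 0, 0)}

for name, mapping in (("heavy_far", heavy_far), ("light_far", light_far)):
    print(name, max_dilation(comm, mapping, dims), hop_bytes(comm, mapping, dims))
```

In this toy case both mappings have the same maximum dilation, yet their hop-bytes differ by roughly a factor of two, because only hop-bytes accounts for how much data crosses each link.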
TL;DR: This paper showcases the diverse functionalities as well as scalability of OpenAtom via performance case studies, with a focus on the recent additions and improvements to OpenAtom.
Abstract: The complex interplay of tightly coupled, but disparate, computation and communication operations poses several challenges for simulating atomic scale dynamics on multi-petaflops architectures. OpenAtom addresses these challenges by exploiting overdecomposition and asynchrony in Charm++, and scales to thousands of cores for realistic scientific systems with only a few hundred atoms. At the same time, it supports several interesting ab-initio molecular dynamics simulation methods including the Car-Parrinello method, Born-Oppenheimer method, k-points, parallel tempering, and path integrals. This paper showcases the diverse functionalities as well as scalability of OpenAtom via performance case studies, with a focus on the recent additions and improvements to OpenAtom. In particular, we study a metal organic framework (MOF) that consists of 424 atoms and is being explored as a candidate for a hydrogen storage material. Simulations of this system are scaled to large core counts on Cray XE6 and IBM Blue Gene/Q systems, and a time per step as low as \(1.7\,s\) is demonstrated for simulating path integrals with 32 beads of the MOF on 262,144 cores of Blue Gene/Q.
TL;DR: Topology aware mapping is presented as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve overall performance and scaling on large machines.
Abstract: Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence task mapping strategies should take the topology of the machine into account on large machines. In this paper, we present topology aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance.
Our methodology is facilitated by the idea of object-based decomposition used in Charm++, which separates the process of decomposition from the mapping of computation to processors and allows a more flexible mapping based on communication patterns between objects. Exploiting this and the topology of the allocated job partition, we present mapping strategies for a production code, OpenAtom, to improve overall performance and scaling. OpenAtom presents complex communication scenarios of interaction involving multiple groups of objects, which makes the mapping task a challenge. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.
TL;DR: A uniform API that provides topology information on 3D tori such as IBM Blue Gene and Cray XT machines is presented, along with techniques that use this API to improve performance.
Abstract: Optimal network performance is critical to efficient parallel scaling for communication-bound applications on large machines. With wormhole routing, no-load latencies do not increase significantly with the number of hops traveled. Yet we and others have recently shown that, in the presence of contention, message latencies can grow substantially. Hence task mapping strategies should take the topology of the machine into account on large machines. This poster presents a uniform API which provides topology information on 3D tori like IBM Blue Gene and Cray XT machines. We present techniques to use this API to improve performance. The API can be used by user-level codes to obtain information about allocated partitions at runtime, which is essential for mapping. We motivate why it is important to consider network topology, using a simple 3D Stencil kernel. We then present mapping strategies for a production code, OpenAtom, running on three-dimensional torus and mesh topologies. OpenAtom presents complex communication scenarios of interaction between multiple groups of objects. Results are presented in the context of 3D Stencil and OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P and 2,048 processors of Cray XT3.
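The role of topology in the 3D Stencil case can be illustrated with a small sketch. It is not the API described in the poster: the torus dimensions are hard-coded here as a hypothetical 4x4x4 partition, whereas a real code would query them at runtime, and one task per processor is assumed. The sketch compares the average number of hops per neighbor exchange for a mapping aligned with the torus against a randomly scattered one.

```python
import random
from itertools import product


def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus with wraparound."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def stencil_neighbors(coord, dims):
    """Yield the six nearest neighbors of a task in a periodic 3D stencil."""
    for axis in range(3):
        for step in (-1, 1):
            nbr = list(coord)
            nbr[axis] = (nbr[axis] + step) % dims[axis]
            yield tuple(nbr)


def average_hops(mapping, dims):
    """Average hops per neighbor message, given a task -> processor mapping."""
    total = 0
    count = 0
    for task in product(*(range(d) for d in dims)):
        for nbr in stencil_neighbors(task, dims):
            total += torus_hops(mapping[task], mapping[nbr], dims)
            count += 1
    return total / count


# Assumed partition shape; hypothetical, chosen only for illustration.
dims = (4, 4, 4)
tasks = list(product(*(range(d) for d in dims)))

# Topology aware mapping: task (x, y, z) runs on processor (x, y, z).
aligned = {t: t for t in tasks}

# Topology oblivious mapping: tasks scattered over processors at random.
rng = random.Random(0)
shuffled = tasks[:]
rng.shuffle(shuffled)
scattered = dict(zip(tasks, shuffled))

print("aligned:  ", average_hops(aligned, dims))
print("scattered:", average_hops(scattered, dims))
```

With the aligned mapping every neighbor exchange travels exactly one hop, while the scattered mapping sends the same messages over several hops each, so more messages share links and contention becomes more likely.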
TL;DR: This work presents CkDirect, an interface for one-sided communication in the message driven Charm++ runtime system, and describes the interface as well as its implementations on two different interconnects: Infiniband and Blue Gene/P.
Abstract: A significant fraction of parallel scientific codes are iterative with barriers between iterations or even between phases of the same iteration. In such codes, the sender of a message is assured that the receiver is executing exactly the same iteration or phase. This opens up the opportunity to use one-sided communication without synchronization, explicit or implicit, between the sender and receiver of every message. The synchronization inherent in the application is sufficient to ensure correctness. We present CkDirect, an interface for such one-sided communication in the message driven Charm++ runtime system. CkDirect helps avoid unnecessary synchronization and message copying as well as scheduling overhead in iterative scientific codes. We describe the interface as well as its implementations on two different interconnects: Infiniband and Blue Gene/P. We evaluate CkDirect through a micro-benchmark, two simple scientific codes (stencil computation and matrix multiplication), and a full-fledged quantum chemistry application called OpenAtom.
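The idea that application-level barriers can stand in for per-message synchronization can be sketched with a shared-memory analogy. This is not the CkDirect interface and does not use Charm++; it uses Python threads as stand-ins for processors, and all names (`recv_buffers`, `worker`, `received`) are hypothetical. Each worker writes directly into its peer's pre-allocated buffer, and only the barriers between phases guarantee that the data is complete before it is read and consumed before it is overwritten.

```python
import threading

ITERATIONS = 3
NUM_WORKERS = 2

# One pre-allocated receive buffer per worker; a peer writes into it directly.
recv_buffers = [[None] for _ in range(NUM_WORKERS)]
# Values each worker observed in its own buffer, one per iteration.
received = [[] for _ in range(NUM_WORKERS)]
barrier = threading.Barrier(NUM_WORKERS)


def worker(rank):
    peer = (rank + 1) % NUM_WORKERS
    for iteration in range(ITERATIONS):
        # One-sided "put": write straight into the peer's buffer,
        # with no handshake or acknowledgement for this message.
        recv_buffers[peer][0] = (rank, iteration)
        # Barrier between phases: every put of this iteration has landed.
        barrier.wait()
        received[rank].append(recv_buffers[rank][0])
        # Barrier before the next iteration: every buffer has been read,
        # so it is safe for the peer to overwrite it.
        barrier.wait()


threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(received)
```

Removing either barrier would allow a worker to read stale data or to overwrite a value its peer has not yet consumed, which is why this pattern only fits codes whose structure already provides that synchronization.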