TL;DR: This paper introduces P3DFFT, a popular software package that implements fast Fourier transforms in three dimensions in a highly efficient and scalable way and overcomes a well-known scalability bottleneck of three-dimensional (3D) FFT implementations by using two-dimensional domain decomposition.
Abstract: Fourier and related transforms are a family of algorithms widely employed in diverse areas of computational science, notoriously difficult to scale on high-performance parallel computers with a large number of processing elements (cores). This paper introduces a popular software package called P3DFFT which implements fast Fourier transforms (FFTs) in three dimensions in a highly efficient and scalable way. It overcomes a well-known scalability bottleneck of three-dimensional (3D) FFT implementations by using two-dimensional domain decomposition. Designed for portable performance, P3DFFT achieves excellent timings for a number of systems and problem sizes. On a Cray XT5 system P3DFFT attains 45% efficiency in weak scaling from 128 to 65,536 computational cores. Library features include Fourier and Chebyshev transforms, Fortran and C interfaces, in- and out-of-place transforms, uneven data grids, and single and double precision. P3DFFT is available as open source at http://code.google.com/p/p3dfft/. This pa...
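The scalability argument behind two-dimensional domain decomposition can be illustrated with a small sketch. It assumes a cubic N x N x N grid split over an M1 x M2 process grid, with the first dimension kept whole on each process. The helper names (`split_extent`, `pencil_shape`) and the rank-to-grid mapping are hypothetical and are not part of the P3DFFT interface, and the rule used for uneven grids is only one plausible choice.

```python
def split_extent(n, parts, index):
    """Return (start, size) of block `index` when n points are split
    into `parts` nearly equal contiguous blocks (handles uneven grids)."""
    base, extra = divmod(n, parts)
    size = base + (1 if index < extra else 0)
    start = index * base + min(index, extra)
    return start, size


def pencil_shape(n, m1, m2, rank):
    """Local block of an n^3 grid for `rank` in an m1 x m2 process grid.

    Assumption: ranks are numbered row-major over the process grid, the
    x dimension stays whole, and y and z are split across rows and columns.
    """
    row, col = divmod(rank, m2)
    _, ny = split_extent(n, m1, row)
    _, nz = split_extent(n, m2, col)
    return (n, ny, nz)


def max_ranks_slab(n):
    """A 1D (slab) decomposition splits one dimension: at most n ranks."""
    return n


def max_ranks_pencil(n):
    """A 2D (pencil) decomposition splits two dimensions: at most n*n ranks."""
    return n * n


if __name__ == "__main__":
    n = 4096
    print("slab limit:  ", max_ranks_slab(n))
    print("pencil limit:", max_ranks_pencil(n))
    print("local block for rank 0 on a 256 x 256 grid:",
          pencil_shape(n, 256, 256, 0))
```

With a slab decomposition the number of usable processes is bounded by the grid size in one dimension, whereas a pencil decomposition raises that bound to its square, which is why the latter can keep scaling to tens of thousands of cores.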
TL;DR: Experimental results on the Cray XT5 Kraken system demonstrate that the DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.
Abstract: We present a method for developing dense linear algebra algorithms that scale seamlessly to thousands of cores. The method is implemented in our project DPLASMA (Distributed PLASMA), which uses a novel generic distributed Directed Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms with heterogeneous multicore nodes: a DAG representation that is independent of the problem size, automatic extraction of the communication from the dependencies, overlapping of communication and computation, task prioritization, and architecture-aware scheduling and management of tasks. The originality of this engine lies in its capacity to translate a sequential code with nested loops into a concise and synthetic format which can then be interpreted and executed in a distributed environment. We present three common dense linear algebra algorithms from PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures), namely Cholesky, LU, and QR factorizations, to investigate their data-driven expression and execution in a distributed system. We demonstrate through experimental results on the Cray XT5 Kraken system that our DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.
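To make the idea of deriving a task graph from a nested-loop code concrete, the following sketch enumerates the tasks of a right-looking tile Cholesky factorization and infers dependencies from the tiles each task reads and writes. It is an illustration only: it unrolls the graph explicitly for a given number of tiles, whereas the abstract describes a representation that is independent of the problem size, and the function names are hypothetical rather than part of DAGuE or PLASMA.

```python
def cholesky_tasks(nt):
    """Yield (name, tiles_read, tile_written) in sequential loop order for a
    tile Cholesky factorization of an nt x nt lower-triangular tile matrix."""
    for k in range(nt):
        yield (f"POTRF({k})", [], (k, k))
        for m in range(k + 1, nt):
            yield (f"TRSM({m},{k})", [(k, k)], (m, k))
        for m in range(k + 1, nt):
            yield (f"SYRK({m},{k})", [(m, k)], (m, m))
            for n in range(k + 1, m):
                yield (f"GEMM({m},{n},{k})", [(m, k), (n, k)], (m, n))


def build_dag(nt):
    """Map each task to the set of tasks it must wait for.

    A task depends on the last writer of every tile it reads or updates.
    Write-after-read hazards are not tracked; in this algorithm a tile is
    never rewritten after other tasks start reading it, so none arise.
    """
    last_writer = {}
    deps = {}
    for name, reads, write in cholesky_tasks(nt):
        parents = set()
        for tile in reads + [write]:
            if tile in last_writer:
                parents.add(last_writer[tile])
        deps[name] = parents
        last_writer[write] = name
    return deps


if __name__ == "__main__":
    for task, parents in build_dag(3).items():
        print(f"{task:12s} <- {sorted(parents)}")
```

Any task whose parents have all completed can run, on any node, which is what allows a runtime to overlap communication with computation and to prioritize tasks on the critical path.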
TL;DR: This paper presents a distributed data sharing and task execution framework that employs data-centric task placement to map computations from the coupled applications onto processor cores so that a large portion of the data exchanges can be performed using the intra-node shared memory.
Abstract: Emerging scientific application workflows are composed of heterogeneous coupled component applications that simulate different aspects of the physical phenomena being modeled, and that interact and exchange significant volumes of data at runtime. With the increasing performance gap between on-chip data sharing and off-chip data transfers in current systems based on multicore processors, moving large volumes of data over the communication network fabric can significantly impact performance. As a result, minimizing the amount of inter-application data exchange that crosses compute nodes and uses the network is critical to achieving overall application performance and system efficiency. In this paper, we investigate the in-situ execution of the coupled components of a scientific application workflow so as to maximize on-chip exchange of data. Specifically, we present a distributed data sharing and task execution framework that (1) employs data-centric task placement to map computations from the coupled applications onto processor cores so that a large portion of the data exchanges can be performed using the intra-node shared memory, and (2) provides a shared space programming abstraction that supplements existing parallel programming models (e.g., message passing) with specialized one-sided asynchronous data access operators and can be used to express coordination and data exchanges between the coupled components. We also present the implementation of the framework and its experimental evaluation on the Jaguar Cray XT5 at Oak Ridge National Laboratory.
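The notion of data-centric task placement can be sketched with a simple greedy heuristic: each task goes to the node that already holds most of the data it needs, subject to free cores. This is an assumed, simplified stand-in and not the placement algorithm of the framework described above; the function name, input layout, and tie-breaking rule are all hypothetical.

```python
def place_tasks(tasks, data_location, cores_per_node):
    """Greedy data-centric placement (illustrative heuristic only).

    tasks:          {task: {data_id: bytes_needed}}
    data_location:  {data_id: node holding that data}
    cores_per_node: {node: number of free cores}
    Returns (placement, fraction of bytes served from the task's own node).
    """
    free = dict(cores_per_node)
    placement = {}
    local_bytes = 0
    total_bytes = 0
    # Place the tasks that move the most data first.
    for task in sorted(tasks, key=lambda t: -sum(tasks[t].values())):
        per_node = {}
        for data_id, nbytes in tasks[task].items():
            node = data_location[data_id]
            per_node[node] = per_node.get(node, 0) + nbytes
        candidates = [n for n in free if free[n] > 0]
        if not candidates:
            raise ValueError("more tasks than available cores")
        # Choose the free node that already holds the most input bytes.
        best = max(candidates, key=lambda n: per_node.get(n, 0))
        free[best] -= 1
        placement[task] = best
        local_bytes += per_node.get(best, 0)
        total_bytes += sum(per_node.values())
    fraction = local_bytes / total_bytes if total_bytes else 1.0
    return placement, fraction


if __name__ == "__main__":
    tasks = {"t1": {"a": 100, "b": 10}, "t2": {"a": 50}}
    location = {"a": "n0", "b": "n1"}
    print(place_tasks(tasks, location, {"n0": 1, "n1": 1}))
```

The returned fraction is a rough proxy for how much of the exchange could use intra-node shared memory instead of the network under this toy model.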
TL;DR: The scalability of two ultra-high-resolution coupled configurations on leadership class computing platforms is described, demonstrating the ability to utilize over 30,000 processor cores on a Cray XT5 system and over 60,000 cores on an IBM Blue Gene/P system to obtain climatologically relevant simulation rates.
Abstract: With the fourth release of the Community Climate System Model, ultra-high-resolution climate simulations are now possible, enabling eddy-resolving ocean and sea-ice models to be coupled to a finite-volume atmosphere model for a range of atmospheric resolutions. This capability was made possible by enabling the model to use large-scale parallelism, which required a significant refactoring of the software infrastructure. We describe the scalability of two ultra-high-resolution coupled configurations on leadership class computing platforms. We demonstrate the ability to utilize over 30,000 processor cores on a Cray XT5 system and over 60,000 cores on an IBM Blue Gene/P system to obtain climatologically relevant simulation rates for these configurations.
TL;DR: This work examines the performance and scalability of algebraic multigrid (AMG) on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM Blue Gene/P system with a single Quad-core processor (Intrepid).
Abstract: Algebraic multigrid (AMG) is a popular solver for large-scale scientific computing and an essential component of many simulation codes. AMG has been shown to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can significantly deteriorate AMG's performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM Blue Gene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.