Cray XT5

Papers

P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions

15 Aug 2012 - SIAM Journal on Scientific Computing

TL;DR: This paper introduces P3DFFT, a popular software package that implements fast Fourier transforms in three dimensions in a highly efficient and scalable way. It overcomes a well-known scalability bottleneck of three-dimensional (3D) FFT implementations by using two-dimensional domain decomposition.

Abstract: Fourier and related transforms are a family of algorithms widely employed in diverse areas of computational science, and they are notoriously difficult to scale on high-performance parallel computers with a large number of processing elements (cores). This paper introduces a popular software package called P3DFFT, which implements fast Fourier transforms (FFTs) in three dimensions in a highly efficient and scalable way. It overcomes a well-known scalability bottleneck of three-dimensional (3D) FFT implementations by using two-dimensional domain decomposition. Designed for portable performance, P3DFFT achieves excellent timings for a number of systems and problem sizes. On a Cray XT5 system, P3DFFT attains 45% efficiency in weak scaling from 128 to 65,536 computational cores. Library features include Fourier and Chebyshev transforms, Fortran and C interfaces, in- and out-of-place transforms, uneven data grids, and single and double precision. P3DFFT is available as open source at http://code.google.com/p/p3dfft/. This pa...
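The scalability point in the abstract above can be illustrated with a minimal sketch. It is not P3DFFT's API: the function names `pencil_shape` and `max_processes`, the choice of which two dimensions are split, and the remainder-spreading rule are all assumptions made for illustration. The sketch shows how an n x n x n grid might be divided over a two-dimensional process grid, and why splitting two dimensions raises the ceiling on usable processes compared with splitting one.

```python
def pencil_shape(n, p_rows, p_cols, row, col):
    """Local block of an n*n*n grid owned by process (row, col).

    Assumption: the first dimension stays whole on every process and the
    other two are split over a p_rows x p_cols process grid.
    """
    def split(length, parts, index):
        # Spread any remainder over the first (length % parts) processes,
        # so uneven grids are still fully covered.
        base, extra = divmod(length, parts)
        return base + (1 if index < extra else 0)

    return (n, split(n, p_rows, row), split(n, p_cols, col))


def max_processes(n, dims_split):
    """Upper bound on the process count when dims_split dimensions are
    divided: each process needs at least one grid point along every
    divided dimension."""
    return n ** dims_split
```

Under these assumptions, a one-dimensional (slab) split of a 4096-point cube can use at most `max_processes(4096, 1)` = 4,096 processes, while a two-dimensional (pencil) split allows up to `max_processes(4096, 2)` = 16,777,216.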

253 citations

Proceedings Article • DOI: 10.1109/IPDPS.2011.299

Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA

George Bosilca¹, Aurelien Bouteiller¹, Anthony Danalis¹, Mathieu Faverge¹, Azzam Haidar¹, Thomas Herault¹, Jakub Kurzak¹, Julien Langou², Pierre Lemarinier¹, Hatem Ltaief¹, Piotr Luszczek¹, Asim YarKhan¹, Jack Dongarra¹

University of Tennessee¹, University of Colorado Denver²

16 May 2011

TL;DR: Experimental results on the Cray XT5 Kraken system demonstrate that the DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.

Abstract: We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA) that uses a novel generic distributed Directed Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms with heterogeneous multicore nodes: DAG representation that is independent of the problem size, automatic extraction of the communication from the dependencies, overlapping of communication and computation, task prioritization, and architecture-aware scheduling and management of tasks. The originality of this engine lies in its capacity to translate a sequential code with nested loops into a concise and synthetic format which can then be interpreted and executed in a distributed environment. We present three common dense linear algebra algorithms from PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures), namely Cholesky, LU, and QR factorizations, to investigate their data-driven expression and execution in a distributed system. We demonstrate through experimental results on the Cray XT5 Kraken system that our DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.
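The idea of deriving a task graph from sequential nested loops, mentioned in the abstract above, can be sketched for tile Cholesky. This is an illustration only, not DAGuE's format or API: DAGuE is described as using a representation independent of the problem size, whereas this sketch simply unrolls the loops for a given tile count. The names `tile_cholesky_tasks` and `build_dag`, the task labels, and the "last writer of each tile" dependency rule are all assumptions.

```python
def tile_cholesky_tasks(nt):
    """Yield (name, tiles_read, tile_written) in sequential loop order for
    the lower triangle of an nt x nt tiled matrix. Each task also updates
    its written tile in place."""
    for k in range(nt):
        yield (f"POTRF({k})", [], (k, k))
        for m in range(k + 1, nt):
            yield (f"TRSM({m},{k})", [(k, k)], (m, k))
        for m in range(k + 1, nt):
            yield (f"SYRK({m},{k})", [(m, k)], (m, m))
            for n in range(k + 1, m):
                yield (f"GEMM({m},{n},{k})", [(m, k), (n, k)], (m, n))


def build_dag(nt):
    """Map each task name to the sorted list of tasks it depends on.

    A task depends on the most recent earlier task that wrote any tile it
    reads or updates (hypothetical rule chosen for this sketch).
    """
    last_writer = {}
    deps = {}
    for name, reads, write in tile_cholesky_tasks(nt):
        touched = reads + [write]
        deps[name] = sorted({last_writer[t] for t in touched if t in last_writer})
        last_writer[write] = name
    return deps
```

For a 3 x 3 tile matrix, `build_dag(3)` contains 10 tasks. `POTRF(0)` has no predecessors, and `TRSM(2,1)` waits on both `POTRF(1)` and `GEMM(2,1,0)`. Tasks with no path between them in this graph could run concurrently on different nodes.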

185 citations

Proceedings Article • DOI: 10.1109/IPDPS.2012.122

Enabling In-situ Execution of Coupled Scientific Workflow on Multi-core Platform

Fan Zhang¹, Ciprian Docan¹, Manish Parashar¹, Scott Klasky², Norbert Podhorszki², Hasan Abbasi²

Rutgers University¹, Oak Ridge National Laboratory²

21 May 2012

TL;DR: This paper presents a distributed data sharing and task execution framework that employs data-centric task placement to map computations from the coupled applications onto processor cores so that a large portion of the data exchanges can be performed using the intra-node shared memory.

Abstract: Emerging scientific application workflows are composed of heterogeneous coupled component applications that simulate different aspects of the physical phenomena being modeled, and that interact and exchange significant volumes of data at runtime. With the increasing performance gap between on-chip data sharing and off-chip data transfers in current systems based on multicore processors, moving large volumes of data using the communication network fabric can significantly impact performance. As a result, minimizing the amount of inter-application data exchanges that cross compute nodes and use the network is critical to achieving overall application performance and system efficiency. In this paper, we investigate the in-situ execution of the coupled components of a scientific application workflow so as to maximize on-chip exchange of data. Specifically, we present a distributed data sharing and task execution framework that (1) employs data-centric task placement to map computations from the coupled applications onto processor cores so that a large portion of the data exchanges can be performed using the intra-node shared memory, and (2) provides a shared space programming abstraction that supplements existing parallel programming models (e.g., message passing) with specialized one-sided asynchronous data access operators and can be used to express coordination and data exchanges between the coupled components. We also present the implementation of the framework and its experimental evaluation on the Jaguar Cray XT5 at Oak Ridge National Laboratory.

87 citations

Journal Article • DOI: 10.1177/1094342012436965

Computational performance of ultra-high-resolution capability in the Community Earth System Model

John M. Dennis¹, Mariana Vertenstein¹, Patrick H. Worley², Arthur A. Mirin³, Anthony Craig¹, Robert Jacob⁴, Sheri Mickelson⁴

National Center for Atmospheric Research¹, Oak Ridge National Laboratory², Lawrence Livermore National Laboratory³, Argonne National Laboratory⁴

1 Feb 2012

TL;DR: The scalability of two ultra-high-resolution coupled configurations on leadership class computing platforms is described, demonstrating the ability to utilize over 30,000 processor cores on a Cray XT5 system and over 60,000 cores on an IBM Blue Gene/P system to obtain climatologically relevant simulation rates.

Abstract: With the fourth release of the Community Climate System Model, it is now possible to perform ultra-high-resolution climate simulations, enabling eddy-resolving ocean and sea-ice models to be coupled to a finite-volume atmosphere model for a range of atmospheric resolutions. This capability was made possible by enabling the model to use large-scale parallelism, which required a significant refactoring of the software infrastructure. We describe the scalability of two ultra-high-resolution coupled configurations on leadership class computing platforms. We demonstrate the ability to utilize over 30,000 processor cores on a Cray XT5 system and over 60,000 cores on an IBM Blue Gene/P system to obtain climatologically relevant simulation rates for these configurations.

86 citations

Proceedings Article • DOI: 10.1109/IPDPS.2011.35

Challenges of Scaling Algebraic Multigrid Across Modern Multicore Architectures

Allison H. Baker¹, Todd Gamblin¹, Martin Schulz¹, Ulrike Meier Yang¹

Lawrence Livermore National Laboratory¹

16 May 2011

TL;DR: This work examines the performance and scalability of algebraic multigrid on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM Blue Gene/P system with a single Quad-core processor (Intrepid).

Abstract: Algebraic multigrid (AMG) is a popular solver for large-scale scientific computing and an essential component of many simulation codes. AMG has been shown to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can significantly deteriorate AMG's performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM Blue Gene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.

72 citations

Performance Metrics

No. of papers in the topic in previous years
Year	Papers
2015	1
2014	3
2013	2
2012	17
2011	22
2010	22