TL;DR: Cray Research's massively parallel processing (MPP) philosophy is presented, together with a brief description of the design of the Cray T3D, the first MPP designed by Cray Research, and the 3-D torus interprocessor interconnect is discussed.
Abstract: The authors present Cray Research's massively parallel processing (MPP) philosophy, together with a brief description of the design of the Cray T3D, the first MPP designed by Cray Research. They give a brief overview of the important features of the Cray T3D, including the 3-D torus interprocessor interconnect. They discuss in more detail the motivation for the 3-D torus interconnect. Using a very simple capacity model of network performance, they show how three dimensions provide a solid balance between locality and scalability up to thousands of nodes. >
TL;DR: The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers), which provide a rich set of atomic memory operations and a flexible, user-level messaging facility.
Abstract: This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization.The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.
TL;DR: A new architecture with a six-dimensional mesh/torus topology achieves highly scalable and fault-tolerant interconnection networks for large-scale supercomputers that can exceed 10 petaflops.
Abstract: A new architecture with a six-dimensional mesh/torus topology achieves highly scalable and fault-tolerant interconnection networks for large-scale supercomputers that can exceed 10 petaflops.
TL;DR: It is shown that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand Clusters in comparison to Blue Gene/Q.
Abstract: Three types of systems dominate the current High Performance Computing landscape: the Cray XE6, the IBM Blue Gene, and commodity clusters using InfiniBand. These systems have quite different characteristics making the choice for a particular deployment difficult. The XE6 uses Cray's proprietary Gemini 3-D torus interconnect with two nodes at each network endpoint. The latest IBM Blue Gene/Q uses a single socket integrating processor and communication in a 5-D torus network. InfiniBand provides the flexibility of using nodes from many vendors connected in many possible topologies. The performance characteristics of each vary vastly along with their utilization model. In this work we compare the performance of these three systems using a combination of micro-benchmarks and a set of production applications. We also discuss the causes of variability in performance across the systems and quantify where performance is lost using a combination of measurements and models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand Clusters in comparison to Blue Gene/Q.
TL;DR: The Desmos cluster that consists of 32 hybrid nodes connected by a low-latency high-bandwidth torus interconnect is described, which verifies its ability to unite MPP systems speeding-up effectively MPI-based applications.
Abstract: In this paper, we describe the Desmos cluster that consists of 32 hybrid nodes connected by a low-latency high-bandwidth torus interconnect. This cluster is aimed at cost-effective classical molecular dynamics calculations. We present strong scaling benchmarks for GROMACS, LAMMPS and VASP and compare the results with other HPC systems. This cluster serves as a test bed for the Angara interconnect that supports 3D and 4D torus network topologies, and verifies its ability to unite MPP systems speeding-up effectively MPI-based applications. We describe the interconnect presenting typical MPI benchmarks.