TL;DR: The paper describes some very early experiments on new architectures that support the hybrid programming model, demonstrates a tool that can examine the detailed sequence of events in a hybrid program, and illustrates that a hybrid computation might not always proceed as expected.
Abstract: The paper describes some very early experiments on new architectures that support the hybrid programming model. Our results are promising in that OpenMP threads interact with MPI as desired, allowing OpenMP-agnostic tools to be used. We explore three environments: a "typical" Linux cluster, a new large-scale machine from SiCortex, and the new IBM BG/P, which have quite different compilers and runtime systems for both OpenMP and MPI. We look at a few simple, diagnostic programs, and one "application-like" test program. We demonstrate the use of a tool that can examine the detailed sequence of events in a hybrid program and illustrate that a hybrid computation might not always proceed as expected.
TL;DR: A new communication network (the SiCortex fabric) employed in the SiCortex SC5832 integrated cluster system is described; it provides low-latency, high-bandwidth, reliable communication between the 972 nodes of the SiCortex system.
Abstract: Much of high performance technical computing has moved from shared memory architectures to message-based cluster systems. The development and wide adoption of the MPI parallel programming model has hastened this transition. Parallel scaling, however, is frequently limited by the inefficient communication hardware commonly found in commodity-based clusters. This paper describes a new communication network (the SiCortex fabric) employed in the SiCortex SC5832 integrated cluster system. The fabric switch and communications controller are integrated with a single-chip multiprocessor node and provide three point-to-point links per node chip. The resulting design provides low-latency, high-bandwidth, reliable communication between the 972 nodes of the SiCortex system.
TL;DR: The SiCortex cluster systems implement a high-bandwidth, low-latency interconnect with RDMA, including zero-copy data transfers and user-level networking, and use optimistic virtual memory registration without page locking.
Abstract: The SiCortex cluster systems implement a high-bandwidth, low-latency interconnect. We describe how the SiCortex systems implement RDMA, including zero-copy data transfers and user-level networking. The system uses optimistic virtual memory registration without page locking. Finally, we provide preliminary performance results.
TL;DR: The challenges of I/O performance encountered in making this model practical are presented, and results are shown using both micro-benchmarks and real applications on two large-scale systems, the BG/P and the SiCortex SC5832.
Abstract: Our work addresses the execution of highly parallel computations composed of loosely coupled serial jobs on large-scale systems, with no modifications to the respective applications. This approach allows new, and potentially far larger, classes of application to leverage systems such as the IBM Blue Gene/P supercomputer and similar emerging petascale architectures. We present here the challenges of I/O performance encountered in making this model practical, and show results using both micro-benchmarks and real applications on two large-scale systems, the BG/P and the SiCortex SC5832. Our preliminary benchmarks show that we can scale to 4096 processors on the Blue Gene/P and 5832 processors on the SiCortex with high efficiency, and can achieve sustained execution rates of thousands of tasks/sec for parallel workloads of ordinary serial applications. We measured applications from two domains, economic energy modeling and molecular dynamics.
TL;DR: This work investigates the potential of the SiCortex platform for numerical simulation by analyzing the performance of a set of elementary benchmarks and two fluid dynamics applications executed on the SC072 and the SC5832 systems. An analysis of the CFD applications shows that the SiCortex systems make it possible to significantly reduce energy consumption compared to a commodity cluster.
Abstract: Most cluster systems used today for high-performance scientific computing are built from off-the-shelf standard components placed in racks. SiCortex has chosen a different strategy and offers a line of integrated cluster machines based on a customized low-frequency MIPS multicore processor and a specialized network fabric. We investigate the potential of the SiCortex platform for numerical simulation by analyzing the performance of a set of elementary benchmarks and two fluid dynamics applications executed on the SC072 and the SC5832 systems. The elementary benchmarks quantify the performance in terms of computation rate, memory bandwidth and communication latency. The fluid dynamics applications provide insight into how well existing scientific code performs on the system. The results are compared to those obtained on a commodity cluster with Intel Xeon cores and Infiniband interconnect. The focus of the evaluation is computational performance, but we also consider the energy consumption for all three machines. Our results indicate that while the SiCortex systems might be well suited to applications that can be parallelized to a very fine level, they are outperformed by commodity clusters when this is not the case. However, an analysis of the CFD applications shows that the SiCortex systems make it possible to significantly reduce the energy consumption compared to a commodity cluster.