TL;DR: In this paper, the authors present the results of the Hadron Spectrum Collaboration on propagators for the Blue Waters sustained-petascale computing project at the University of Illinois at Urbana-Champaign.
Abstract: We thank our colleagues within the Hadron Spectrum Collaboration, and in particular, thank Bálint Joó for his help. The software codes Chroma [43], QUDA [34, 35], QPhiX [44], and QOPQDP [32, 33] were used to compute the propagators required for this project. The contractions were performed on clusters at Jefferson Laboratory under the USQCD Initiative and the LQCD ARRA project. This research was supported in part under an ALCC award, and used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This research is also part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This work is also part of the PRAC “Lattice QCD on Blue Waters”. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. Gauge configurations were generated using resources awarded from the U.S. Department of Energy INCITE program at Oak Ridge National Lab, and also resources awarded at NERSC. RAB, RGE and JJD acknowledge support from U.S. Department of Energy contract DE-AC05-06OR23177, under which Jefferson Science Associates, LLC, manages and operates Jefferson Laboratory. JJD acknowledges support from the U.S. Department of Energy Early Career award contract DE-SC0006765. CET acknowledges partial support from the U.K. Science and Technology Facilities Council [grant number ST/L000385/1].
TL;DR: An analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign, based on both manual failure reports and automatically generated event logs collected over 261 days finds hardware is not the main cause of system downtime.
Abstract: This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level failover as well as memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime, even though hardware-related failures account for 42% of all failures: failures caused by hardware were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% over a set of more than 1.5 million analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to the node repair hours (53%), despite being the cause of only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system, with 42% of them caused by the inadequacy of the automated failover procedures.
TL;DR: The Lightweight Distributed Metric Service is introduced for scalable, lightweight monitoring of large-scale computing systems and applications, along with the motivations, metrics of choice, and requirements guiding its deployment, including those relating to the scale and specialized nature of Blue Waters.
Abstract: Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large-scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform, including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.
TL;DR: Alya's main features are introduced, with particular focus on its solvers and on their performance at up to 100,000 processors on Blue Waters, the NCSA supercomputer, using selected multi-physics tests that are representative of the engineering world.
TL;DR: CPPTRAJ now has two additional levels of message passing (MPI) parallelism, across-trajectory processing and across-ensemble processing, leading to significant speedups in the analysis of large datasets on the NCSA Blue Waters supercomputer by better leveraging its many available nodes and its parallel file system.