Topic

Blue Waters

About: Blue Waters is a research topic. Over its lifetime, 62 publications have appeared within this topic, receiving 1506 citations.

Papers

Journal Article•10.1103/PHYSREVD.92.094502•

Coupled ππ, KK̄ scattering in P-wave and the ρ resonance from lattice QCD

David J. Wilson¹, Raúl A. Briceño¹, Jozef J. Dudek¹, Robert Edwards¹, Christopher E. Thomas²•Institutions (2)

Thomas Jefferson National Accelerator Facility¹, University of Cambridge²

09 Jul 2015-Physical Review D

TL;DR: This Hadron Spectrum Collaboration study of coupled ππ, KK̄ scattering and the ρ resonance in lattice QCD is part of the Blue Waters sustained-petascale computing project at the University of Illinois at Urbana-Champaign.

Abstract: We thank our colleagues within the Hadron Spectrum Collaboration, and in particular, thank Bálint Joó for his help. The software codes Chroma [43], QUDA [34, 35], QPhiX [44], and QOPQDP [32, 33] were used to compute the propagators required for this project. The contractions were performed on clusters at Jefferson Laboratory under the USQCD Initiative and the LQCD ARRA project. This research was supported in part under an ALCC award, and used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This research is also part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This work is also part of the PRAC “Lattice QCD on Blue Waters”. This research used resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. Gauge configurations were generated using resources awarded from the U.S. Department of Energy INCITE program at Oak Ridge National Lab, and also resources awarded at NERSC. RAB, RGE and JJD acknowledge support from U.S. Department of Energy contract DE-AC05-06OR23177, under which Jefferson Science Associates, LLC, manages and operates Jefferson Laboratory. JJD acknowledges support from the U.S. Department of Energy Early Career award contract DE-SC0006765. CET acknowledges partial support from the U.K. Science and Technology Facilities Council [grant number ST/L000385/1].

230 citations

Proceedings Article•10.1109/DSN.2014.62•

Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters

Catello Di Martino¹, Zbigniew Kalbarczyk¹, Ravishankar K. Iyer¹, Fabio Baccanico², Joseph Fullop, William Kramer•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of Naples Federico II²

23 Jun 2014

TL;DR: An analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign, based on both manual failure reports and automatically generated event logs collected over 261 days finds hardware is not the main cause of system downtime.

Abstract: This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include i) a characterization of the root causes of single-node failures, ii) a direct assessment of the effectiveness of system-level fail over as well as memory, processor, network, GPU accelerator, and file system error resiliency, and iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime. This is notwithstanding the fact that hardware-related failures are 42% of all failures. Failures caused by hardware were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chip kill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% out of a set of more than 1.5 million of analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to the node repair hours (53%), despite being the cause of only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system with 42% of them caused by the inadequacy of the automated fail over procedures.
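The contrast the abstract draws between how often a cause occurs and how much repair effort it consumes can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it uses just the percentages quoted above, the names `failure_share`, `repair_share`, and `repair_weight` are hypothetical, and it assumes that "total repair time" (hardware) and "node repair hours" (software) refer to the same measure.

```python
# Percentages quoted in the abstract above; nothing here comes from the paper's data.
failure_share = {"hardware": 42, "software": 20}  # % of all failures
# Assumption: "total repair time" and "node repair hours" are the same measure.
repair_share = {"hardware": 23, "software": 53}   # % of repair time


def repair_weight(cause: str) -> float:
    """Ratio of a cause's repair-time share to its failure-count share.

    A value above 1 means the cause costs more repair time than its
    failure count alone would suggest; below 1 means it costs less.
    """
    return repair_share[cause] / failure_share[cause]


# Fraction of system-wide outages that involved the Lustre file system.
lustre_outage_fraction = 29 / 39

if __name__ == "__main__":
    for cause in failure_share:
        print(f"{cause}: repair weight {repair_weight(cause):.2f}")
    print(f"Lustre-related outages: {lustre_outage_fraction:.0%}")
```

Under that assumption, a software failure carries roughly five times the repair weight of a hardware failure, which is consistent with the abstract's statement that hardware is not the main cause of downtime.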

229 citations

Proceedings Article•10.1109/SC.2014.18•

The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications

Anthony Agelastos¹, Benjamin A. Allan¹, Jim Brandt¹, Paul Cassella², Jeremy Enos³, Joshi Fullop³, Ann C. Gentile¹, Steve Monk¹, Nichamon Naksinehaboon, Jeff Ogden¹, Mahesh Rajan¹, Michael Showerman³, Joel O. Stevenson¹, Narate Taerat, Thomas Tucker•Institutions (3)

Sandia National Laboratories¹, Cray², University of Illinois at Urbana–Champaign³

16 Nov 2014

TL;DR: The Lightweight Distributed Metric Service is introduced for scalable, lightweight monitoring of large scale computing systems and applications, along with the motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters.

Abstract: Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

222 citations

Journal Article•10.1016/J.JOCS.2015.12.007•

Alya: Multiphysics engineering simulation toward exascale

Mariano Vázquez¹·², Guillaume Houzeaux¹, Seid Koric³, Antoni Artigues¹, Jazmin Aguado-Sierra¹, Ruth Arís¹, Daniel Mira¹, Hadrien Calmet¹, Fernando M. Cucchietti¹, Herbert Owen¹, Ahmed Taha³, Evan Dering Burness³, José María Cela¹, Mateo Valero¹•Institutions (3)

Barcelona Supercomputing Center¹, Spanish National Research Council², University of Illinois at Urbana–Champaign³

01 May 2016-Journal of Computational Science

TL;DR: Alya's main features are introduced, with particular focus on its solvers and on performance up to 100,000 processors on Blue Waters, the NCSA supercomputer, using selected multi-physics tests that are representative of the engineering world.

202 citations

Journal Article•10.1002/JCC.25382•

Parallelization of CPPTRAJ enables large scale analysis of molecular dynamics trajectory data.

Daniel R. Roe¹, Thomas E. Cheatham²•Institutions (2)

National Institutes of Health¹, University of Utah²

30 Sep 2018-Journal of Computational Chemistry

TL;DR: CPPTRAJ now has two additional levels of message passing (MPI) parallelism involving both across‐trajectory processing and across‐ensemble processing, leading to significant speed ups in data analysis of large datasets on the NCSA Blue Waters supercomputer by better leveraging the many available nodes and its parallel file system.

Abstract: Advances in biomolecular simulation methods and access to large scale computer resources have led to a massive increase in the amount of data generated. The key enablers have been optimization and parallelization of the simulation codes. However, much of the software used to analyze trajectory data from these simulations is still run in serial, or in some cases many threads via shared memory. Here, we describe the addition of multiple levels of parallel trajectory processing to the molecular dynamics simulation analysis software CPPTRAJ. In addition to the existing OpenMP shared-memory parallelism, CPPTRAJ now has two additional levels of message passing (MPI) parallelism involving both across-trajectory processing and across-ensemble processing. All three levels of parallelism can be simultaneously active, leading to significant speed ups in data analysis of large datasets on the NCSA Blue Waters supercomputer by better leveraging the many available nodes and its parallel file system. © 2018 Wiley Periodicals, Inc.

129 citations

Performance Metrics

Papers

292

Citations

No. of papers in the topic in previous years
Year	Papers
2021	1
2020	1
2019	2
2018	8
2017	10
2016	9
