Proceedings Article, DOI: 10.1109/ICPP.2009.73
Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems
Xiangyong Ouyang,Karthik Gopalakrishnan,Dhabaleswar K. Panda +2 more
- 22 Sep 2009
- pp 34-41
TL;DR: This work explores the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as its checkpointing library, and proposes to optimize checkpoint creation by classifying checkpoint file writes as small, medium, or large according to the amount of data written, then using write aggregation to optimize the small and medium writes.
Abstract: Clusters and applications continue to grow in size while their mean time between failures (MTBF) is getting smaller. Checkpoint/Restart is therefore becoming increasingly important for large-scale parallel jobs. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size due to constraints within the file system. With the advent of multi-core architectures, the situation is aggravated because a larger number of processes run on the same node and try to checkpoint simultaneously. This increases the number of file writes issued at checkpoint time, which degrades performance. As a result, deployment of Checkpoint/Restart mechanisms for large-scale parallel applications is limited. In this work, we explore the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as the checkpointing library. Our profiling of the checkpoints for the NAS parallel benchmarks revealed a large number of small file writes interspersed with large writes. Based on these observations, we propose to optimize checkpoint creation by classifying checkpoint file writes into small, medium, and large writes according to the amount of data written, and by using write aggregation to optimize the small and medium writes. At an aggregation threshold of 512KB, the implementation of our design in BLCR shows improvements of 27% to 32% over the original BLCR in the time needed to checkpoint an MPI application.
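The classification-and-aggregation idea in the abstract can be illustrated with a minimal sketch. This is not BLCR's code or the authors' implementation: the names `classify` and `AggregatingWriter` are hypothetical, the 4 KB small/medium boundary is an assumed value (the abstract gives only the 512KB aggregation threshold), and because the abstract does not say how small and medium writes are treated differently, the sketch buffers both in the same way.

```python
import io

SMALL_MAX = 4 * 1024        # ASSUMPTION: small/medium boundary, not given in the abstract
AGG_THRESHOLD = 512 * 1024  # aggregation threshold quoted in the abstract


def classify(nbytes):
    """Label a write as small, medium, or large by its size in bytes."""
    if nbytes < SMALL_MAX:
        return "small"
    if nbytes < AGG_THRESHOLD:
        return "medium"
    return "large"


class AggregatingWriter:
    """Hypothetical node-level aggregator: buffers small and medium writes
    and passes large writes straight through to the underlying sink."""

    def __init__(self, sink, buffer_size=AGG_THRESHOLD):
        self.sink = sink
        self.buffer = bytearray()
        self.buffer_size = buffer_size
        self.sink_writes = 0  # number of writes that actually reach the sink

    def _emit(self, data):
        self.sink.write(data)
        self.sink_writes += 1

    def flush(self):
        """Write out any buffered data as a single sink write."""
        if self.buffer:
            self._emit(bytes(self.buffer))
            self.buffer.clear()

    def write(self, data):
        if classify(len(data)) == "large":
            # Flush first so bytes reach the sink in their original order.
            self.flush()
            self._emit(data)
            return
        if len(self.buffer) + len(data) > self.buffer_size:
            self.flush()
        self.buffer += data


if __name__ == "__main__":
    sink = io.BytesIO()
    writer = AggregatingWriter(sink)
    # Synthetic workload: many small writes, one large write, a few medium writes.
    payloads = [b"a" * 100] * 1000 + [b"b" * (1024 * 1024)] + [b"c" * 8192] * 10
    for payload in payloads:
        writer.write(payload)
    writer.flush()
    print(len(payloads), "application writes became", writer.sink_writes, "sink writes")
```

The sketch flushes the buffer before every large write so the checkpoint byte stream stays in order; a real implementation inside a checkpointing library would also have to coordinate the buffer across the processes on a node, which is omitted here.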
Citations
McrEngine: a scalable checkpointing system using data-aware aggregation and compression
Tanzima Islam,Kathryn Mohror,Saurabh Bagchi,Adam Moody,Bronis R. de Supinski,Rudolf Eigenmann +5 more
- 10 Nov 2012
TL;DR: MCRENGINE aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries such as HDF5 and netCDF, and compresses them.
Enhancing Checkpoint Performance with Staging IO and SSD
Xiangyong Ouyang,Sonya Marcarelli,Dhabaleswar K. Panda +2 more
- 03 May 2010
TL;DR: A new strategy to enhance checkpoint writing performance by aggregating checkpoint writing at client side, and utilizing staging IO on data servers is proposed, which achieves up to 6.3 times higher write bandwidth than a popular parallel file system PVFS2 with 8 client nodes and 4 data servers.
RDMA-Based Job Migration Framework for MPI over InfiniBand
Xiangyong Ouyang,Sonya Marcarelli,Raghunath Rajachandrasekar,Dhabaleswar K. Panda +3 more
- 20 Sep 2010
TL;DR: This paper enhances the fault tolerance of MVAPICH2, an open-source high performance MPI-2 implementation, by using a proactive job migration scheme that transfers the processes running on a health-deteriorating node to a healthy spare node, and resumes these processes from the spare node.
Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture
Xiangyong Ouyang,Karthik Gopalakrishnan,Tejus Gangadharappa,Dhabaleswar K. Panda +3 more
- 01 Dec 2009
TL;DR: The Write Aggregation with Dynamic Buffer and Interleaving scheme is proposed to significantly reduce checkpoint creation overhead by aggregating all checkpoint writes into a dynamic buffer pool and overlapping application progress with the file writes.
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart
Xiangyong Ouyang,Raghunath Rajachandrasekar,Xavier Besseron,Hao Wang,Jian Huang,Dhabaleswar K. Panda +5 more
- 13 Sep 2011
TL;DR: This paper proposes the Checkpoint-Restart File System (CRFS), a lightweight user-level filesystem based on FUSE, which the authors present as the first such portable, lightweight filesystem designed for generic Checkpoint/Restart data.
References
Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
Paul Hargrove,Jason Duell +1 more
- 01 Sep 2006
TL;DR: The motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI, are described.
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
George Bosilca,Aurelien Bouteiller,Franck Cappello,Samir Djilali,Gilles Fedak,Cécile Germain,Thomas Herault,Pierre Lemarinier,Oleg Lodygensky,Frédéric Magniette,Vincent Neri,Anton Selikhov +11 more
- 16 Nov 2002
TL;DR: This work presents MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging, and presents a detailed performance evaluation of every component and its global performance for non-trivial parallel applications.
The design and implementation of Berkeley Lab's Linux checkpoint/restart
TL;DR: BLCR can be used either as a stand-alone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.
The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
Joshua Hursey,Jeffrey M. Squyres,T.I. Mattox,Andrew Lumsdaine +3 more
- 26 Mar 2007
TL;DR: The design and implementation of an infrastructure to support checkpoint/restart fault tolerance in the Open MPI project is presented. The framework is meant to be extensible and to encourage experimentation with alternative techniques within a production-quality MPI implementation.
Architectural Requirements and Scalability of the NAS Parallel Benchmarks
Frederick C. Wong,Richard Martin,Remzi H. Arpaci-Dusseau,David E. Culler +3 more
- 01 Jan 1999
TL;DR: It is shown that the communication protocols used by the MPI runtime library influence communication performance in applications, and that the benchmark codes have a wide spectrum of communication requirements.