Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems

doi:10.1109/ICPP.2009.73

Proceedings Article•10.1109/ICPP.2009.73

Xiangyong Ouyang, +2 more

- 22 Sep 2009

- pp 34-41

18

TL;DR: This work explores the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as its checkpointing library, and proposes optimizing checkpoint creation by classifying checkpoint file writes as small, medium, or large according to the amount of data written, then applying write aggregation to the small and medium writes.
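
The classification-and-aggregation idea in the summary above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the thresholds `SMALL_MAX` and `MEDIUM_MAX`, the buffer size, and the `WriteAggregator` class are hypothetical names and values chosen only for illustration, and the sketch handles a single write stream rather than aggregating across the processes of a node.

```python
import io

# Hypothetical thresholds and buffer size; the summary above does not state them.
SMALL_MAX = 4 * 1024          # writes up to this size count as "small"
MEDIUM_MAX = 256 * 1024       # writes up to this size count as "medium"
BUFFER_SIZE = 1024 * 1024     # capacity of the aggregation buffer


def classify(nbytes):
    """Classify a write by the number of bytes it carries."""
    if nbytes <= SMALL_MAX:
        return "small"
    if nbytes <= MEDIUM_MAX:
        return "medium"
    return "large"


class WriteAggregator:
    """Coalesce small and medium writes; pass large writes straight through."""

    def __init__(self, sink, buffer_size=BUFFER_SIZE):
        self.sink = sink              # any object with a write(bytes) method
        self.buffer = bytearray()
        self.buffer_size = buffer_size
        self.sink_writes = 0          # number of writes actually issued to the sink

    def write(self, data):
        if classify(len(data)) == "large":
            # Flush first so the byte order in the sink matches the call order.
            self.flush()
            self._emit(data)
            return
        if len(self.buffer) + len(data) > self.buffer_size:
            self.flush()
        self.buffer += data

    def flush(self):
        """Issue everything buffered so far as one write."""
        if self.buffer:
            self._emit(bytes(self.buffer))
            self.buffer.clear()

    def _emit(self, data):
        self.sink.write(data)
        self.sink_writes += 1


if __name__ == "__main__":
    sink = io.BytesIO()
    agg = WriteAggregator(sink)
    for _ in range(100):
        agg.write(b"x" * 100)         # 100 small writes, buffered
    agg.write(b"y" * (300 * 1024))    # one large write, passed through
    agg.flush()
    print(agg.sink_writes, "writes reached the sink")
```

In this sketch the benefit comes from replacing many small writes with one larger write, while large writes bypass the buffer because copying them would add cost without reducing the number of operations.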

Citations

•Proceedings Article•10.5555/2388996.2389020

McrEngine: a scalable checkpointing system using data-aware aggregation and compression

Tanzima Islam, +5 more

- 10 Nov 2012

TL;DR: McrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries such as HDF5 and netCDF, and then compresses them.

81

Proceedings Article•10.1109/SNAPI.2010.10

Enhancing Checkpoint Performance with Staging IO and SSD

Xiangyong Ouyang, +2 more

- 03 May 2010

TL;DR: A new strategy is proposed to enhance checkpoint writing performance by aggregating checkpoint writes at the client side and using staging IO on the data servers; it achieves up to 6.3 times higher write bandwidth than PVFS2, a popular parallel file system, with 8 client nodes and 4 data servers.

41

Proceedings Article•10.1109/CLUSTER.2010.20

RDMA-Based Job Migration Framework for MPI over InfiniBand

Xiangyong Ouyang, +3 more

- 20 Sep 2010

TL;DR: This paper enhances the fault tolerance of MVAPICH2, an open-source high performance MPI-2 implementation, by using a proactive job migration scheme that transfers the processes running on a health-deteriorating node to a healthy spare node, and resumes these processes from the spare node.

28

Proceedings Article•10.1109/HIPC.2009.5433218

Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture

Xiangyong Ouyang, +3 more

- 01 Dec 2009

TL;DR: The Write Aggregation with Dynamic Buffer and Interleaving scheme is proposed to significantly reduce checkpoint creation overhead by aggregating all checkpoint writes into a dynamic buffer pool and overlapping application progress with the file writes.

22

•Proceedings Article•10.1109/ICPP.2011.85

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

Xiangyong Ouyang, +5 more

- 13 Sep 2011

TL;DR: This paper proposes the Checkpoint-Restart File System (CRFS), a lightweight user-level filesystem based on FUSE, described as the first such portable and lightweight filesystem designed for generic Checkpoint/Restart data.

22

References

•Journal Article•10.1088/1742-6596/46/1/067

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

Paul Hargrove, +1 more

- 01 Sep 2006

TL;DR: The motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI, are described.

486

•Proceedings Article•10.5555/762761.762815

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

George Bosilca, +11 more

- 16 Nov 2002

TL;DR: This work presents MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging, and presents a detailed performance evaluation of every component and its global performance for non-trivial parallel applications.

338

•Report•10.2172/891617

The design and implementation of Berkeley Lab's Linux checkpoint/restart

Jason Duell

- 30 Apr 2005

- Lawrence Berkeley National Laboratory

TL;DR: BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.

288

Proceedings Article•10.1109/IPDPS.2007.370605

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI

Joshua Hursey, +3 more

- 26 Mar 2007

TL;DR: The design and implementation of an infrastructure to support checkpoint/restart fault tolerance in the Open MPI project is presented; the framework is meant to be extensible and to encourage experimentation with alternative techniques within a production-quality MPI implementation.

210

•Proceedings Article•10.1145/331532.331573

Architectural Requirements and Scalability of the NAS Parallel Benchmarks

Frederick C. Wong, +3 more

- 01 Jan 1999

TL;DR: It is shown that the communication protocols used by the MPI runtime library influence communication performance in applications, and that the benchmark codes have a wide spectrum of communication requirements.

126