CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
Report ID: TR-543-97
Author: Chen, Yuqun / Li, Kai / Plank, James S.
Date: 1997-05-00
Pages: 10
Download Formats: PostScript
Abstract:
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.