Checkpoint/Restart Support for LAM/MPI

The ability to stop a running parallel application and restart it later is desirable for fault tolerance, system maintenance, and scheduling policy reasons. Although checkpoint/restart facilities exist for many supercomputer platforms, few exist for Linux clusters. This project aims to add support to LAM/MPI for a distributed kernel-level checkpoint/restart system. The LAM/MPI work is being carried out in collaboration with Lawrence Berkeley National Laboratory, where the kernel-level checkpoint system is being developed. The project will allow LAM/MPI applications to be checkpointed and restarted in a way that is completely transparent to the user.


Last revised October 25, 2002
URL: http://www.research-indiana.org/iu_checkpoint.html
Copyright 2002, The Trustees of Indiana University
Comments: research@indiana.edu