Checkpoint/Restart Support for LAM/MPI
The ability to stop a running parallel application and start it at a
later time is desirable, whether for fault tolerance, system
maintenance, or scheduling policy reasons. Although
checkpoint/restart facilities exist for many supercomputer platforms,
few exist for Linux clusters. This project aims to add support to
LAM/MPI for a distributed kernel-level checkpoint/restart system. The LAM/MPI
development is in collaboration with Lawrence Berkeley National
Laboratory, where the kernel-level checkpoint system is being
developed. The project will allow checkpoint/restart of LAM/MPI
applications with complete user transparency.