Vestnik Yuzhno-Ural'skogo Universiteta. Seriya Matematicheskoe Modelirovanie i Programmirovanie, 2010, Issue 6, Pages 91–103
(Mi vyuru231)
On program restoration from checkpoints set
A. Y. Polyakov
Rzhanov Institute of Semiconductor Physics, Siberian Branch of Russian Academy of Sciences, Novosibirsk
Abstract:
This paper describes two approaches to the problem of restoring distributed programs from a set of checkpoints. A node-wide algorithm is proposed that recreates parent-child relationships and process group/session assignments at restore time. A coordinated algorithm is also designed for restoring a set of processes that spans several nodes/terminals. The described algorithms are implemented in the checkpointing package DMTCP (Distributed MultiThreaded CheckPointing).
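To make the node-wide restore problem concrete, the following is a minimal sketch of how a restore order could be planned from checkpoint metadata. It is an illustration only and is not the algorithm from the paper or DMTCP's implementation: the `ProcImage` record, the `restore_plan` function, and the textual action labels are all hypothetical. It assumes each checkpoint stores the original pid, parent pid, session id, and process group id, and it ignores the ordering constraint that a group leader must exist before other processes can join its group.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcImage:
    """Hypothetical per-process metadata saved with a checkpoint."""
    pid: int   # original process id
    ppid: int  # original parent process id
    sid: int   # original session id
    pgid: int  # original process group id


def restore_plan(images):
    """Return [(pid, actions)] ordered so every parent precedes its children."""
    by_pid = {p.pid: p for p in images}
    children = defaultdict(list)
    roots = []
    for p in images:
        if p.ppid in by_pid:
            children[p.ppid].append(p)
        else:
            # Parent is not in the checkpoint set: treat as a tree root.
            roots.append(p)

    plan = []
    # Depth-first walk; sorting by pid keeps the output deterministic.
    stack = sorted(roots, key=lambda p: p.pid, reverse=True)
    while stack:
        p = stack.pop()
        if p.ppid in by_pid:
            actions = ["fork from %d" % p.ppid]
        else:
            actions = ["start as root"]
        if p.sid == p.pid:
            actions.append("setsid()")          # session (and group) leader
        elif p.pgid == p.pid:
            actions.append("setpgid(0, 0)")     # group leader in parent's session
        else:
            actions.append("setpgid(0, %d)" % p.pgid)  # join an existing group
        plan.append((p.pid, actions))
        stack.extend(sorted(children[p.pid], key=lambda c: c.pid, reverse=True))
    return plan
```

The action strings name the POSIX calls `fork`, `setsid`, and `setpgid` that a real restorer would issue in the restored process; the sketch only plans them and does not create any processes.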
Keywords:
HPC, rollback-recovery, checkpointing, fault tolerance.
Received: 16.04.2010
Citation:
A. Y. Polyakov, “On program restoration from checkpoints set”, Vestnik YuUrGU. Ser. Mat. Model. Progr., 2010, no. 6, 91–103
Linking options:
https://www.mathnet.ru/eng/vyuru231
https://www.mathnet.ru/eng/vyuru/y2010/i6/p91