A diskless checkpointing approach for failure recovery in multiprocessor safety-critical embedded systems

Backward recovery is one of the most important techniques for error recovery in safety-critical systems, and it usually relies on nonvolatile memory to store checkpoints. Since storing checkpoints on hard disks, a form of nonvolatile memory, imposes considerable timing overhead on the system, diskless checkpointing is an attractive option for low-cost fault tolerance in parallel and distributed systems. This paper proposes an algorithm that can recover a multiprocessor system from failure when up to half of the processors fail simultaneously. In contrast to many existing works, each processor in the presented work can run more than one task. By grouping tasks and coding checkpoints, the algorithm also eliminates the need for hard disks or other nonvolatile storage to hold checkpoints. Simulation results show that the proposed algorithm can recover the system when up to half of the processors fail simultaneously, without using any extra dedicated checkpointing processor. Compared to existing approaches, the presented method also requires fewer processors.
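The general idea of coding checkpoints within a group can be illustrated with a minimal sketch. The abstract does not specify the coding scheme, so the example below assumes plain XOR parity, a common choice in diskless checkpointing, and is not the paper's algorithm. The function names and the sample checkpoint contents are hypothetical. XOR parity over one group tolerates only a single lost checkpoint per group, so it does not by itself reach the half-of-processors bound claimed above.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: list[bytes]) -> bytes:
    """Parity checkpoint for one group: XOR of all member checkpoints."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical group of three equal-length task checkpoints held in memory.
group = [b"task-A01", b"task-B02", b"task-C03"]
parity = encode_parity(group)

# Simulate losing the second checkpoint and rebuilding it without any disk.
rebuilt = recover([group[0], group[2]], parity)
assert rebuilt == group[1]
```

In this sketch the parity buffer would live in the memory of another processor in the system rather than on disk, which is what makes the approach diskless. Where the parity is placed and how tasks are assigned to groups are design decisions that the abstract leaves open.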