Title :
A user-transparent recoverable file system for distributed computing environment
Author :
Kim, Hyeong S. ; Yeom, Heon Y.
Author_Institution :
Dept. of Comput. Sci. & Eng., Seoul Nat. Univ., South Korea
Abstract :
In a distributed computing environment, particularly a grid, fault tolerance is one of the core functionalities the system should provide. MPICH-GF is such a resilient system, designed to withstand external and internal failures, especially for message-passing applications in the grid environment. However, it does not protect against the loss of a valuable resource: files. In the normal case, users open files and write data to them asynchronously, and checkpointing is initiated without regard to the state of the process context. Therefore, the checkpointing system should automatically recognize the running process and protect its open files transparently. We have implemented a recoverable file system, named ReFS, which is incorporated into our fault-tolerant system MPICH-GF. ReFS is a versioning-like file system that provides middleware libraries with a system call interface to protect specific files at a given time. This prevents applications from processing their jobs with corrupted data and producing incorrect results in case of failures. We have focused not only on the reliability of the system but also on reducing the inevitable overheads. This paper describes the design and implementation of ReFS and justifies the validity of its behavior. We have developed ReFS on Linux, based on Ext2.
Keywords :
Linux; checkpointing; fault tolerant computing; grid computing; middleware; software libraries; Ext2; MPICH-GF; ReFS; checkpointing system; distributed computing; fault-tolerant system; grid environment; job processing; message passing; middleware library; system call interface; system reliability; user-transparent recoverable file system; versioning-like file system; Fault tolerant systems; File systems; Libraries; Protection; Resists;
Conference_Title :
Challenges of Large Applications in Distributed Environments, 2005. CLADE 2005. Proceedings
Print_ISBN :
0-7803-9043-1
DOI :
10.1109/CLADE.2005.1520898