Title :
Feliss: Flexible distributed computing framework with light-weight checkpointing
Author :
Araki, Takeshi ; Narita, Kazuyo ; Tamano, Hiroshi
Author_Institution :
Cloud Syst. Res. Labs., NEC Corp., Japan
Abstract :
Current distributed computing frameworks, such as MapReduce and Spark, restrict programmers to a limited set of operations defined by the framework. As a result, algorithms that do not fit the framework cannot be expressed efficiently. This restriction arises from the need for fault tolerance: these frameworks recover lost data by recomputing it from available data when a fault occurs, and this mechanism works correctly only if programs use the operations provided by the system. Checkpointing is an alternative fault-tolerance method. Because it achieves fault tolerance by saving memory contents, it places no such limitation on operations; however, saving a memory image is costly. To overcome this trade-off, we propose a lightweight checkpointing method called continuation-based checkpointing, which enables low-overhead fault tolerance without restricting operations. It saves only the information necessary for restarting, which significantly reduces the cost of checkpointing. Using this method, we implemented a distributed computing framework called Feliss, which includes an improved MapReduce without the above restriction and a message passing interface (MPI) subset. We evaluated Feliss with various applications and showed that an order-of-magnitude speedup can be attained for applications that cannot be expressed efficiently in current frameworks.
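The general idea of saving only the information needed for restarting, rather than a full memory image, can be illustrated with a minimal sketch. This is not the Feliss implementation or its API: the functions `save_checkpoint`, `load_checkpoint`, and `run`, the JSON file format, and the `fail_at` parameter are all hypothetical names chosen for illustration. The sketch assumes a simple sequential loop whose "continuation" is just the next index to process plus a running total.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Persist only the restart information: the next step and a small state dict."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved restart information, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(data, path, fail_at=None):
    """Sum the squares of `data`, resuming from the checkpoint at `path` if present.

    `fail_at` is a hypothetical hook that simulates a fault before processing
    the element at that index.
    """
    ckpt = load_checkpoint(path)
    if ckpt is None:
        i, total = 0, 0
    else:
        i, total = ckpt["step"], ckpt["state"]["total"]

    while i < len(data):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated fault")
        total += data[i] ** 2
        i += 1
        # The checkpoint holds two integers, not the whole input or heap.
        save_checkpoint(path, i, {"total": total})
    return total


if __name__ == "__main__":
    ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    values = [1, 2, 3, 4, 5]
    try:
        run(values, ckpt_path, fail_at=3)
    except RuntimeError:
        pass  # a fault occurred; the checkpoint on disk records where to resume
    print(run(values, ckpt_path))
```

In this sketch the second call to `run` skips the elements already processed and continues from the saved index. The cost of each checkpoint depends on the size of the restart state, not on the size of the program's memory.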
Keywords :
fault tolerant computing; message passing; Feliss; MPI subset; MapReduce; Spark; continuation-based checkpointing; fault-tolerance method; flexible distributed computing framework; lightweight checkpointing method; memory contents; memory image; message passing interface subset; Checkpointing; Data structures; Fault tolerance; Fault tolerant systems; Libraries; Servers; Sparks;
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691566