DocumentCode :
1684458
Title :
Supporting fault-tolerance in streaming grid applications
Author :
Zhu, Qian ; Chen, Liang ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH
fYear :
2008
Firstpage :
1
Lastpage :
12
Abstract :
This paper considers the problem of supporting and efficiently implementing fault-tolerance for tightly-coupled and pipelined applications, especially streaming applications, in a grid environment. We provide an alternative to basic checkpointing and use the notion of light-weight summary structure(LSS) to enable efficient failure-recovery. The idea behind LSS is that at certain points during the execution of a processing stage, the state of the program can be summarized by a small amount of memory. This allows us to store copies of LSS for enabling failure-recovery, which causes low overhead fault-tolerance. Our work can be viewed as an optimization and adaptation of the idea of application-level checkpointing to a different execution environment, and for a different class of applications. Our implementation and evaluation of LSS based failure- recovery has been in the context of the GATES (grid-based adaptive execution on streams) middleware. An observation we use for providing very low overhead support for fault-tolerance is that algorithms analyzing data streams are only allowed to take a single pass over data, which means they only perform approximate processing. Therefore, we believe that in supporting fault-tolerant execution for these applications, it is acceptable to not analyze a small number of packets of data during failure-recovery. We show how we perform failure-recovery and also demonstrate how we could use additional buffers to limit data loss during the recovery procedure. We also present an efficient algorithm for allocating a new computation resource for failure-recovery at runtime. We have extensively evaluated our implementation using three stream data processing applications, and shown that the use of LSS allows effective and low-overhead failure-recovery.
Keywords :
grid computing; middleware; software fault tolerance; failure-recovery; fault-tolerance; grid-based adaptive execution on streams middleware; light-weight summary structure; streaming grid applications; Algorithm design and analysis; Checkpointing; Data analysis; Data processing; Failure analysis; Fault tolerance; Middleware; Performance analysis; Resource management; Runtime;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on
Conference_Location :
Miami, FL
ISSN :
1530-2075
Print_ISBN :
978-1-4244-1693-6
Electronic_ISBN :
1530-2075
Type :
conf
DOI :
10.1109/IPDPS.2008.4536296
Filename :
4536296
Link To Document :
بازگشت