Regional Information Center for Science and Technology - Fault tolerant parallel data-intensive algorithms

DocumentCode :

2037708

Title :

Fault tolerant parallel data-intensive algorithms

Author :

Kutlu, Mucahid ; Agrawal, Gagan ; Kurt, Orhan

Author_Institution :

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

fYear :

2012

fDate :

18-22 Dec. 2012

Firstpage :

Lastpage :

Abstract :

Fault tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as the increasing number of cores is decreasing the mean time to failure of these systems. While checkpointing, including checkpointing of parallel programs such as MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Algorithm-based fault tolerance therefore offers a practical alternative, though it is less general. Although this approach has been studied for many applications, there is no existing work on algorithm-based fault tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of data-intensive algorithms. We divide the dataset into smaller data blocks and, in the replication step, distribute the replicated blocks with the aim of minimizing the maximum data intersection between any two processors. This minimizes data loss when multiple failures occur. In addition, our approach enables better load balance after a failure and decreases the amount of lost data that must be re-processed. We have evaluated our approach using two popular parallel data mining algorithms, k-means and Apriori. We show that our approach has negligible overhead when there are no failures, and that it gracefully handles different numbers of failures as well as failures at different points of processing. We also compare our approach with the MapReduce-based solution for fault tolerance, and show that we outperform Hadoop both in the absence and in the presence of failures.
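The replication idea in the abstract can be illustrated with a small sketch. This is not the authors' placement scheme, which the abstract does not specify; it is a simple round-robin placement, assumed here only for illustration, compared against a naive scheme that copies all of a processor's blocks to its next neighbor. All function names (`scatter_placement`, `chain_placement`, `max_pairwise_intersection`, `lost_blocks`) are hypothetical, and one replica per block is assumed.

```python
from itertools import combinations


def scatter_placement(num_procs, blocks_per_proc):
    """Illustrative placement (an assumption, not the paper's scheme):
    each processor's blocks are replicated round-robin across the
    other processors. Requires num_procs >= 2."""
    held = {p: set() for p in range(num_procs)}
    for p in range(num_procs):
        for j in range(blocks_per_proc):
            block = p * blocks_per_proc + j
            held[p].add(block)  # primary copy
            # Offset 1..num_procs-1, so the replica never lands on p itself.
            target = (p + 1 + j % (num_procs - 1)) % num_procs
            held[target].add(block)  # single replica
    return held


def chain_placement(num_procs, blocks_per_proc):
    """Naive baseline: every block of processor p is replicated on p+1."""
    held = {p: set() for p in range(num_procs)}
    for p in range(num_procs):
        for j in range(blocks_per_proc):
            block = p * blocks_per_proc + j
            held[p].add(block)
            held[(p + 1) % num_procs].add(block)
    return held


def max_pairwise_intersection(held):
    """Largest number of blocks that any two processors both hold."""
    return max(len(held[a] & held[b]) for a, b in combinations(held, 2))


def lost_blocks(held, failed):
    """Blocks with no surviving copy after the processors in `failed` stop."""
    all_blocks = set().union(*held.values())
    surviving = set().union(
        *(blocks for p, blocks in held.items() if p not in failed)
    )
    return all_blocks - surviving


if __name__ == "__main__":
    scatter = scatter_placement(4, 3)
    chain = chain_placement(4, 3)
    print(max_pairwise_intersection(scatter), max_pairwise_intersection(chain))
    print(sorted(lost_blocks(scatter, {0, 1})), sorted(lost_blocks(chain, {0, 1})))
```

With 4 processors and 3 blocks each, the scattered placement keeps the largest pairwise overlap smaller than the chained one, so a simultaneous failure of processors 0 and 1 loses fewer blocks. The surviving replicas of a failed processor's data are also spread over several processors rather than one, which is the intuition behind the better post-failure load balance the abstract mentions.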

Keywords :

checkpointing; data mining; message passing; parallel processing; software fault tolerance; Hadoop; MPI applications; MapReduce based solution; algorithm-based fault-tolerance; data blocks; distributed computing; fail-stop failures; fault tolerant parallel data-intensive algorithms; high-end computing; k-means; mean-time-to-failure; parallel data mining algorithms; parallel programs

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing (HiPC), 2012 19th International Conference on

Conference_Location :

Pune

Print_ISBN :

978-1-4673-2372-7

Electronic_ISBN :

978-1-4673-2370-3

Type :

conf

DOI :

10.1109/HiPC.2012.6507503

Filename :

6507503

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2037708