Title :
Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism
Author :
Cabello, Uriel ; Rodriguez, Jose ; Meneses, Amilcar ; Mendoza, Sergio ; Decouchant, Dominique
Author_Institution :
Dept. of Comput. Sci., Center of Res. & Adv. Studies, Mexico City, Mexico
Date :
Sept. 29 2014-Oct. 3 2014
Abstract :
The grid computing paradigm consists of multiple heterogeneous distributed clusters connected by heterogeneous network interfaces. One advantage of this paradigm is the ability to analyze massive amounts of data using computing resources at different geographic locations and on different platforms. However, to harness the power of those resources, many problems must be solved. In this work, we address the problem of fault tolerance in heterogeneous computer systems. Our proposal aims to ease recovery when system failures are detected at runtime, avoiding the need for application restarts. It works through a set of services that perform transparent task migration across the computing nodes, hiding the complexity of error handling when a hybrid programming model based on Open MPI and OpenCL is employed.
Keywords :
fault tolerant computing; grid computing; parallel programming; Open MPI programming; OpenCL programming; data analysis; error handling; fault tolerance; grid computing paradigm; heterogeneous computer systems; heterogeneous distributed clusters; heterogeneous multi-cluster systems; heterogeneous network interfaces; hybrid programming model; task migration mechanism; Computational modeling; Fault tolerance; Fault tolerant systems; Kernel; Programming; Proposals;
Conference_Title :
Electrical Engineering, Computing Science and Automatic Control (CCE), 2014 11th International Conference on
Conference_Location :
Campeche
Print_ISBN :
978-1-4799-6228-0
DOI :
10.1109/ICEEE.2014.6978266