DocumentCode
2016451
Title
Performance under Failures of DAG-based Parallel Computing
Author
Jin, Hui ; Sun, Xian-He ; Zheng, Ziming ; Lan, Zhiling ; Xie, Bing
Author_Institution
Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL
fYear
2009
fDate
18-21 May 2009
Firstpage
236
Lastpage
243
Abstract
As the scale and complexity of parallel systems continue to grow, failures become increasingly inevitable when solving large-scale applications. In this research, we present an analytical study that estimates the execution time of directed acyclic graph (DAG) based scientific applications in the presence of failures and provides a guideline for performance optimization. The study is fourfold. We first introduce a performance model to predict individual subtask computation time under failures. Next, a layered, iterative approach is adopted to transform a DAG into a layered DAG, which reflects the full dependencies among all the subtasks. Then, the expected execution time of the DAG under failures is derived based on stochastic analysis. Unlike existing models, the newly proposed performance model provides both the variance and the distribution. It is practical and can be put to real use. Finally, based on the model, we propose performance optimization, weak point identification, and weak point enhancement. Intensive simulations with real system traces are conducted to verify the analytical findings. The results show that the newly proposed model and the weak point enhancement mechanism work well.
Keywords
directed graphs; optimisation; parallel processing; DAG-based parallel computing; directed acyclic graph; iterative approach; layered DAG; parallel system complexity; performance optimization; stochastic analysis; Analytical models; Failure analysis; Guidelines; Iterative methods; Large-scale systems; Optimization; Parallel processing; Performance analysis; Predictive models; Stochastic processes; Application Performance; Directed Acyclic Graph; Failure Modeling; Fault-Tolerance
fLanguage
English
Publisher
ieee
Conference_Titel
Cluster Computing and the Grid, 2009. CCGRID '09. 9th IEEE/ACM International Symposium on
Conference_Location
Shanghai
Print_ISBN
978-1-4244-3935-5
Electronic_ISBN
978-0-7695-3622-4
Type
conf
DOI
10.1109/CCGRID.2009.55
Filename
5071877