DocumentCode
2310543
Title
Compute and data management strategies for grid deployment of high throughput protein structure studies
Author
Stokes-Rees, Ian ; Sliz, Piotr
Author_Institution
Med. Sch., Dept. of Biol. Chem. & Mol. Pharmacology, Harvard Univ., Boston, MA, USA
fYear
2010
fDate
15-15 Nov. 2010
Firstpage
1
Lastpage
6
Abstract
The study of macromolecular protein structures at atomic resolution is the source of many data- and compute-intensive challenges, from simulation, to image processing, to model building. We have developed a general platform for the secure deployment of structural biology computational tasks and workflows into a federated grid, which maximizes robustness, ease of use, and performance while minimizing data movement. This platform leverages several existing grid technologies for security and web-based data access, adding protocols for virtual organization (VO), user, task, workflow, and individual job data staging. We present the strategies used to deploy and maintain tens of GB of data and applications across a significant portion of the US Open Science Grid, and the workflow management mechanisms used to optimize task execution for both performance and correctness. We report significant observations about real operating conditions in a grid environment, drawn from automated analysis of hundreds of thousands of jobs over extended periods. We focus specifically on one novel application that harnesses the capacity of national cyberinfrastructure to dramatically accelerate the process of protein structure determination. This workflow requires 20,000 to 50,000 hours of computation across 1e5 tasks, takes tens of GB of input data, and produces commensurate output. We demonstrate the effectiveness of our platform through the completion of this workflow in half a day using Open Science Grid.
Keywords
Internet; biology computing; data visualisation; grid computing; information retrieval; molecular biophysics; proteins; security of data; workflow management software; US open science grid; Web-based data access; atomic resolution; cyberinfrastructure; data management; data security; grid deployment; high throughput protein structure study; image processing; individual job data staging; macromolecular protein structure; protocol; structural biology computational task; workflow management;
fLanguage
English
Publisher
IEEE
Conference_Titel
Many-Task Computing on Grids and Supercomputers (MTAGS), 2010 IEEE Workshop on
Conference_Location
New Orleans, LA
Print_ISBN
978-1-4244-9704-1
Electronic_ISBN
978-1-4244-9705-8
Type
conf
DOI
10.1109/MTAGS.2010.5699426
Filename
5699426