Title :
Compute and data management strategies for grid deployment of high throughput protein structure studies
Author :
Stokes-Rees, Ian ; Sliz, Piotr
Author_Institution :
Med. Sch., Dept. of Biol. Chem. & Mol. Pharmacology, Harvard Univ., Boston, MA, USA
Abstract :
The study of macromolecular protein structures at atomic resolution is the source of many data- and compute-intensive challenges, from simulation to image processing to model building. We have developed a general platform for the secure deployment of structural biology computational tasks and workflows into a federated grid, which maximizes robustness, ease of use, and performance while minimizing data movement. This platform leverages several existing grid technologies for security and web-based data access, and adds protocols for data staging at the level of the virtual organization (VO), user, task, workflow, and individual job. We present the strategies used to deploy and maintain tens of GB of data and applications across a significant portion of the US Open Science Grid, and the workflow management mechanisms used to optimize task execution for both performance and correctness. Significant observations are made about real operating conditions in a grid environment, drawn from automated analysis of hundreds of thousands of jobs over extended periods. We focus specifically on one novel application that harnesses the capacity of national cyberinfrastructure to dramatically accelerate protein structure determination. This workflow requires 20,000 to 50,000 hours of computation spread over 1e5 tasks, takes tens of GB of input data, and produces a commensurate volume of output. We demonstrate the effectiveness of our platform by completing this workflow in half a day on the Open Science Grid.
Keywords :
Internet; biology computing; data visualisation; grid computing; information retrieval; molecular biophysics; proteins; security of data; workflow management software; US open science grid; Web-based data access; atomic resolution; cyberinfrastructure; data management; data security; grid deployment; high throughput protein structure study; image processing; individual job data staging; macromolecular protein structure; protocol; structural biology computational task; workflow management
Conference_Title :
Many-Task Computing on Grids and Supercomputers (MTAGS), 2010 IEEE Workshop on
Conference_Location :
New Orleans, LA
Print_ISBN :
978-1-4244-9704-1
Electronic_ISBN :
978-1-4244-9705-8
DOI :
10.1109/MTAGS.2010.5699426