• DocumentCode
    2310543
  • Title
    Compute and data management strategies for grid deployment of high throughput protein structure studies
  • Author
    Stokes-Rees, Ian ; Sliz, Piotr
  • Author_Institution
    Med. Sch., Dept. of Biol. Chem. & Mol. Pharmacology, Harvard Univ., Boston, MA, USA
  • fYear
    2010
  • fDate
    15 Nov. 2010
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    The study of macromolecular protein structures at atomic resolution is the source of many data- and compute-intensive challenges, from simulation, to image processing, to model building. We have developed a general platform for the secure deployment of structural biology computational tasks and workflows into a federated grid which maximizes robustness, ease of use, and performance, while minimizing data movement. This platform leverages several existing grid technologies for security and web-based data access, adding protocols for VO, user, task, workflow, and individual job data staging. We present the strategies used to deploy and maintain tens of GB of data and applications to a significant portion of the US Open Science Grid, and the workflow management mechanisms to optimize task execution, both for performance and correctness. Significant observations are made about real operating conditions in a grid environment from automated analysis of hundreds of thousands of jobs over extended periods. We specifically focus on one novel application which harnesses the capacity of national cyberinfrastructure to dramatically accelerate the process of protein structure determination. This workflow requires 20,000 to 50,000 hours of computation across 1e5 tasks, requiring tens of GB of input data, and producing commensurate output. We demonstrate the success of our platform through the completion of this workflow in half a day using Open Science Grid.
  • Keywords
    Internet; biology computing; data visualisation; grid computing; information retrieval; molecular biophysics; proteins; security of data; workflow management software; US open science grid; Web-based data access; atomic resolution; cyberinfrastructure; data management; data security; grid deployment; high throughput protein structure study; image processing; individual job data staging; macromolecular protein structure; protocol; structural biology computational task; workflow management;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Many-Task Computing on Grids and Supercomputers (MTAGS), 2010 IEEE Workshop on
  • Conference_Location
    New Orleans, LA
  • Print_ISBN
    978-1-4244-9704-1
  • Electronic_ISBN
    978-1-4244-9705-8
  • Type
    conf
  • DOI
    10.1109/MTAGS.2010.5699426
  • Filename
    5699426