• DocumentCode
    1464050
  • Title
    A Framework for Scalable Genome Assembly on Clusters, Clouds, and Grids
  • Author
    Moretti, Christopher ; Thrasher, Andrew ; Yu, Li ; Olson, Michael ; Emrich, Scott ; Thain, Douglas
  • Author_Institution
    Princeton Univ., Princeton, NJ, USA
  • Volume
    23
  • Issue
    12
  • fYear
    2012
  • Firstpage
    2189
  • Lastpage
    2197
  • Abstract
    Bioinformatics researchers need efficient means to process large collections of genomic sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces the Scalable Assembler at Notre Dame (SAND) framework that can achieve significant speedup using large numbers of commodity machines harnessed from clusters, clouds, and grids. SAND interfaces with the Celera open-source assembly toolkit, replacing two independent sequential modules with scalable parallel alternatives: the candidate selector exploits distributed memory capacity, and the sequence aligner exploits distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency. We show results for several data sets ranging from 738 thousand to over 320 million alignments using resources ranging from a small cluster to more than a thousand nodes spanning three institutions.
  • Keywords
    bioinformatics; cloud computing; genomics; grid computing; pattern clustering; user interfaces; Celera open-source assembly toolkit; SAND interfaces; Scalable Assembler at Notre Dame framework; bioinformatics researchers; candidate selector; clouds; clusters; commodity machines; data management; distributed memory capacity; genomic sequence data; grids; independent sequential modules; parallelization; scalable genome assembly; scalable parallel alternatives; task management; Bioinformatics; Biomedical informatics; Cloud computing; Distributed processing; Genomics; Random access memory; Distributed systems; bioinformatics; genome assembly
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2012.80
  • Filename
    6165266