• DocumentCode
    1418809
  • Title

    pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

  • Author

    Wu, Changjun ; Kalyanaraman, Ananth ; Cannon, William R.

  • Author_Institution
    Xerox Res. Center, Webster, NY, USA
  • Volume
    23
  • Issue
    10
  • fYear
    2012
  • Firstpage
    1923
  • Lastpage
    1933
  • Abstract
    Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet, there is currently no robust support to parallelize this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm suited for detecting homology on large data sets using distributed-memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and the producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048-processor distributed-memory cluster for inputs ranging from 20,000 to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
  • Keywords
    biology computing; graphs; parallel algorithms; proteins; ubiquitous computing; computational molecular biology; large-scale protein sequence homology graphs; pGraph; parallel algorithm; parallel construction; pervasive application; protein molecules; Amino acids; Computational modeling; DNA; Dynamic programming; Image edge detection; Protein sequence; Parallel protein sequence homology detection; hierarchical master-worker paradigm; parallel sequence graph construction; producer-consumer model
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2012.19
  • Filename
    6127863