  • DocumentCode
    3205497
  • Title
    Parallel Metagenomic Sequence Clustering Via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clouds
  • Author
    Yang, Xiao; Zola, Jaroslaw; Aluru, Srinivas
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Iowa State Univ., Ames, IA, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1223
  • Lastpage
    1233
  • Abstract
    Taxonomic clustering of species is an important and frequently arising problem in metagenomics. High-throughput next-generation sequencing facilitates the creation of large metagenomic samples, while at the same time making the clustering problem harder because of the short sequence lengths it supports and the unknown species sampled. In this paper, we present a parallel algorithm for hierarchical taxonomic clustering of large metagenomic samples with support for overlapping clusters. We adapt sketching techniques originally developed for web document clustering to deduce significant similarities between pairs of sequences without resorting to expensive all-vs-all alignments. We formulate the metagenomics classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud-based implementation. Apart from solving an important problem in metagenomics, this work demonstrates the applicability of the map-reduce framework in relatively complicated algorithmic settings.
  • Keywords
    biology computing; cloud computing; genomics; graph theory; parallel algorithms; pattern clustering; hierarchical taxonomic clustering; map-reduce clouds; map-reduce framework; maximal quasi-clique enumeration; metagenomics classification problem; parallel algorithm; parallel metagenomic sequence clustering; similarity graph; similarity threshold; sketching technique; web document clustering; Clustering algorithms; Couplings; DNA; Organisms; Silicon; Strontium
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • Conference_Location
    Anchorage, AK
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Electronic_ISBN
    1530-2075
  • Type
    conf
  • DOI
    10.1109/IPDPS.2011.116
  • Filename
    6012859