Title :
An efficient parallel approach for identifying protein families in large-scale metagenomic data sets
Author :
Wu, Changjun ; Kalyanaraman, Ananth
Author_Institution :
Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA, USA
Abstract :
Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advances in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, were deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results from extensive testing of our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.
Keywords :
biology computing; distributed memory systems; divide and conquer methods; genomics; graph theory; microorganisms; molecular biophysics; parallel algorithms; pattern matching; proteins; sequences; arbitrarily-sized dense subgraph detection; bipartite graph; combinatorial pattern matching heuristic technique; distributed memory machine; divide-and-conquer technique; environmental microbial community; large-scale metagenomic data set; parallel approach; peptide sequence; protein family identification; state-of-the-art genomic tool; Bioinformatics; Bipartite graph; Data processing; Genomics; Large-scale systems; Open source software; Peptides; Proteins; Sequences; Software algorithms;
Conference_Title :
High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4244-2834-2
Electronic_ISBN :
978-1-4244-2835-9
DOI :
10.1109/SC.2008.5214891