DocumentCode :
3114791
Title :
An efficient parallel approach for identifying protein families in large-scale metagenomic data sets
Author :
Wu, Changjun ; Kalyanaraman, Ananth
Author_Institution :
Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA, USA
fYear :
2008
fDate :
15-21 Nov. 2008
Firstpage :
1
Lastpage :
10
Abstract :
Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data that, until a couple of years ago, was deemed impractical to generate. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily-sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results from extensively testing our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.
Keywords :
biology computing; distributed memory systems; divide and conquer methods; genomics; graph theory; microorganisms; molecular biophysics; parallel algorithms; pattern matching; proteins; sequences; arbitrarily-sized dense subgraph detection; bipartite graph; combinatorial pattern matching heuristic technique; distributed memory machine; divide-and-conquer technique; environmental microbial community; large-scale metagenomic data set; parallel approach; peptide sequence; protein family identification; state-of-the-art genomic tool; Bioinformatics; Bipartite graph; Data processing; Genomics; Large-scale systems; Open source software; Peptides; Proteins; Sequences; Software algorithms;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4244-2834-2
Electronic_ISBN :
978-1-4244-2835-9
Type :
conf
DOI :
10.1109/SC.2008.5214891
Filename :
5214891