Regional Information Center for Science and Technology (RICEST) - An efficient parallel approach for identifying protein families in large-scale metagenomic data sets

DocumentCode :

3114791

Title :

An efficient parallel approach for identifying protein families in large-scale metagenomic data sets

Author :

Wu, Changjun ; Kalyanaraman, Ananth

Author_Institution :

Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA, USA

fYear :

2008

fDate :

15-21 Nov. 2008

Firstpage :

Lastpage :

Abstract :

Metagenomics is the study of environmental microbial communities using state-of-the-art genomic tools. Recent advancements in high-throughput technologies have enabled the accumulation of large volumes of metagenomic data whose generation was deemed impractical until a couple of years ago. A primary bottleneck, however, is the lack of scalable algorithms and open source software for large-scale data processing. In this paper, we present the design and implementation of a novel parallel approach to identify protein families from large-scale metagenomic data. Given a set of peptide sequences, we reduce the problem to one of detecting arbitrarily sized dense subgraphs in bipartite graphs. Our approach efficiently parallelizes this task on a distributed memory machine through a combination of divide-and-conquer and combinatorial pattern matching heuristic techniques. We present performance and quality results from extensive testing of our implementation on 160 K randomly sampled sequences from the CAMERA environmental sequence database using 512 nodes of a BlueGene/L supercomputer.

Keywords :

biology computing; distributed memory systems; divide and conquer methods; genomics; graph theory; microorganisms; molecular biophysics; parallel algorithms; pattern matching; proteins; sequences; arbitrarily-sized dense subgraph detection; bipartite graph; combinatorial pattern matching heuristic technique; distributed memory machine; divide-and-conquer technique; environmental microbial community; large-scale metagenomic data set; parallel approach; peptide sequence; protein family identification; state-of-the-art genomic tool; Bioinformatics; Bipartite graph; Data processing; Genomics; Large-scale systems; Open source software; Peptides; Proteins; Sequences; Software algorithms;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for

Conference_Location :

Austin, TX

Print_ISBN :

978-1-4244-2834-2

Electronic_ISBN :

978-1-4244-2835-9

Type :

conf

DOI :

10.1109/SC.2008.5214891

Filename :

5214891

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3114791