DocumentCode :
583246
Title :
LSH-Div: Species diversity estimation using locality sensitive hashing
Author :
Rasheed, Zeehasham ; Rangwala, Huzefa ; Barbará, Daniel
Author_Institution :
Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
fYear :
2012
fDate :
4-7 Oct. 2012
Firstpage :
1
Lastpage :
6
Abstract :
Metagenome sequencing projects attempt to determine the collective DNA of organisms co-existing as communities across different environments. Computational approaches analyze the large volumes of sequence data obtained from these ecological samples to provide an understanding of species diversity, content and abundance. In this work we present a scalable species diversity estimation algorithm that achieves computational efficiency by using locality sensitive hashing (LSH). Using fixed-length, gapless subsequences, we improve the sensitivity of pairwise sequence comparisons. Using the LSH-based function, we first group similar sequences into bins, commonly referred to as operational taxonomic units (OTUs), and then compute several species diversity/richness metrics. The performance of our algorithm is evaluated on synthetic data and on eight targeted metagenome samples obtained from seawater. We compare our results to three state-of-the-art diversity estimation algorithms. We demonstrate the strength of our approach in terms of computational runtime and effective OTU assignments. The source code for LSH-Div is available at the supplementary website under the GNU GPL. Supplementary material is available at http://www.cs.gmu.edu/~mlbio/LSH-DIV.
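The pipeline outlined in the abstract (hash fixed-length subsequences, bin similar sequences into OTUs, then compute diversity metrics) can be illustrated with a minimal sketch. This is not the LSH-Div implementation: the abstract does not specify the hash family, so the sketch substitutes MinHash with banding as the locality sensitive scheme, and every function name and parameter value (`k`, `num_hashes`, `bands`) is a hypothetical choice made for illustration. The Shannon index and bias-corrected Chao1 estimator stand in for the "several species diversity/richness metrics" mentioned above. Sequences are assumed to be at least `k` characters long.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k=4):
    """Return the set of fixed-length, gapless subsequences (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(kmer_set, num_hashes=32):
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # one salt per hash function
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for km in kmer_set))
    return tuple(sig)


def lsh_otus(seqs, k=4, num_hashes=32, bands=8):
    """Assign an OTU label to each sequence (assumed illustrative scheme).

    Sequences whose signatures agree on at least one band land in the
    same bucket and are merged with union-find.
    """
    rows = num_hashes // bands
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    buckets = {}
    for idx, seq in enumerate(seqs):
        sig = minhash_signature(kmers(seq, k), num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            if key in buckets:
                parent[find(idx)] = find(buckets[key])
            else:
                buckets[key] = idx
    return [find(i) for i in range(len(seqs))]


def shannon_index(labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over OTU proportions."""
    counts = Counter(labels).values()
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts)


def chao1(labels):
    """Bias-corrected Chao1 richness: S_obs + F1(F1 - 1) / (2(F2 + 1))."""
    counts = Counter(labels).values()
    f1 = sum(1 for c in counts if c == 1)  # singleton OTUs
    f2 = sum(1 for c in counts if c == 2)  # doubleton OTUs
    return len(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Calling `lsh_otus(reads)` on a list of read strings returns one OTU label per read, which can be passed directly to `shannon_index` or `chao1`. Raising `bands` (with `num_hashes` fixed) makes merging more permissive, while lowering it makes the binning stricter.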
Keywords :
DNA; bioinformatics; ecology; genomics; learning (artificial intelligence); molecular biophysics; search problems; DNA; LSH algorithm; LSH-Div; OTU; computational efficiency; ecological samples; fixed length gapless subsequences; locality sensitive hashing algorithm; metagenome sequencing projects; operational taxonomic units; pairwise sequence comparisons; scalable species diversity estimation algorithm; seawater metagenome samples; sequence data analysis; species abundance; species content; species diversity metrics; species richness metrics; Computer science; DNA; Educational institutions; Estimation; Indexes; Measurement; Standards; 16S metagenomics; clustering; species diversity;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on
Conference_Location :
Philadelphia, PA
Print_ISBN :
978-1-4673-2559-2
Electronic_ISBN :
978-1-4673-2558-5
Type :
conf
DOI :
10.1109/BIBM.2012.6392649
Filename :
6392649