DocumentCode
583246
Title
LSH-Div: Species diversity estimation using locality sensitive hashing
Author
Rasheed, Zeehasham ; Rangwala, Huzefa ; Barbará, Daniel
Author_Institution
Dept. of Comput. Sci., George Mason Univ., Fairfax, VA, USA
fYear
2012
fDate
4-7 Oct. 2012
Firstpage
1
Lastpage
6
Abstract
Metagenome sequencing projects attempt to determine the collective DNA of organisms that co-exist as communities across different environments. Computational approaches analyze the large volumes of sequence data obtained from these ecological samples to provide an understanding of species diversity, content, and abundance. In this work we present a scalable species diversity estimation algorithm that achieves computational efficiency by using locality sensitive hashing (LSH). Using fixed-length, gapless subsequences, we improve the sensitivity of pairwise sequence comparisons. Using the LSH-based function, we first group similar sequences into bins, commonly referred to as operational taxonomic units (OTUs), and then compute several species diversity/richness metrics. We evaluate the algorithm on synthetic data and on eight targeted metagenome samples obtained from seawater, and compare our results to three state-of-the-art diversity estimation algorithms. We demonstrate the strength of our approach in terms of computational runtime and effective OTU assignments. The source code for LSH-Div is available at the supplementary website under the GNU GPL license. Supplementary material is available at http://www.cs.gmu.edu/~mlbio/LSH-DIV.
Keywords
DNA; bioinformatics; ecology; genomics; learning (artificial intelligence); molecular biophysics; search problems; DNA; LSH algorithm; LSH-Div; OTU; computational efficiency; ecological samples; fixed length gapless subsequences; locality sensitive hashing algorithm; metagenome sequencing projects; operational taxonomic units; pairwise sequence comparisons; scalable species diversity estimation algorithm; seawater metagenome samples; sequence data analysis; species abundance; species content; species diversity metrics; species richness metrics; Computer science; DNA; Educational institutions; Estimation; Indexes; Measurement; Standards; 16S metagenomics; clustering; species diversity;
fLanguage
English
Publisher
IEEE
Conference_Titel
Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on
Conference_Location
Philadelphia, PA
Print_ISBN
978-1-4673-2559-2
Electronic_ISBN
978-1-4673-2558-5
Type
conf
DOI
10.1109/BIBM.2012.6392649
Filename
6392649