DocumentCode
3601432
Title
Efficient and Accurate OTU Clustering with GPU-Based Sequence Alignment and Dynamic Dendrogram Cutting
Author
Thuy-Diem Nguyen ; Schmidt, Bertil ; Zejun Zheng ; Chee-Keong Kwoh
Author_Institution
Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Volume
12
Issue
5
fYear
2015
Firstpage
1060
Lastpage
1073
Abstract
De novo clustering is a popular technique to perform taxonomic profiling of a microbial community by grouping 16S rRNA amplicon reads into operational taxonomic units (OTUs). In this work, we introduce a new dendrogram-based OTU clustering pipeline called CRiSPy. The key idea used in CRiSPy to improve clustering accuracy is the application of an anomaly detection technique to obtain a dynamic distance cutoff instead of using the de facto value of 97 percent sequence similarity as in most existing OTU clustering pipelines. This technique works by detecting an abrupt change in the merging heights of a dendrogram. To produce the output dendrograms, CRiSPy employs the OTU hierarchical clustering approach that is computed on a genetic distance matrix derived from an all-against-all read comparison by pairwise sequence alignment. However, most existing dendrogram-based tools have difficulty processing datasets larger than 10,000 unique reads due to high computational complexity. We address this difficulty by developing two efficient algorithms for CRiSPy: a compute-efficient GPU-accelerated parallel algorithm for pairwise distance matrix computation and a memory-efficient hierarchical clustering algorithm. Our experiments on various datasets with distinct attributes show that CRiSPy is able to produce more accurate OTU groupings than most OTU clustering applications.
Keywords
RNA; bioinformatics; genetics; genomics; graphics processing units; microorganisms; parallel algorithms; GPU-based sequence alignment; anomaly detection technique; computational complexity; compute-efficient GPU-accelerated parallel algorithm; dendrogram-based OTU clustering pipeline; dynamic dendrogram cutting; genetic distance matrix; memory-efficient hierarchical clustering algorithm; microbial community; operational taxonomic units; pairwise distance matrix computation; pairwise sequence alignment; rRNA amplicon reads; Bioinformatics; Clustering algorithms; Computational biology; Genetics; Graphics processing units; Sparse matrices; GPU-accelerated distance matrix computation; OTU clustering; dynamic dendrogram cutting; hierarchical clustering;
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
ieee
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2015.2407574
Filename
7050241
Link To Document