Title :
Neural network-based taxonomic clustering for metagenomics
Author :
Essinger, Steven D. ; Polikar, Robi ; Rosen, Gail L.
Author_Institution :
Dept. of Electr. & Comput. Eng., Drexel Univ., Philadelphia, PA, USA
Abstract :
Metagenomic studies inherently involve sampling genetic information from an environment potentially containing thousands of distinctly different microbial organisms. This genetic information is sequenced, producing many short fragments (<500 base pairs (bp)), each tentatively a small representative of the DNA coding structure. Any fragment may belong to any organism in the sample, but the relationship is unknown a priori. Furthermore, most of these organisms have not been identified and consequently are not represented in any publicly available search database. Our goal is to predict the taxonomic classification of an organism from the fragments obtained from an environmental sample that may include many organisms, some previously unidentified. To elucidate the diversity and composition of the sample, we first use a supervised naïve Bayes classifier to score the fragments against known genomes, followed by unsupervised clustering to group fragments from similar organisms together. We are then free to analyze each cluster separately. This is challenging because we are interested not in similar sequences but in sequences that come from similar genomes, which are known to vary widely intra-genomically. Our dataset represents an extremely challenging scenario: clustering fragments at the phyla level, where none of the phyla have been previously seen or identified. We present two variations of our proposed approach, one based on ART and one based on K-means. We show that ART can cluster 500 bp fragments from 17 novel phyla with an overall isolation/grouping that is 10% better than K-means and nearly 7 times better than chance.
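The two-stage idea in the abstract (score each fragment with a naïve Bayes model of known genomes, then cluster the resulting score vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the features used, so k-mer counts with k = 3, Laplace smoothing, length-normalized log-likelihoods, and the synthetic toy genomes are all assumptions made here for illustration. The function names (`train_profile`, `score_fragment`, `kmeans`, `random_genome`) are hypothetical, and only the K-means variant is shown; ART is omitted for brevity.

```python
import math
import random
from collections import Counter
from itertools import product


def kmers(seq, k):
    """Yield every overlapping k-mer in a DNA string."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def train_profile(genome, k):
    """Laplace-smoothed log-probability of each k-mer in one known genome."""
    counts = Counter(kmers(genome, k))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def score_fragment(fragment, profiles, k):
    """Naive Bayes log-likelihood of a fragment under each known genome.

    Scores are averaged per k-mer (an assumption of this sketch) so that
    fragment length does not dominate the feature vector.
    """
    words = [w for w in kmers(fragment, k) if w in profiles[0]]
    n = max(len(words), 1)
    return [sum(p[w] for w in words) / n for p in profiles]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data (synthetic, for illustration only): genomes that differ in GC content.
rng = random.Random(1)


def random_genome(length, gc):
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


K = 3
known = [random_genome(5000, gc) for gc in (0.3, 0.5, 0.7)]
profiles = [train_profile(g, K) for g in known]

# Fragments from two "novel" organisms that are absent from the training set.
novel = [random_genome(5000, gc) for gc in (0.35, 0.65)]
fragments = [g[i:i + 500] for g in novel for i in range(0, 5000, 500)]

scores = [score_fragment(f, profiles, K) for f in fragments]
labels = kmeans(scores, n_clusters=2)
```

Each fragment is represented by its vector of scores against the known genomes rather than by its raw sequence, which is one way to group fragments by genome of origin instead of by sequence similarity.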
Keywords :
ART neural nets; Bayes methods; DNA; biology computing; pattern classification; pattern clustering; ART approach; DNA coding structure; K-means approach; metagenomics; microbial organisms; neural network; phyla level; supervised naïve Bayes classifier; taxonomic classification; taxonomic clustering; unsupervised clustering; Classification algorithms; Clustering algorithms; Genomics; Strain; Subspace constraints; Training
Conference_Title :
Neural Networks (IJCNN), The 2010 International Joint Conference on
Conference_Location :
Barcelona
Print_ISBN :
978-1-4244-6916-1
DOI :
10.1109/IJCNN.2010.5596644