Regional Information Center for Science and Technology - Comparison of Statistical Methods to Classify Environmental Genomic Fragments

DocumentCode :

1336881

Title :

Comparison of Statistical Methods to Classify Environmental Genomic Fragments

Author :

Rosen, Gail L. ; Essinger, Steven D.

Author_Institution :

Dept. of Electr. & Comput. Eng., Drexel Univ., Philadelphia, PA, USA

Volume :

Issue :

fYear :

2010

Firstpage :

310

Lastpage :

316

Abstract :

“Binning” (or taxonomic classification) of DNA sequence reads is an initial step in analyzing an environmental biological sample. Currently, a homology-based tool, BLAST, is one of the most commonly used tools to label DNA reads, but it is argued that BLAST will quickly lose its classification ability as the genome databases grow. In this paper, we compare the accuracies of a naïve Bayes classifier (NBC) and a statistical language model with that of BLAST for binning reads and demonstrate that NBC obtains good performance at low computational cost. On the other hand, the back-off n-gram language model can improve accuracy when only partial training data is available (such as in-progress sequencing projects). NBC demonstrates comparable performance to BLAST and can also be optimized on partial training datasets by adjusting the word feature size. A fivefold cross-validation is conducted to compare each method's accuracy for classifying novel genomes at different taxonomic levels, with NBC outperforming BLAST for species-level classification but BLAST outperforming NBC for genus-level and phylum-level classification. In conclusion, the NBC is a competitive taxonomic classifier, and language models can improve performance when only partial training data is available.
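
As a rough illustration of the word-feature idea mentioned in the abstract, the following is a minimal sketch of a naïve Bayes classifier over overlapping DNA words (k-mers). It is not the authors' implementation: the function names, the add-one smoothing, the uniform prior over genomes, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping word of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def train(genomes, k):
    """Build one log-probability table per label.

    genomes maps a label to a training DNA string. Add-one smoothing
    over the 4**k possible words is an assumption of this sketch.
    """
    vocab_size = 4 ** k
    models = {}
    for label, seq in genomes.items():
        counts = Counter(kmers(seq, k))
        denom = sum(counts.values()) + vocab_size
        seen = {w: math.log((c + 1) / denom) for w, c in counts.items()}
        unseen = math.log(1 / denom)  # score for words absent from training
        models[label] = (seen, unseen)
    return models


def classify(read, models, k):
    """Return the label with the highest summed log-likelihood.

    A uniform prior over labels is assumed, so the prior is omitted.
    """
    best_label, best_score = None, -math.inf
    for label, (seen, unseen) in models.items():
        score = sum(seen.get(w, unseen) for w in kmers(read, k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical, for illustration only).
genomes = {
    "AT_rich": "ATATATTTAAATATATTAAT",
    "GC_rich": "GCGCGGCCGCGCCGGCGCGC",
}
k = 3  # the "word feature size"
models = train(genomes, k)
print(classify("ATATTA", models, k))
print(classify("GCGCCG", models, k))
```

Changing `k` changes the word feature size, which the abstract reports can be tuned when only partial training data is available.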

Keywords :

Bayes methods; DNA; bioinformatics; genomics; molecular biophysics; pattern classification; BLAST; DNA sequence binning; DNA sequence taxonomic classification; back off n-gram language model; environmental genomic fragment classification; genome databases; genus level classification; homology based tool; naive Bayes classifier; phyla level classification; species level classification; statistical language model; statistical methods; word feature size; Accuracy; Bioinformatics; DNA; Genomics; Statistical learning; Taxonomy; Training; Training data; Bayesian classification; DNA; language models; metagenomics; Bayes Theorem; Databases, Genetic; Genome; Metagenomics; Models, Statistical; Peptide Fragments; Sequence Analysis, DNA;

fLanguage :

English

Journal_Title :

NanoBioscience, IEEE Transactions on

Publisher :

IEEE

ISSN :

1536-1241

Type :

jour

DOI :

10.1109/TNB.2010.2081375

Filename :

5586656

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1336881