Title :
PCA-based linear combinations of oligonucleotide frequencies for metagenomic DNA fragment binning
Author_Institution :
Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA
Abstract :
In this paper we investigate linear combinations of oligonucleotide (k-mer) frequencies for binning metagenomic DNA fragments of short-to-moderate length. k-mer frequencies have been widely used for gene prediction, phylogenetic tree construction, and metagenomic binning. However, they lead to a high-dimensional feature space even for a modest value of k, since there are 4^k possible k-mers. Existing methods for reducing the dimensionality of the feature space focus on particular oligonucleotide patterns or on rather small values of k. We apply principal component analysis (PCA) to the oligonucleotide frequencies, which not only reduces the feature dimensionality by a ratio higher than five but also retains the most informative features. Our experiments on simulated metagenomic data sets with four types of classifiers show that (i) the PCA-based linear combinations of k-mer frequencies capture the intrinsic characteristics of DNA fragments and can therefore adequately serve as binning features; (ii) the PCA-based linear combinations of k-mer frequencies tend to be more effective and stable as the DNA fragment length increases; and (iii) rather simple linear classifiers can achieve high accuracy for metagenomic DNA fragment binning at various taxonomic levels, even at a level as specific as species.
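The feature pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of tetranucleotides (k = 4, giving 4^4 = 256 features), the use of NumPy's SVD to compute the PCA, and the 50 retained components (picked only so that 256/50 exceeds the reduction ratio of five mentioned above) are all assumptions, and the function names `kmer_frequencies`, `pca_fit`, and `pca_transform` are hypothetical.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=4):
    """Return the normalized counts of all 4**k k-mers in seq."""
    # Fixed lexicographic ordering of k-mers over the alphabet ACGT.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing ambiguous bases such as N
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_fit(X, n_components):
    """Fit PCA on the rows of X; return the feature mean and the top components."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def pca_transform(X, mean, components):
    """Project the rows of X onto the principal components."""
    return (X - mean) @ components.T


# Toy usage with random fragments (assumed data, for illustration only).
rng = np.random.default_rng(0)
fragments = ["".join(rng.choice(list("ACGT"), size=1000)) for _ in range(60)]
X = np.array([kmer_frequencies(f, k=4) for f in fragments])  # 60 x 256
mean, components = pca_fit(X, n_components=50)
Z = pca_transform(X, mean, components)  # reduced features, one row per fragment
print(X.shape, Z.shape)
```

Each column of `Z` is a linear combination of the original k-mer frequencies, which is the kind of feature the abstract proposes to hand to a classifier. In practice the components would be fitted on training fragments from reference genomes and then reused to project unseen fragments.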
Keywords :
DNA; bioinformatics; genetics; genomics; molecular biophysics; principal component analysis; DNA fragment length; PCA-based linear combinations; k-mer frequencies; metagenomic DNA fragment binning; oligonucleotide frequency; Assembly; Bayesian methods; Frequency; Phylogeny; Sequences; Support vector machines
Conference_Title :
Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB '08. IEEE Symposium on
Conference_Location :
Sun Valley, ID
Print_ISBN :
978-1-4244-1778-0
Electronic_ISBN :
978-1-4244-1779-7
DOI :
10.1109/CIBCB.2008.4675758