• DocumentCode
    3394305
  • Title
    PCA-based linear combinations of oligonucleotide frequencies for metagenomic DNA fragment binning
  • Author
    Wu, Hongwei
  • Author_Institution
    Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA
  • fYear
    2008
  • fDate
    15-17 Sept. 2008
  • Firstpage
    46
  • Lastpage
    53
  • Abstract
    In this paper we investigate linear combinations of oligonucleotide (k-mer) frequencies for binning metagenomic DNA fragments of short-to-moderate length. The k-mer frequencies have been widely used for gene prediction, phylogenetic tree construction, and metagenomic binning. However, k-mer frequencies lead to a high-dimensional feature space even for a modest value of k. Existing methods for reducing the dimensionality of the feature space focus on particular oligonucleotide patterns or on rather small values of k. We apply principal component analysis (PCA) to the oligonucleotide frequencies, which not only reduces the feature dimensionality by a factor of more than five but also retains the most informative features. Our experiments on simulated metagenomic data sets with four types of classifiers show that (i) the PCA-based linear combinations of k-mer frequencies capture the intrinsic characteristics of DNA fragments and can therefore adequately serve as binning features; (ii) the PCA-based linear combinations of k-mer frequencies tend to be more effective and stable as the DNA fragment length increases; and (iii) rather simple linear classifiers can achieve high accuracy for metagenomic DNA fragment binning at various taxonomic levels, even at a level as specific as species.
  • Keywords
    DNA; bioinformatics; genetics; genomics; molecular biophysics; principal component analysis; DNA fragment length; PCA-based linear combinations; k-mer frequencies; metagenomic DNA fragment binning; oligonucleotide frequency; Assembly; Bayesian methods; Frequency; Phylogeny; Sequences; Support vector machines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB '08. IEEE Symposium on
  • Conference_Location
    Sun Valley, ID
  • Print_ISBN
    978-1-4244-1778-0
  • Electronic_ISBN
    978-1-4244-1779-7
  • Type
    conf
  • DOI
    10.1109/CIBCB.2008.4675758
  • Filename
    4675758
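
The feature pipeline described in the abstract (k-mer frequencies followed by PCA) can be illustrated with a minimal sketch. This is not the author's implementation: the choice of k = 3, the 12 retained components (64 / 12 > 5, in line with the reduction ratio quoted in the abstract), the toy fragments generated from two GC contents, and the helper names `kmer_frequencies`, `pca_fit`, `pca_transform`, and `random_fragment` are all assumptions made for illustration.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k):
    """Return normalized counts of all 4**k k-mers over A/C/G/T in seq."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # windows with N or other symbols are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def pca_fit(X, n_components):
    """Return (mean, components); component rows are the top principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project samples onto the principal axes (linear combinations of k-mer frequencies)."""
    return (X - mean) @ components.T


# Toy data (assumption): two hypothetical "genomes" that differ only in GC content.
rng = np.random.default_rng(0)


def random_fragment(length, gc):
    p = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # probabilities for A, C, G, T
    return "".join(rng.choice(list("ACGT"), size=length, p=p))


fragments = [random_fragment(1000, 0.3) for _ in range(20)]
fragments += [random_fragment(1000, 0.7) for _ in range(20)]

X = np.array([kmer_frequencies(f, 3) for f in fragments])  # 40 fragments x 64 features
mean, components = pca_fit(X, n_components=12)
Z = pca_transform(X, mean, components)                     # 40 fragments x 12 features
```

The reduced matrix `Z` would then be passed to a classifier; the abstract mentions four classifier types without naming them, so none is shown here.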