• DocumentCode
    1468359
  • Title

    Composition Vector Method Based on Maximum Entropy Principle for Sequence Comparison

  • Author

    Chan, R.H. ; Chan, T.H. ; Hau Man Yeung ; Wang, R.W.

  • Author_Institution
    Dept. of Math., Chinese Univ. of Hong Kong, Hong Kong, China
  • Volume
    9
  • Issue
    1
  • fYear
    2012
  • Firstpage
    79
  • Lastpage
    87
  • Abstract
    The composition vector (CV) method is an alignment-free method for sequence comparison. Because of its simplicity compared with multiple sequence alignment methods, the method has been widely discussed lately, and some formulas based on probabilistic models, like Hao's and Yu's formulas, have been proposed. In this paper, we improve these formulas by using the entropy principle, which can quantify the nonrandom occurrence of patterns in the sequences. More precisely, existing formulas are used to generate a set of possible formulas, from which we choose the one that maximizes the entropy. We give the closed-form solution to the resulting optimization problem. Hence, from any given CV formula, we can find the corresponding one that maximizes the entropy. In particular, we show that Hao's formula itself maximizes the entropy, and we derive a new entropy-maximizing formula from Yu's formula. We illustrate the accuracy of our new formula by using both simulated and experimental data sets. For the simulated data sets, our new formula gives the best consensus and significant values for three different kinds of evolution models. For the data set of tetrapod 18S rRNA sequences, our new formula groups the clades of bird and reptile together correctly, where Hao's and Yu's formulas failed. Using real data sets with different sizes, we show that our formula is more accurate than Hao's and Yu's formulas even for small data sets.
  • Keywords
    macromolecules; maximum entropy methods; molecular biophysics; organic compounds; physiological models; probability; Hao's formula; Yu's formula; closed-form solution; composition vector method; entropy-maximizing formula; maximum entropy principle; multiple sequence alignment methods; optimization problem; probabilistic models; sequence comparison; tetrapod 18S rRNA sequences; Bioinformatics; Computational modeling; Entropy; Estimation; Optimization; Phylogeny; Strain; Composition vector method; alignment-free sequence comparison; maximum entropy principle; optimization model; phylogenetics; Animals; Bacteria; Computational Biology; Computer Simulation; Databases, Genetic; Humans; Markov Chains; Models, Genetic; Phylogeny; Sequence Analysis, DNA
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2011.45
  • Filename
    5728790