• DocumentCode
    174850
  • Title

    Classifying Bacterial Genomes with Compact Logic Formulas on k-Mer Frequencies

  • Author

    Weitschek, E. ; Cunial, F. ; Felici, G.

  • Author_Institution
    Dept. of Eng., Roma Tre Univ., Rome, Italy
  • fYear
    2014
  • fDate
    1-5 Sept. 2014
  • Firstpage
    69
  • Lastpage
    73
  • Abstract
    Alignment-free methods are routinely used in large-scale, gene-independent phylogeny reconstruction. Such methods measure the similarity of two genomes by comparing the frequencies of all their distinct substrings of length k. In this paper we apply logic data mining methods to discover a minimal subset of k-mers whose frequency information is sufficient to reliably classify bacterial genomes into the corresponding taxa. Specifically, we extract separating, disjunctive normal form logic formulas, predicated on the discretized relative frequencies of a few selected k-mers in the genomes. Such formulas are derived using a combination of feature selection, integer programming and ad hoc heuristics. Interestingly, we reliably classify strain genomes at multiple taxonomic levels using extremely compact formulas, each involving just a few k-mers. Classification performance is promising, suggesting that the phylogenetic signal of each class is strong enough and that our discretization and feature selection approach is effective and robust in identifying it.
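    The pipeline described in the abstract (relative k-mer frequencies, discretization, then a disjunctive normal form test) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cut point 0.2, the toy sequence, and the hand-written formula are all hypothetical, and literals are assumed to take the form "k-mer falls in bin b". The paper derives its formulas through feature selection and integer programming, which this sketch does not attempt.

```python
from collections import Counter


def kmer_relative_frequencies(sequence, k):
    """Relative frequency of each distinct k-mer, counted over overlapping windows."""
    sequence = sequence.upper()
    total = len(sequence) - k + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def discretize(value, cut_points):
    """Index of the interval containing value, given sorted cut points (assumed)."""
    return sum(value >= c for c in cut_points)


def satisfies_dnf(bins, formula):
    """True if at least one clause has all of its (kmer, bin) literals satisfied."""
    return any(
        all(bins.get(kmer) == b for kmer, b in clause)
        for clause in formula
    )


# Toy genome and a hypothetical two-clause formula over 2-mers.
freqs = kmer_relative_frequencies("ACGTACGT", 2)
selected = ["AC", "TA", "GG"]          # hypothetical selected k-mers
cut_points = [0.2]                     # hypothetical single threshold: bins 0 and 1
bins = {m: discretize(freqs.get(m, 0.0), cut_points) for m in selected}

# (AC in bin 1 AND TA in bin 0) OR (GG in bin 1)
formula = [[("AC", 1), ("TA", 0)], [("GG", 1)]]
in_class = satisfies_dnf(bins, formula)
```

    A genome is assigned to a class when its discretized frequencies satisfy that class's formula. Each clause touches only a handful of k-mers, which is what keeps formulas of this kind compact.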
  • Keywords
    bioinformatics; data mining; feature selection; genomics; integer programming; microorganisms; pattern classification; ad-hoc heuristics; alignment-free methods; bacterial genome classification; compact logic formulas; discretized relative frequencies; feature selection; frequency information; genome similarity measurement; integer programming; k-mer frequencies; large-scale-gene-independent phylogeny reconstruction; logic data mining methods; minimal k-mer subset discovery; multiple taxonomic levels; phylogenetic signal; separated-disjunctive normal form logic formula extraction; strain genome classification; substring frequency; Bioinformatics; Data mining; Digital multimedia broadcasting; Genomics; Microorganisms; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
  • Conference_Location
    Munich
  • ISSN
    1529-4188
  • Print_ISBN
    978-1-4799-5721-7
  • Type

    conf

  • DOI
    10.1109/DEXA.2014.30
  • Filename
    6974829