• DocumentCode
    2710969
  • Title

    RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences

  • Author

    Rani, Pratibha ; Pudi, Vikram

  • Author_Institution
    Center for Data Eng., IIIT Hyderabad, Hyderabad
  • fYear
    2008
  • fDate
    15-19 Dec. 2008
  • Firstpage
    989
  • Lastpage
    994
  • Abstract
    In this paper, we present RBNBC, a repeat based Naive Bayes classifier of bio-sequences that uses maximal frequent subsequences as features. RBNBC's design is based on generic ideas that can apply to other domains where the data is organized as collections of sequences. Specifically, RBNBC uses a novel formulation of Naive Bayes that incorporates repeated occurrences of subsequences within each sequence. Our extensive experiments on two collections of protein families show that it performs as well as existing state-of-the-art probabilistic classifiers for bio-sequences. This is surprising as it is a pure data mining based generic classifier that does not require domain-specific background knowledge. We note that domain-specific ideas could further increase its performance.
  • Keywords
    Bayes methods; biology computing; data mining; pattern classification; biological sequences; data mining; domain-specific background knowledge; generic classifier; repeat based Naive Bayes classifier; state-of-the-art probabilistic classifiers; Bayesian methods; Data engineering; Data mining; Entropy; Feature extraction; Frequency estimation; Optimization methods; Proteins; Spatial databases; Support vector machines; Biological Sequence; Classification; Data Mining; Naive Bayes
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
  • Conference_Location
    Pisa
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3502-9
  • Type

    conf

  • DOI
    10.1109/ICDM.2008.66
  • Filename
    4781213