• DocumentCode
    1196054
  • Title
    A Lazy Data Mining Approach for Protein Classification
  • Author
    Merschmann, Luiz; Plastino, Alexandre
  • Author_Institution
    Dept. of Comput. Sci., Univ. Fed. Fluminense, Niteroi
  • Volume
    6
  • Issue
    1
  • fYear
    2007
  • fDate
    3/1/2007
  • Firstpage
    36
  • Lastpage
    42
  • Abstract
    In this work, we propose a new computational technique to solve the protein classification problem. The goal is to predict the functional family of novel protein sequences based on their motif composition. In order to improve the results obtained with other known approaches, we propose a new data mining technique for protein classification based on Bayes' theorem, called highest subset probability (HiSP). To evaluate our proposal, datasets extracted from Prosite, a curated protein family database, are used as experimental datasets. The computational results have shown that the proposed method outperforms other known methods for all tested datasets and looks very promising for problems with characteristics similar to the problem addressed here. In addition, our experiments suggest that HiSP performs well on highly imbalanced datasets.
  • Keywords
    Bayes methods; biology computing; data mining; molecular biophysics; probability; proteins; Bayes theorem; Prosite; curated protein family database; highest subset probability; lazy data mining; motif composition; protein classification; protein sequences; Amino acids; Computer science; Databases; Decision trees; Information resources; Learning automata; Learning systems; Protein sequence; Testing; lazy learning; Algorithms; Amino Acid Sequence; Database Management Systems; Databases, Protein; Information Storage and Retrieval; Molecular Sequence Data; Sequence Alignment; Sequence Analysis, Protein; Sequence Homology, Amino Acid
  • fLanguage
    English
  • Journal_Title
    NanoBioscience, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1536-1241
  • Type
    jour
  • DOI
    10.1109/TNB.2007.891910
  • Filename
    4118127