• DocumentCode
    890728
  • Title
    A Study of Hierarchical and Flat Classification of Proteins
  • Author
    Zimek, Arthur ; Buchwald, Fabian ; Frank, Eibe ; Kramer, Stefan
  • Author_Institution
    Inst. fuer Inf., Lehr- und Forschungseinheit fuer Datenbanksysteme, Ludwig-Maximilians-Univ. Muenchen, Muenchen, Germany
  • Volume
    7
  • Issue
    3
  • fYear
    2010
  • Firstpage
    563
  • Lastpage
    571
  • Abstract
    Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article, we investigate empirically whether this is the case for two such hierarchies. We compare multiclass classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multiclass settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data but not in the case of the protein classification problems. Based on this, we recommend that strong flat multiclass methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.
  • Keywords
    biological techniques; biology computing; enzymes; molecular biophysics; support vector machines; enzyme classification; fold recognition; homology detection data; multiclass settings; nested dichotomy; protein flat classification; protein hierarchical classification; Biology and genetics; Classifier design and evaluation; Data mining; Machine learning; Protein classification; Sciences; hierarchical classification; multiclass classification; Algorithms; Artificial Intelligence; Computing Methodologies; Molecular Sequence Data; Pattern Recognition, Automated; Protein Folding; Proteins
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2008.104
  • Filename
    4641909