• DocumentCode
    1278280
  • Title
    A Framework for Learning Comprehensible Theories in XML Document Classification
  • Author
    Wu, Jemma
  • Author_Institution
    Dept. of Environ. & Geogr., Macquarie Univ., Sydney, NSW, Australia
  • Volume
    24
  • Issue
    1
  • fYear
    2012
  • Firstpage
    1
  • Lastpage
    14
  • Abstract
    XML has become the universal data format for a wide variety of information systems. The large number of XML documents on the web and in other information storage systems makes classification an important task. As a typical form of semistructured data, XML documents have both structure and content. Traditional text learning techniques are poorly suited to XML document classification because they do not consider structure. This paper presents a novel, complete framework for XML document classification. We first present a knowledge representation method for XML documents based on a typed higher order logic formalism. With this method, an XML document is represented as a higher order logic term that captures both its content and its structure. We then present a decision-tree learning algorithm driven by the precision/recall breakeven point (PRDT) for the XML classification problem, which can produce comprehensible theories. Finally, we give a semi-supervised learning algorithm based on the PRDT algorithm and the co-training framework. Experimental results demonstrate that our framework achieves good performance in both supervised and semi-supervised learning, with the added benefit of producing comprehensible theories.
  • Keywords
    Internet; XML; formal logic; knowledge representation; learning (artificial intelligence); pattern classification; storage management; XML document classification; comprehensible theories; decision-tree learning algorithm; information storage systems; information systems; knowledge representation method; precision/recall breakeven point; semi-supervised learning algorithm; typed higher order logic formalism; universal data format; web; Knowledge representation; Learning systems; Machine learning; Machine learning algorithms; Supervised learning; Unsupervised learning; XML; XML document; knowledge representation; machine learning; semi-supervised learning
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2011.158
  • Filename
    5959167