• DocumentCode
    3723100
  • Title

    Fully-Automatic XML Clustering by Structure-Constrained Phrases

  • Author

    Gianni Costa;Riccardo Ortale

  • Author_Institution
    ICAR Inst., Rende, Italy
  • fYear
    2015
  • Firstpage
    146
  • Lastpage
    153
  • Abstract
    Conventional approaches to XML clustering by content and structure are generally affected by a limitation due to the adoption of the bag-of-word model for the representation of their textual contents. This choice may lead to consider structure-constrained textual items of separate XML documents as related, even though the actual meaning of such items in their respective contexts is different. To overcome such a limitation, we propose XML clustering by structure-constrained phrases. The latter is a previously unexplored method relying on the more accurate bag-of-phrase model of the XML textual content, with which to better preserve the meaning of the structure-constrained content items for improved clustering effectiveness. In order to conduct an in-depth and systematic study of the effectiveness of the proposed method, we develop a parameter-free prototypical approach to XML partitioning, which projects the XML documents into a space of XML features representing fixed-length sequences of adjacent textual items in the context of root-to-leaf paths. Feature selection without any tunable threshold is used to choose a subset of the XML features on the basis of their relevance to clustering, which is assessed through a new scoring scheme. A comparative experimentation on real-world benchmark XML corpora reveals a higher effectiveness than several state-of-the-art competitors.
  • Keywords
    "XML","Vegetation","Context","Data models","Information retrieval","Benchmark testing","Electronic mail"
  • Publisher
    ieee
  • Conference_Titel
    Tools with Artificial Intelligence (ICTAI), 2015 IEEE 27th International Conference on
  • ISSN
    1082-3409
  • Type

    conf

  • DOI
    10.1109/ICTAI.2015.34
  • Filename
    7372130