• DocumentCode
    108193
  • Title

    Extended Subtree: A New Similarity Function for Tree Structured Data

  • Author

    Shahbazi, Ali ; Miller, Jason

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Alberta, Edmonton, AB, Canada
  • Volume
    26
  • Issue
    4
  • fYear
    2014
  • fDate
    Apr-14
  • Firstpage
    864
  • Lastpage
    877
  • Abstract
    Although several distance or similarity functions for trees have been introduced, their performance is not always satisfactory in many applications, ranging from document clustering to natural language processing. This research proposes a new similarity function for trees, namely Extended Subtree (EST), where a new subtree mapping is proposed. EST generalizes the edit base distances by providing new rules for subtree mapping. Further, the new approach seeks to resolve the problems and limitations of previous approaches. Extensive evaluation frameworks are developed to evaluate the performance of the new approach against previous proposals. Clustering and classification case studies utilizing three real-world and one synthetic labeled data sets are performed to provide an unbiased evaluation where different distance functions are investigated. The experimental results demonstrate the superior performance of the proposed distance function. In addition, an empirical runtime analysis demonstrates that the new approach is one of the best tree distance functions in terms of runtime efficiency.
  • Keywords
    data structures; document handling; natural language processing; pattern clustering; EST; distance functions; document clustering; extended subtree; natural language processing; similarity function; subtree mapping; synthetic labeled data sets; tree structured data; Complexity theory; Distance measurement; Entropy; Indexes; Runtime; Support vector machines; Tree classification; tree clustering; tree comparison; tree distance function; tree similarity; tree structured data;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2013.53
  • Filename
    6487504