• DocumentCode
    2248133
  • Title

    A comparative study on two large-scale hierarchical text classification tasks' solutions

  • Author

    Zhang, Jian ; Zhao, Hai ; Lu, Bao-Liang

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Shanghai Jiao Tong Univ., Shanghai, China
  • Volume
    6
  • fYear
    2010
  • fDate
    11-14 July 2010
  • Firstpage
    3275
  • Lastpage
    3280
  • Abstract
    Patent classification is a large-scale hierarchical text classification (LSHTC) task. Although comprehensive comparisons of both learning algorithms and feature selection strategies have been made in the text categorization field, little work has been done on LSHTC tasks because of their high computational cost and complicated structural label characteristics. For the first time, this paper compares two popular learning frameworks, namely hierarchical support vector machine (SVM) and k nearest neighbor (k-NN), as applied to an LSHTC task. Experimental results show that the latter outperforms the former on this LSHTC task, which is quite different from the usual results for normal text categorization tasks. The paper then presents a comparative study of different similarity measures and ranking approaches within the k-NN framework for the LSHTC task. The conclusions are that k-NN is more appropriate for the LSHTC task than hierarchical SVM and that, for a specific LSHTC task, BM25 outperforms the other similarity measures and ListWeak performs better than the other ranking approaches. We also find an interesting phenomenon: using all the labels of the retrieved neighbors remarkably improves classification performance over using only the first label of the retrieved neighbors.
  • Keywords
    learning (artificial intelligence); patents; pattern classification; support vector machines; text analysis; BM25; ListWeak; feature selection strategies; hierarchical support vector machine; k nearest neighbor; large scale hierarchical text classification tasks; learning algorithms; patent classification; text categorization; Classification algorithms; Nearest neighbor searches; Patents; Support vector machines; Taxonomy; Text categorization; Training; Hierarchical SVM; Hierarchical text classification; Large-scale text classification; Ranking approach; Similarity measure; Text classification; comparative study; k-NN;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
  • Conference_Location
    Qingdao
  • Print_ISBN
    978-1-4244-6526-2
  • Type

    conf

  • DOI
    10.1109/ICMLC.2010.5580696
  • Filename
    5580696