• DocumentCode
    2011699
  • Title
    Impact of Word Segmentation Errors on Automatic Chinese Text Classification
  • Author
    Luo, Xi ; Ohyama, Wataru ; Wakabayashi, Tetsushi ; Kimura, Fumitaka
  • Author_Institution
    Grad. Sch. of Eng., Mie Univ., Tsu, Japan
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    271
  • Lastpage
    275
  • Abstract
    This paper reports several sets of experiments studying the impact of word segmentation errors on automatic Chinese text classification. A comparison of four word-based approaches was carried out first. The results show that performance was significantly reduced when automatic word segmentation was used instead of manual word segmentation, which means that errors caused by automatic word segmentation have an obvious impact on classification performance. A further experiment used a character-based approach (N-gram). Although the N-gram approach produces a large number of ambiguous words, the results show that it performed better than automatic word segmentation.
  • Keywords
    pattern classification; text analysis; word processing; N-gram approach; automatic Chinese text classification; automatic word segmentation; character-based approach; classification performance; word segmentation error impact; word-based approach; Kernel; Machine learning; Manuals; Support vector machine classification; Text categorization; Training data; Chinese text classification/categorization; ICTCLAS; N-gram; support vector machine; word segmentation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type
    conf
  • DOI
    10.1109/DAS.2012.43
  • Filename
    6195377