• DocumentCode
    2897997
  • Title
    Chinese Word Segmentation and Out-of-Vocabulary Words Detection Using Suffix Array
  • Author
    Wenyan, Ji ; Tao, Peng ; Wanli, Zuo ; Fengling, He ; Huifeng, Zhu
  • Author_Institution
    Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
  • fYear
    2009
  • fDate
    7-8 Nov. 2009
  • Firstpage
    56
  • Lastpage
    60
  • Abstract
    Chinese word segmentation currently relies on two mature approaches: dictionary-based methods and statistical methods. Each has its own relative merits. To achieve better performance, this paper proposes an algorithm that integrates an amended dictionary with a suffix array to detect out-of-vocabulary (OOV) words. The essential idea is to use the suffix array to extract medium- and high-frequency lexical items accurately and efficiently, while the amended dictionary is used to extract the remaining items. The storage structure of the dictionary is modified to speed up segmentation. Experiments show that the method effectively improves both integrity and accuracy.
  • Keywords
    dictionaries; natural language processing; text analysis; vocabulary; Chinese word segmentation; amended dictionary; dictionary-based method; lexical items; out-of-vocabulary word detection; statistical method; suffix array; Computer science; Data mining; Dictionaries; Educational institutions; Frequency; Helium; Information processing; Information systems; Statistical analysis; Chinese information processing; Chinese word segmentation based on dictionary; HashMap
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Information Systems and Mining, 2009. WISM 2009. International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-0-7695-3817-4
  • Type
    conf
  • DOI
    10.1109/WISM.2009.19
  • Filename
    5368332
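
The abstract describes using a suffix array to pull out medium- and high-frequency lexical items and a dictionary to handle the rest. The record gives no implementation details, so the following is only a minimal sketch of that general idea, not the paper's algorithm. The function names, the naive suffix-array construction, the length and frequency thresholds, and the use of a Python set as the hash-based dictionary are all assumptions made for illustration.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction (sorts full suffixes); fine for a short illustration,
    too slow for a real corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def frequent_substrings(text, min_len=2, max_len=4, min_freq=2):
    """Count substrings of length min_len..max_len occurring >= min_freq times.

    Suffixes that share a prefix are adjacent in the suffix array, so each
    repeated substring shows up as a run of consecutive entries.
    The thresholds are hypothetical defaults, not values from the paper.
    """
    sa = build_suffix_array(text)
    counts = {}
    for n in range(min_len, max_len + 1):
        run_prefix, run_count = None, 0
        for pos in sa:
            if pos + n > len(text):
                continue  # this suffix is shorter than n characters
            prefix = text[pos:pos + n]
            if prefix == run_prefix:
                run_count += 1
            else:
                if run_prefix is not None and run_count >= min_freq:
                    counts[run_prefix] = run_count
                run_prefix, run_count = prefix, 1
        # flush the last run for this length
        if run_prefix is not None and run_count >= min_freq:
            counts[run_prefix] = run_count
    return counts


def oov_candidates(text, dictionary, **kwargs):
    """Frequent substrings that are absent from the dictionary (a set of words)."""
    return {w: c for w, c in frequent_substrings(text, **kwargs).items()
            if w not in dictionary}


if __name__ == "__main__":
    sample = "吉林大学在长春。吉林大学很大。"
    known = {"吉林", "大学", "长春"}
    print(oov_candidates(sample, known))
```

In this toy input, "吉林大学" repeats and is not in the dictionary, so it surfaces as an OOV candidate, along with fragments such as "林大" that a real system would need to filter out by further statistical or linguistic checks.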