Title :
Chinese Word Segmentation and Out-of-Vocabulary Words Detection Using Suffix Array
Author :
Wenyan, Ji ; Tao, Peng ; Wanli, Zuo ; Fengling, He ; Huifeng, Zhu
Author_Institution :
Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
Abstract :
Current Chinese word segmentation relies mainly on two mature approaches: dictionary-based methods and statistical methods. Each has its own merits. To achieve better performance, this paper proposes an algorithm that combines an amended dictionary with a suffix array to detect out-of-vocabulary (OOV) words. The essential idea is to use the suffix array to extract medium- and high-frequency lexical items accurately and efficiently, while the amended dictionary is used to extract the remaining items. We modify the storage structure of the dictionary to improve segmentation speed. Experiments show that this method effectively improves both the integrity and the accuracy of segmentation.
Keywords :
dictionaries; natural language processing; text analysis; vocabulary; Chinese word segmentation; amended dictionary; dictionary-based method; lexical items; out-of-vocabulary word detection; statistical method; suffix array; Computer science; Data mining; Dictionaries; Educational institutions; Frequency; Helium; Information processing; Information systems; Statistical analysis; Chinese information processing; Chinese word segmentation based on dictionary; HashMap
Conference_Title :
Web Information Systems and Mining, 2009. WISM 2009. International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-0-7695-3817-4
DOI :
10.1109/WISM.2009.19
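Illustrative_Sketch :
The abstract describes two cooperating parts: a suffix array that surfaces medium- and high-frequency strings as OOV candidates, and a dictionary with a modified storage structure for fast matching. The Python sketch below is only one possible reading of that idea, not the authors' algorithm. Every function name, the frequency threshold, the maximum word length, the first-character hash index, and the use of forward maximum matching are assumptions made for illustration; the record does not specify any of them. The suffix array is built naively, and no filtering of spurious repeated substrings is attempted.

```python
from collections import defaultdict


def build_suffix_array(text):
    # Naive construction: sort suffix start positions by suffix content.
    # Adequate for a short sketch; real systems use faster algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def frequent_substrings(text, min_len=2, max_len=4, min_freq=2):
    # Suffixes sharing a prefix of a given length are adjacent in the
    # suffix array, so one pass per length counts every repeated string.
    sa = build_suffix_array(text)
    counts = {}
    for length in range(min_len, max_len + 1):
        prev, run = None, 0
        for i in sa:
            if i + length > len(text):
                continue  # suffix too short to supply a prefix of this length
            cur = text[i:i + length]
            if cur == prev:
                run += 1
            else:
                if prev is not None and run >= min_freq:
                    counts[prev] = run
                prev, run = cur, 1
        if prev is not None and run >= min_freq:
            counts[prev] = run
    return counts


def build_index(words):
    # Hypothetical storage layout: hash map from first character to the
    # set of words starting with it, so lookups touch one small bucket.
    index = defaultdict(set)
    for word in words:
        index[word[0]].add(word)
    return index


def segment(text, index, max_len=4):
    # Forward maximum matching (assumed here): take the longest indexed
    # word at each position, else emit a single character.
    tokens, i = [], 0
    while i < len(text):
        bucket = index.get(text[i], ())
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in bucket:
                break
        else:
            length = 1
        tokens.append(text[i:i + length])
        i += length
    return tokens


def segment_with_oov(text, dictionary, min_freq=2, max_len=4):
    # Repeated strings missing from the dictionary are treated as OOV
    # candidates and merged into the index before segmenting.
    repeated = frequent_substrings(text, 2, max_len, min_freq)
    oov = {w for w in repeated if w not in dictionary}
    index = build_index(set(dictionary) | oov)
    return segment(text, index, max_len), oov
```

Calling `segment_with_oov(text, dictionary)` returns the token list together with the set of OOV candidates. Because every repeated substring becomes a candidate, the candidate set also contains fragments of longer words; a practical system would need an additional filtering step, which this sketch leaves out.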