DocumentCode
2897997
Title
Chinese Word Segmentation and Out-of-Vocabulary Words Detection Using Suffix Array
Author
Wenyan, Ji ; Tao, Peng ; Wanli, Zuo ; Fengling, He ; Huifeng, Zhu
Author_Institution
Coll. of Comput. Sci. & Technol., Jilin Univ., Changchun, China
fYear
2009
fDate
7-8 Nov. 2009
Firstpage
56
Lastpage
60
Abstract
Current Chinese word segmentation relies on two mature approaches: dictionary-based methods and statistical methods, each with its own merits. To improve performance, this paper proposes an algorithm that combines an amended dictionary with a suffix array to detect out-of-vocabulary (OOV) words. The essential idea is to use the suffix array to extract medium- and high-frequency lexical items accurately and efficiently, while the amended dictionary extracts the remaining items. We also modify the storage structure of the dictionary to speed up segmentation. Experiments show that this method effectively improves the integrity and accuracy of segmentation.
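The suffix-array step mentioned in the abstract can be illustrated with a minimal sketch. This record does not give the paper's actual algorithm, so everything below is an assumption made for illustration: the function names `build_suffix_array` and `frequent_substrings`, the naive construction, and the `min_freq`, `min_len`, and `max_len` thresholds are hypothetical. The sketch only shows the general property that makes suffix arrays useful here: suffixes sharing a prefix sit next to each other in sorted order, so the frequency of a candidate string is the length of a contiguous run.

```python
from itertools import groupby


def build_suffix_array(text):
    """Return the start indices of all suffixes of text in lexicographic order."""
    # Naive construction by sorting the suffixes directly; adequate for a
    # short illustration, not for large corpora.
    return sorted(range(len(text)), key=lambda i: text[i:])


def frequent_substrings(text, min_freq=2, min_len=2, max_len=4):
    """Map each substring of length min_len..max_len to its count,
    keeping only those that occur at least min_freq times.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    sa = build_suffix_array(text)
    counts = {}
    for n in range(min_len, max_len + 1):
        # Keep only suffixes long enough to have a prefix of length n.
        # Their length-n prefixes remain in sorted order, so equal
        # prefixes form contiguous runs.
        prefixes = [text[i:i + n] for i in sa if i + n <= len(text)]
        for prefix, run in groupby(prefixes):
            freq = sum(1 for _ in run)
            if freq >= min_freq:
                counts[prefix] = freq
    return counts
```

Candidates found this way that are missing from the dictionary could then be treated as possible OOV words; how the paper filters and ranks them is not stated in this record.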
Keywords
dictionaries; natural language processing; text analysis; vocabulary; Chinese word segmentation; amended dictionary; dictionary-based method; lexical items; out-of-vocabulary word detection; statistical method; suffix array; Computer science; Data mining; Dictionaries; Educational institutions; Frequency; Helium; Information processing; Information systems; Statistical analysis; Chinese information processing; Chinese word segmentation based on dictionary; HashMap
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Information Systems and Mining, 2009. WISM 2009. International Conference on
Conference_Location
Shanghai
Print_ISBN
978-0-7695-3817-4
Type
conf
DOI
10.1109/WISM.2009.19
Filename
5368332