DocumentCode
2452276
Title
A hybrid method to segment words
Author
Dai, Yubiao ; Ren, Xueli
Author_Institution
Dept. of Comput. Sci. & Eng., QuJing Normal Univ., Qujing, China
fYear
2012
fDate
16-18 July 2012
Firstpage
1131
Lastpage
1134
Abstract
Word segmentation is the foundation of machine translation, text classification and information searching. A hybrid method is proposed that combines dictionary-based word segmentation using reverse maximum matching with statistical word segmentation using suffix arrays. The input texts are first segmented with the dictionary-based reverse maximum matching method. Two-way suffix arrays are then constructed, the longest common prefixes are computed, and candidate words are selected by setting a threshold; the candidates are then filtered using mutual information to obtain the true words. Ambiguous texts are filtered using information entropy. The experiment shows that the accuracy of word segmentation can exceed 97%.
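The dictionary-based step named in the abstract, reverse maximum matching, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reverse_maximum_match`, the `max_len` window size, and the toy dictionary are all assumptions made for illustration, and the suffix-array, mutual-information and entropy stages are not shown.

```python
def reverse_maximum_match(text, dictionary, max_len=4):
    """Segment text right to left, taking the longest dictionary word at each step.

    Hypothetical helper for illustration; max_len is an assumed upper bound
    on word length, not a value taken from the paper.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest window ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


# Toy dictionary (assumed for this example only).
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(reverse_maximum_match("研究生命起源", toy_dictionary))
# → ['研究', '生命', '起源']
```

Scanning from the right avoids the split 研究生 / 命 / 起源 that a left-to-right longest match would produce with the same dictionary, which is the usual motivation for the reverse variant.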
Keywords
language translation; natural language processing; pattern classification; text analysis; common prefix; hybrid method; information entropy; information searching; input texts; machine translation; suffix array; text classification; word segmentation; Accuracy; Arrays; Dictionaries; Information filters; Matched filters; Sorting;
fLanguage
English
Publisher
IEEE
Conference_Titel
Audio, Language and Image Processing (ICALIP), 2012 International Conference on
Conference_Location
Shanghai
Print_ISBN
978-1-4673-0173-2
Type
conf
DOI
10.1109/ICALIP.2012.6376786
Filename
6376786