Title :
Post-processing method of unknown word segmentation based on statistic of word frequency
Author :
Liu, Honglei ; Wang, Zhongjian
Author_Institution :
School of Computer and Information Engineering, Harbin University of Commerce, 150028, China
Abstract :
Unknown word recognition is a difficult problem in word segmentation. This paper proposes a post-processing method that uses word frequency statistics drawn from a segmented corpus, a closed corpus, and an open corpus (network resources) to recognize unknown words after general segmentation. The method re-segments the character fragments left over by general segmentation in order to recognize the unknown words that the general segmenter missed. It combines the maximum matching method with a statistical method that calculates string frequencies. Matching the character fragments against all three kinds of corpus helps to recognize more unknown words. Experiments show that the improved method increases recall by 12.14% and precision by 6.67%, which indicates that it is effective for unknown word segmentation.
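The pipeline described in the abstract (general segmentation by maximum matching, collection of the leftover character fragments, then a frequency check against corpora) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_len` and `min_freq` parameters, the toy lexicon, and the use of a single list of raw text in place of the three corpora are all assumptions made for illustration.

```python
from collections import Counter


def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    Characters that start no lexicon word are emitted as single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def character_fragments(tokens, lexicon):
    """Join runs of adjacent out-of-lexicon single characters into fragments."""
    fragments, run = [], ""
    for tok in tokens:
        if len(tok) == 1 and tok not in lexicon:
            run += tok
        else:
            if len(run) > 1:
                fragments.append(run)
            run = ""
    if len(run) > 1:
        fragments.append(run)
    return fragments


def candidate_words(fragment, corpus_texts, min_freq=2, max_len=4):
    """Keep substrings (length >= 2) seen at least min_freq times in the corpus.

    min_freq is a hypothetical threshold; the abstract does not state one.
    """
    counts = Counter()
    for n in range(2, min(max_len, len(fragment)) + 1):
        for start in range(len(fragment) - n + 1):
            sub = fragment[start:start + n]
            counts[sub] = sum(doc.count(sub) for doc in corpus_texts)
    return {word: freq for word, freq in counts.items() if freq >= min_freq}


if __name__ == "__main__":
    lexicon = {"我们", "使用", "搜索"}            # toy dictionary (assumed)
    corpus = ["谷歌发布新产品", "他在谷歌工作"]    # stand-in for the corpora (assumed)
    tokens = forward_max_match("我们使用谷歌搜索", lexicon)
    for fragment in character_fragments(tokens, lexicon):
        print(fragment, candidate_words(fragment, corpus))
```

In this toy run the general segmenter leaves the characters of "谷歌" as isolated tokens, the fragment step joins them, and the frequency check accepts the string because it appears twice in the sample corpus. A fuller version would consult the segmented, closed, and open corpora separately, as the abstract describes.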
Keywords :
Character recognition; Computational linguistics; Dictionaries; Heuristic algorithms; Mutual information; Probability; Statistical analysis; Maximum Match Method; character fragment; unknown words segmentation; word frequency;
Conference_Title :
Information Science and Engineering (ICISE), 2010 2nd International Conference on
Conference_Location :
Hangzhou, China
Print_ISBN :
978-1-4244-7616-9
DOI :
10.1109/ICISE.2010.5688599