Title :
Automatic word spacing using syllable n-gram and word statistics
Author :
Kang, Mi-Young ; Choi, Sung-Ja ; Heo, Hee-Keun ; Lim, Sung-Shin ; Kwon, Hyuk-Chul
Author_Institution :
Sch. of Electr. & Comput. Eng., Pusan Nat. Univ., South Korea
Abstract :
In this study, we propose an automatic word spacing system for the Korean language that uses syllable n-gram and word statistics extracted from a large amount of processed corpora. The optimal spacing points of a sentence are decided mainly by the Viterbi algorithm. Because the performance of statistical approaches is sensitive to the training corpus and suffers from data sparseness, we enlarged the training corpora, used parameters found by examining test data, and proposed a method for adjusting the 'longest match strategy' based on the viable prefix. These measures increase the system's accuracy. Our corpora, covering various language registers, comprise 33,643,884 words. The pilot test was conducted with test data derived from different sources. On average, 94.24% precision in word-unit correction was obtained on the spacing test data.
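Illustrative sketch (not taken from the paper): the abstract does not give the model details, so the following Python example only shows one common way to frame spacing as a Viterbi search. It assumes each gap between adjacent syllables is tagged 1 (space) or 0 (no space), scored by a syllable-bigram gap probability and a tag-transition probability. The function name viterbi_spacing, the tables gap_prob and trans_prob, and all probability values are hypothetical; the word statistics and the 'longest match strategy' adjustment mentioned above are not modelled.

```python
import math


def viterbi_spacing(syllables, gap_prob, trans_prob):
    """Insert spaces at the most likely gaps between syllables.

    gap_prob[(a, b)] : assumed P(space between syllables a and b)
    trans_prob[p][t] : assumed P(tag t | previous tag p), tags are 0 or 1
    """
    n_gaps = len(syllables) - 1
    if n_gaps <= 0:
        return "".join(syllables)

    def emit(i, tag):
        # Unseen syllable pairs fall back to 0.5 (an arbitrary choice).
        p = gap_prob.get((syllables[i], syllables[i + 1]), 0.5)
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep log() finite
        return math.log(p if tag == 1 else 1.0 - p)

    # best[t]: best log score of any tag sequence ending with tag t
    best = [emit(0, 0), emit(0, 1)]
    back = []  # back[i][t]: best previous tag for tag t at gap i + 1
    for i in range(1, n_gaps):
        new_best, ptr = [], []
        for t in (0, 1):
            cands = [best[p] + math.log(trans_prob[p][t]) for p in (0, 1)]
            p_star = 0 if cands[0] >= cands[1] else 1
            new_best.append(cands[p_star] + emit(i, t))
            ptr.append(p_star)
        best = new_best
        back.append(ptr)

    # Trace the best path backwards from the final gap.
    tag = 0 if best[0] >= best[1] else 1
    tags = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        tags.append(tag)
    tags.reverse()

    out = [syllables[0]]
    for i, t in enumerate(tags):
        if t == 1:
            out.append(" ")
        out.append(syllables[i + 1])
    return "".join(out)


# Toy, hand-picked probabilities (hypothetical, not corpus estimates).
toy_gap_prob = {
    ("나", "는"): 0.1, ("는", "학"): 0.9, ("학", "교"): 0.05,
    ("교", "에"): 0.1, ("에", "간"): 0.9, ("간", "다"): 0.05,
}
toy_trans_prob = [[0.6, 0.4], [0.8, 0.2]]

print(viterbi_spacing(list("나는학교에간다"), toy_gap_prob, toy_trans_prob))
```

In a real system the two tables would be estimated from a spaced training corpus, and smoothing would matter far more than the fixed 0.5 fallback used here.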
Keywords :
linguistics; maximum likelihood estimation; natural languages; word processing; Korean language; Viterbi algorithm; automatic word spacing system; data sparseness problem; syllable n-gram statistics; training corpora; word statistics; Error correction; Frequency estimation; Information retrieval; Natural language processing; Speech synthesis; Statistical analysis; Statistics; Testing
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location :
Beijing, China
Print_ISBN :
0-7803-7902-0
DOI :
10.1109/NLPKE.2003.1275942