Regional Information Center for Science and Technology - Smoothing Algorithm for N-Gram Model Using Agglutinative Characteristic of Korean

DocumentCode :

3466721

Title :

Smoothing Algorithm for N-Gram Model Using Agglutinative Characteristic of Korean

Author :

Park, Jae-Hyun ; Song, Young-In ; Rim, Hae-Chang

Author_Institution :

Korea Univ., Seoul

fYear :

2007

fDate :

17-19 Sept. 2007

Firstpage :

397

Lastpage :

404

Abstract :

Smoothing for an n-gram language model is an algorithm that assigns a non-zero probability to an unseen n-gram. It is an essential technique for n-gram language models because of the data sparseness problem. However, in some circumstances it assigns an improper amount of probability to unseen n-grams. In this paper, we present a novel method that adjusts the improperly assigned probabilities of unseen n-grams by taking advantage of the agglutinative characteristics of the Korean language. In Korean, the grammatically proper class of a morpheme can be predicted from the previous morpheme. Using this characteristic, we try to prevent grammatically improper n-grams from receiving relatively high probability and to assign more probability mass to proper n-grams. Experimental results show that the proposed method achieves perplexity reductions of 8.6% to 12.5% for the Katz backoff algorithm and 4.9% to 7.0% for Kneser-Ney smoothing.
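
The idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes a simple absolute-discounting backoff bigram model (rather than Katz backoff or Kneser-Ney), a hypothetical romanized toy corpus, and a made-up table of allowed morpheme-class transitions. The names `bigram_prob`, `class_weight`, `tag`, `allowed`, and `penalty` are all illustrative assumptions. The sketch only shows how the probability mass reserved for unseen bigrams might be shifted toward grammatically proper continuations while the distribution still sums to one.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams in a token sequence."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, unigrams, bigrams, weight=None, discount=0.5):
    """Absolute-discounting backoff estimate of P(word | prev).

    `weight` optionally rescales each unseen word's share of the
    reserved mass; with weight=None this is plain backoff.
    """
    if weight is None:
        weight = lambda w: 1.0
    followers = {w: c for (p, w), c in bigrams.items() if p == prev}
    prev_count = sum(followers.values())
    if prev_count == 0:
        # No history for `prev`: fall back to the unigram estimate.
        return unigrams[word] / sum(unigrams.values())
    if word in followers:
        # Seen bigram: discounted relative frequency.
        return (followers[word] - discount) / prev_count
    # Unseen bigram: share the reserved mass among unseen words.
    reserved = discount * len(followers) / prev_count
    unseen_mass = sum(unigrams[w] * weight(w)
                      for w in unigrams if w not in followers)
    if unseen_mass == 0:
        return 0.0
    return reserved * unigrams[word] * weight(word) / unseen_mass

def class_weight(prev, tag, allowed, penalty=0.1):
    """Down-weight words whose class may not follow the class of `prev`."""
    return lambda w: 1.0 if (tag.get(prev), tag.get(w)) in allowed else penalty

# Hypothetical toy data: N = noun, P = particle, V = verb stem, E = ending.
tokens = "hakgyo e ga da jip eseo o da".split()
tag = {"hakgyo": "N", "jip": "N", "e": "P", "eseo": "P",
       "ga": "V", "o": "V", "da": "E"}
allowed = {("N", "P"), ("P", "V"), ("V", "E"), ("E", "N")}  # assumed toy grammar

unigrams, bigrams = train(tokens)
w = class_weight("hakgyo", tag, allowed)
plain = {v: bigram_prob("hakgyo", v, unigrams, bigrams) for v in unigrams}
aware = {v: bigram_prob("hakgyo", v, unigrams, bigrams, weight=w) for v in unigrams}

for v in ("eseo", "da"):
    print(v, round(plain[v], 4), round(aware[v], 4))
```

In this toy setting, the unseen but grammatically plausible bigram (noun followed by the particle `eseo`) gains probability under the class-aware weighting, while the unseen noun-plus-ending bigram loses it. The `penalty` value is an arbitrary choice made for the illustration; the paper's actual adjustment scheme is not described in this record.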

Keywords :

natural languages; probability; Katz backoff algorithm; Korean agglutinative characteristic; data sparseness problem; n-gram language model; non-zero probability; smoothing algorithm; Computer science; Information resources; Natural languages; Smoothing methods; Testing; Training data; Vocabulary;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Semantic Computing, 2007. ICSC 2007. International Conference on

Conference_Location :

Irvine, CA

Print_ISBN :

978-0-7695-2997-4

Type :

conf

DOI :

10.1109/ICSC.2007.66

Filename :

4338374

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3466721