Khmer word segmentation based on Bi-directional Maximal Matching for Plaintext and Microsoft Word document

Author

Bi, Narin ; Taing, Nguonly

Author_Institution

R. Univ. of Phnom Penh, Phnom Penh, Cambodia

fYear

2014

fDate

9-12 Dec. 2014

Firstpage

1

Lastpage

9

Abstract

A key component of Khmer language processing is segmenting Khmer text into a sequence of separate Khmer words. Unlike languages written in Latin script, such as English or French, Khmer has no explicit word-boundary delimiter such as a blank space between words. Moreover, Khmer word formation is more complex, and the Khmer Unicode standard permits different orderings of character components that lead to the same visual representation: words that look identical but have different character orders. In addition, a Khmer word can be a compound of two or more Khmer words. These complications make determining word boundaries in Khmer challenging. To address these challenges and improve both accuracy and performance in Khmer word segmentation, this paper presents a study of Bi-directional Maximal Matching (BiMM) combined with Khmer clusters, Khmer Unicode character order correction, corpus list optimization to reduce the frequency of dictionary lookups, and Khmer text manipulation tweaks. The study also covers how to implement Khmer word segmentation for Khmer content in both plaintext and Microsoft Word documents. For Word documents, the implementation works on the currently active document and on Word document files. The study compares the implementation of BiMM with Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM), and with a similar algorithm from a previous study. The reported result is 98.13% accuracy with a running time of 2.581 seconds for Khmer content of 1,110,809 characters, which is about 160,000 Khmer words.
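The matching schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fmm`, `bmm`, `bimm`), the toy Latin-letter dictionary, and the tie-breaking rule (fewer words, then fewer out-of-dictionary pieces, then the backward result) are all assumptions chosen for illustration. The sketch matches on plain characters, whereas the paper matches on Khmer clusters after character order correction.

```python
def fmm(text, dictionary, max_len):
    """Forward Maximal Matching: scan left to right, longest match first."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def bmm(text, dictionary, max_len):
    """Backward Maximal Matching: scan right to left, longest match first."""
    words = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # pieces were collected from the end of the text
    return words


def bimm(text, dictionary, max_len):
    """Bi-directional Maximal Matching: run both scans and pick one result.

    The selection heuristic below is an assumption, not the paper's rule.
    """
    forward = fmm(text, dictionary, max_len)
    backward = bmm(text, dictionary, max_len)
    if forward == backward:
        return forward

    def cost(segmentation):
        unknown = sum(1 for w in segmentation if w not in dictionary)
        return (len(segmentation), unknown)

    # min() returns the first minimal item, so ties go to the backward result.
    return min([backward, forward], key=cost)


# Toy example: the two scans disagree on "abcd".
toy_dictionary = {"ab", "abc", "cd"}
print(fmm("abcd", toy_dictionary, 3))   # ['abc', 'd']
print(bmm("abcd", toy_dictionary, 3))   # ['ab', 'cd']
print(bimm("abcd", toy_dictionary, 3))  # ['ab', 'cd']
```

In this toy case the forward scan greedily takes "abc" and leaves an unknown "d", while the backward scan finds two dictionary words, so the combined result prefers the backward segmentation.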

Keywords

natural language processing; pattern matching; text analysis; word processing; BMM; BiMM; FMM; Khmer Unicode character order correction; Khmer clusters; Khmer language processing; Khmer word segmentation; Microsoft word document; backward maximal matching; bi-directional maximal matching; corpus list optimization; forward maximal matching; plaintext document; Abstracts; Accuracy; Bidirectional control; Decision support systems; Standards; Transforms; Visualization; Backward Maximal Matching; Bi-directional Maximal Matching; Forward Maximal Matching; Khmer Cluster; Khmer Unicode; Word Segmentation;

fLanguage

English

Publisher

IEEE

Conference_Titel

Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference (APSIPA)

Conference_Location

Siem Reap

Type

conf

DOI

10.1109/APSIPA.2014.7041822

Filename

7041822