Title :
Khmer word segmentation based on Bi-directional Maximal Matching for Plaintext and Microsoft Word document
Author :
Bi, Narin ; Taing, Nguonly
Author_Institution :
Royal Univ. of Phnom Penh, Phnom Penh, Cambodia
Abstract :
A key component of Khmer language processing is transforming Khmer text into a series of separate Khmer words. Unlike Latin-script languages such as English or French, Khmer has no explicit word boundary delimiter, such as a blank space, between words. Moreover, Khmer word formation is more complex: the Khmer Unicode standard ordering of character components permits different orders that produce the same visual representation, so two words can look identical yet differ in character order. In addition, a Khmer word can be a compound of two or more Khmer words. These complications make it challenging to determine word boundaries in Khmer word segmentation. In response to these challenges, and to improve the accuracy and performance of Khmer word segmentation, this paper presents a study of Bi-directional Maximal Matching (BiMM) with Khmer Clusters, Khmer Unicode character order correction, corpus list optimization to reduce the frequency of dictionary lookups, and Khmer text manipulation tweaks. The study also covers how to implement Khmer word segmentation for Khmer content in both Plaintext and Microsoft Word documents. For Word documents, the implementation works on the currently active Word document and on Word document files. The study compares the implementation of BiMM with Forward Maximal Matching (FMM) and Backward Maximal Matching (BMM), and with a similar algorithm from a previous study. The study reports an accuracy of 98.13% with a processing time of 2.581 seconds for Khmer content of 1,110,809 characters, which is about 160,000 Khmer words.
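The three matching strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `fmm`, `bmm`, and `bimm`, the toy Latin-letter lexicon, and the tie-breaking rule (fewer words first, then fewer single-character tokens, then the forward result) are assumptions chosen for illustration. The sketch also matches on single characters rather than Khmer Clusters and omits the Unicode order correction and corpus list optimization the paper describes.

```python
def fmm(text, lexicon, max_len):
    """Forward maximal matching: scan left to right, taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted as a fallback for unknown text.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon, max_len):
    """Backward maximal matching: scan right to left, taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def bimm(text, lexicon):
    """Run both directions and keep the segmentation that looks better.

    Assumed heuristic (not taken from the paper): prefer fewer words,
    then fewer single-character tokens; on a full tie keep the forward result.
    """
    max_len = max(map(len, lexicon), default=1)
    forward = fmm(text, lexicon, max_len)
    backward = bmm(text, lexicon, max_len)

    def cost(segmentation):
        singles = sum(1 for word in segmentation if len(word) == 1)
        return (len(segmentation), singles)

    return min((forward, backward), key=cost)


# Toy example with a hypothetical lexicon; the two directions disagree here.
toy_lexicon = {"abc", "ab", "cd"}
print(fmm("abcd", toy_lexicon, 3))   # ['abc', 'd']
print(bmm("abcd", toy_lexicon, 3))   # ['ab', 'cd']
print(bimm("abcd", toy_lexicon))     # ['ab', 'cd']
```

In this toy case the forward pass greedily takes "abc" and strands "d", while the backward pass finds two lexicon words, so the combined result keeps the backward segmentation. A real Khmer segmenter would iterate over clusters instead of code points and use a much larger word list.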
Keywords :
natural language processing; pattern matching; text analysis; word processing; BMM; BiMM; FMM; Khmer Unicode character order correction; Khmer clusters; Khmer language processing; Khmer word segmentation; Microsoft word document; backward maximal matching; bi-directional maximal matching; corpus list optimization; forward maximal matching; plaintext document; Abstracts; Accuracy; Bidirectional control; Decision support systems; Standards; Transforms; Visualization; Backward Maximal Matching; Bi-directional Maximal Matching; Forward Maximal Matching; Khmer Cluster; Khmer Unicode; Word Segmentation;
Conference_Title :
Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference (APSIPA)
Conference_Location :
Siem Reap
DOI :
10.1109/APSIPA.2014.7041822