Title :
BEST Corpus Development and Analysis
Author :
Boriboon, Monthika ; Kriengket, Kanyanut ; Chootrakool, Patcharika ; Phaholphinyo, Sitthaa ; Purodakananda, Sumonmas ; Thanakulwarapas, Tipraporn ; Kosawat, Krit
Author_Institution :
Human Language Technol. Lab., Nat. Electron. & Comput. Technol. Center, Pathumthani, Thailand
Abstract :
This document describes the development of the BEST 2009 word-segmented corpus, the first corpus built to benchmark Thai word segmentation software. The corpus covers four genres: news, novels, encyclopedia entries, and academic articles. It contains 509 files totaling 64.1 MB, with 5,036,229 tokens, of which 83,027 are unique. A total of 4,556 tokens appear in all four genres, and together they cover 85.13% of the corpus. The most frequent token in the corpus is ที่ /thi2/. The 50 most frequent tokens cover 37.65% of the corpus, and the 119 most frequent tokens account for about 50% of it. All tokens are grouped into 8 categories. Apart from the Thai spelling category, the categories play different major roles in specific genres.
Keywords :
linguistics; word processing; BEST corpus development; Thai word segmentation software; word segmented corpus; Data analysis; Data mining; Encyclopedias; Frequency; Guidelines; Humans; Laboratories; Natural languages; Speech synthesis; Thai language; corpus annotation; word-segmented corpus;
Conference_Title :
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-0-7695-3904-1
DOI :
10.1109/IALP.2009.76
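
Note :
The coverage figures quoted in the abstract (the share of the corpus taken up by the top-ranked tokens, the number of top-ranked tokens needed to reach about 50%, and the tokens shared by every genre) can be illustrated with a minimal Python sketch. This is not the authors' tooling: the function names and the toy token lists below are hypothetical, and the sketch assumes the corpus has already been loaded as lists of segmented tokens.

```python
from collections import Counter


def coverage_of_top_k(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


def tokens_needed_for_coverage(tokens, target):
    """Smallest number of top-ranked tokens whose occurrences reach `target` coverage."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)


def common_tokens(tokens_by_genre):
    """Set of unique tokens that occur in every genre."""
    token_sets = [set(tokens) for tokens in tokens_by_genre.values()]
    return set.intersection(*token_sets) if token_sets else set()


# Toy data (made up for illustration; not taken from the BEST corpus).
toy_by_genre = {
    "news": ["a", "a", "b"],
    "novel": ["a", "b", "c"],
}
toy_all = [token for tokens in toy_by_genre.values() for token in tokens]

top1_share = coverage_of_top_k(toy_all, 1)
rank_for_80 = tokens_needed_for_coverage(toy_all, 0.8)
shared = common_tokens(toy_by_genre)
```

Applied to the real corpus, `coverage_of_top_k(all_tokens, 50)` would correspond to the 37.65% figure and `tokens_needed_for_coverage(all_tokens, 0.5)` to the 119-token figure, provided the tokenization and counting conventions match those used by the authors.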