DocumentCode
3101280
Title
BEST Corpus Development and Analysis
Author
Boriboon, Monthika ; Kriengket, Kanyanut ; Chootrakool, Patcharika ; Phaholphinyo, Sitthaa ; Purodakananda, Sumonmas ; Thanakulwarapas, Tipraporn ; Kosawat, Krit
Author_Institution
Human Language Technol. Lab., Nat. Electron. & Comput. Technol. Center, Pathumthani, Thailand
fYear
2009
fDate
7-9 Dec. 2009
Firstpage
322
Lastpage
327
Abstract
This document describes the development process of the BEST 2009 word-segmented corpus, the first corpus built to benchmark Thai word segmentation software. The corpus comprises four genres: news, novels, encyclopedia, and academic articles. It contains 509 files totaling 64.1 MB, with 5,036,229 tokens, of which 83,027 are unique. The 4,556 tokens common to all genres cover 85.13% of the corpus. The most frequent token in the corpus is ที่ /thi2/. The 50 most frequent tokens cover 37.65% of the corpus, and the 119 most frequent tokens account for about 50% of it. All tokens are grouped into 8 categories. Except for the Thai spelling category, the categories play different major roles in specific genres.
Keywords
linguistics; word processing; BEST corpus development; Thai word segmentation software; word segmented corpus; Data analysis; Data mining; Encyclopedias; Frequency; Guidelines; Humans; Laboratories; Natural languages; Speech synthesis; Thai language; corpus annotation; word-segmented corpus
fLanguage
English
Publisher
ieee
Conference_Titel
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location
Singapore
Print_ISBN
978-0-7695-3904-1
Type
conf
DOI
10.1109/IALP.2009.76
Filename
5380726
Link To Document