BEST Corpus Development and Analysis

Author

Boriboon, Monthika ; Kriengket, Kanyanut ; Chootrakool, Patcharika ; Phaholphinyo, Sitthaa ; Purodakananda, Sumonmas ; Thanakulwarapas, Tipraporn ; Kosawat, Krit

Author_Institution

Human Language Technol. Lab., Nat. Electron. & Comput. Technol. Center, Pathumthani, Thailand

fYear

2009

fDate

7-9 Dec. 2009

Firstpage

322

Lastpage

327

Abstract

This document describes the development of the BEST 2009 word-segmented corpus, the first corpus for benchmarking Thai word segmentation software. The corpus covers four genres: news, novels, encyclopedia entries, and academic articles. It contains 509 files totaling 64.1 MB, with 5,036,229 tokens and 83,027 unique tokens. The 4,556 tokens that appear in all four genres cover 85.13% of the corpus. The most frequent token in the corpus is ที่ /thi2/. The 50 most frequent tokens cover 37.65% of the corpus, and the 119 most frequent tokens account for about 50% of it. All tokens are grouped into 8 categories. Apart from the Thai spelling category, the categories play different major roles in specific genres.

Keywords

linguistics; word processing; BEST corpus development; Thai word segmentation software; word segmented corpus; Data analysis; Data mining; Encyclopedias; Frequency; Guidelines; Humans; Laboratories; Natural languages; Speech synthesis; Thai language; corpus annotation; word-segmented corpus

fLanguage

English

Publisher

IEEE

Conference_Titel

Asian Language Processing, 2009. IALP '09. International Conference on

Conference_Location

Singapore

Print_ISBN

978-0-7695-3904-1

Type

conf

DOI

10.1109/IALP.2009.76

Filename

5380726