• DocumentCode
    3101280
  • Title
    BEST Corpus Development and Analysis
  • Author
    Boriboon, Monthika ; Kriengket, Kanyanut ; Chootrakool, Patcharika ; Phaholphinyo, Sitthaa ; Purodakananda, Sumonmas ; Thanakulwarapas, Tipraporn ; Kosawat, Krit
  • Author_Institution
    Human Language Technology Laboratory, National Electronics and Computer Technology Center, Pathumthani, Thailand
  • fYear
    2009
  • fDate
    7-9 Dec. 2009
  • Firstpage
    322
  • Lastpage
    327
  • Abstract
    This document describes the development process of the BEST 2009 word-segmented corpus, the first corpus for benchmarking Thai word segmentation software. The corpus covers four genres: news, novels, encyclopedia, and academic articles. It contains 509 files totaling 64.1 MB, with 5,036,229 tokens, of which 83,027 are unique. The 4,556 tokens common to all genres cover 85.13% of the corpus. The most frequent token in the corpus is ที่ /thi2/. The 50 most frequent tokens cover 37.65% of the corpus, and the 119 most frequent tokens account for about 50%. All tokens are grouped into 8 categories. Except for the Thai spelling category, the other categories each play a major role in specific genres.
  • Keywords
    linguistics; word processing; BEST corpus development; Thai word segmentation software; word segmented corpus; Data analysis; Data mining; Encyclopedias; Frequency; Guidelines; Humans; Laboratories; Natural languages; Speech synthesis; Thai language; corpus annotation; word-segmented corpus;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing, 2009. IALP '09. International Conference on
  • Conference_Location
    Singapore
  • Print_ISBN
    978-0-7695-3904-1
  • Type
    conf
  • DOI
    10.1109/IALP.2009.76
  • Filename
    5380726