• DocumentCode
    3102508
  • Title

    Extracting Thai Compounds Using Collocations and POS Bigram Probabilities without a POS Tagger

  • Author

    Aroonmanakun, Wirote

  • Author_Institution
    Dept. of Linguistics, Chulalongkorn Univ., Bangkok, Thailand
  • fYear
    2009
  • fDate
    7-9 Dec. 2009
  • Firstpage
    118
  • Lastpage
    122
  • Abstract
    This paper presents a simple method for extracting compounds using statistical collocations and POS bigram probabilities without a POS tagger. Statistical collocation was used to measure the strength of word co-occurrences. Probabilities of POS sequences, estimated from compounds found in the dictionary, were used to adjust the collocation strength within a candidate compound. Word bigrams and trigrams extracted from a corpus of 28 million words were ranked in two ways: by collocation score alone, and by collocation score weighted by POS pattern probability. Cutoff precision was calculated at every 200 points for both methods. The results showed that probabilities of POS sequences can increase the precision of compound extraction to a certain extent. The system extracts 2-word and 3-word compounds with precision of up to 63% and 35%, respectively. When bigram extractions that could be parts of trigram extractions are eliminated, precision rises to as much as 71%.
  • Keywords
    grammars; linguistics; natural language processing; probability; statistical analysis; word processing; POS bigram probabilities; POS sequences; Thai; bigram extractions; bigram word; compounds extraction; part-of-speech; statistical collocations; trigram extraction; trigram word; word co-occurrences strength; Data mining; Dictionaries; Filters; Frequency; Morphology; Mutual information; Natural languages; Probability; Speech processing; Statistical analysis; collocation; compound extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing, 2009. IALP '09. International Conference on
  • Conference_Location
    Singapore
  • Print_ISBN
    978-0-7695-3904-1
  • Type

    conf

  • DOI
    10.1109/IALP.2009.33
  • Filename
    5380782
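
The ranking and evaluation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the collocation measure or the weighting formula, so pointwise mutual information, the product of score and weight, and taking the maximum over a word's possible dictionary tags are all assumptions, and every function name here is hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Pointwise mutual information for adjacent word pairs.

    PMI is an assumed stand-in for the paper's collocation measure.
    Only pairs with a positive score are kept, so that multiplying
    by a weight in [0, 1] never promotes a weak candidate.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / (n - 1)
        p_x = unigrams[w1] / n
        p_y = unigrams[w2] / n
        score = math.log2(p_xy / (p_x * p_y))
        if score > 0:
            scores[(w1, w2)] = score
    return scores


def pos_pattern_weight(pair, lexicon, pattern_prob):
    """Best POS-bigram probability over the tags each word may take.

    No tagger is used: `lexicon` maps a word to its possible tags, and
    `pattern_prob` maps a tag pair to a probability estimated from
    dictionary compounds. Taking the maximum is an assumption.
    """
    w1, w2 = pair
    return max(
        (pattern_prob.get((t1, t2), 0.0)
         for t1 in lexicon.get(w1, ())
         for t2 in lexicon.get(w2, ())),
        default=0.0,
    )


def rank_candidates(scores, lexicon=None, pattern_prob=None):
    """Rank by collocation score, optionally weighted by POS pattern."""
    if lexicon is None:
        weighted = dict(scores)
    else:
        weighted = {
            pair: s * pos_pattern_weight(pair, lexicon, pattern_prob)
            for pair, s in scores.items()
        }
    return sorted(weighted, key=weighted.get, reverse=True)


def cutoff_precision(ranked, gold, step=200):
    """Precision of the top-k candidates at k = step, 2*step, ..."""
    hits = 0
    points = []
    for k, candidate in enumerate(ranked, 1):
        hits += candidate in gold
        if k % step == 0:
            points.append((k, hits / k))
    return points
```

Calling `rank_candidates(scores)` and `rank_candidates(scores, lexicon, pattern_prob)` on the same scores, then passing each ranking to `cutoff_precision` with a set of known compounds, gives the kind of side-by-side comparison of the two rankings that the abstract reports.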