DocumentCode
3102508
Title
Extracting Thai Compounds Using Collocations and POS Bigram Probabilities without a POS Tagger
Author
Aroonmanakun, Wirote
Author_Institution
Dept. of Linguistics, Chulalongkorn Univ., Bangkok, Thailand
fYear
2009
fDate
7-9 Dec. 2009
Firstpage
118
Lastpage
122
Abstract
This paper presents a simple method for extracting compounds using statistical collocations and POS bigram probabilities without a POS tagger. Statistical collocation was used to measure the strength of word co-occurrences. Probabilities of POS sequences, estimated from compounds found in the dictionary, were used to adjust the collocation strength within a possible compound. Bigram and trigram words extracted from a corpus of 28 million words were ranked in two ways: by collocation scores alone, and by collocation scores weighted by POS pattern probabilities. Cutoff precision was calculated at every 200 points for both methods. The results showed that probabilities of POS sequences could increase the precision of compound extraction to a certain extent. The system can extract 2-word and 3-word compounds with precision of up to 63% and 35%, respectively. When bigram extractions that could be parts of trigram extractions are eliminated, the precision rate increases to as much as 71%.
Keywords
grammars; linguistics; natural language processing; probability; statistical analysis; word processing; POS bigram probabilities; POS sequences; Thai; bigram extractions; bigram word; compounds extraction; part-of-speech; statistical collocations; trigram extraction; trigram word; word co-occurrences strength; Data mining; Dictionaries; Filters; Frequency; Morphology; Mutual information; Natural languages; Probability; Speech processing; Statistical analysis; Thai; collocation; compound extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location
Singapore
Print_ISBN
978-0-7695-3904-1
Type
conf
DOI
10.1109/IALP.2009.33
Filename
5380782