Learning a stochastic part of speech tagger for Sinhala

Author

Jayasuriya, M. ; Weerasinghe, A.R.

Author_Institution

Virtusa (Pvt) Ltd., Colombo, Sri Lanka

fYear

2013

fDate

11-15 Dec. 2013

Firstpage

137

Lastpage

143

Abstract

This paper presents the results of developing a part-of-speech (POS) tagger for Sinhala. The tagger handles lexical items with multiple POS tags and also predicts POS tags for previously unseen words. A stochastic approach, a Hidden Markov Model (HMM) with tri-gram probabilities, was used as the training and tagging model. Linear interpolation is used to smooth the tri-gram probabilities, and the Viterbi algorithm is used to decode the HMM and select the best POS tag for each word. The tagger learns the lexical items (words and their possible POS tags) and the tri-gram probabilities from a POS-tag-annotated corpus. The tagger achieved an overall accuracy of 62%. Approximately 24% of the errors were for words whose POS tags were unknown in the corpus. The lack of a named entity recognizer also contributed 10% of the overall error.
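The smoothing step mentioned in the abstract can be illustrated with a minimal sketch of linearly interpolated tri-gram tag probabilities, P(t3 | t1, t2) = l1·P(t3) + l2·P(t3 | t2) + l3·P(t3 | t1, t2). This is not the authors' implementation: the function names, the `<s>` padding symbol, the toy tag sequences, and the fixed weights (0.1, 0.3, 0.6) are all assumptions made for illustration, and the abstract does not say how the weights were chosen.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start padding tag


def count_ngrams(tag_sequences):
    """Collect uni-, bi- and tri-gram tag counts plus their context counts."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # counts of (t2) and (t1, t2) as contexts
    total = 0
    for tags in tag_sequences:
        padded = [START, START] + list(tags)
        for i in range(2, len(padded)):
            t1, t2, t3 = padded[i - 2], padded[i - 1], padded[i]
            uni[t3] += 1
            bi[(t2, t3)] += 1
            tri[(t1, t2, t3)] += 1
            ctx1[t2] += 1
            ctx2[(t1, t2)] += 1
            total += 1
    return uni, bi, tri, ctx1, ctx2, total


def interpolated_prob(t1, t2, t3, counts, lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolate unigram, bigram and trigram estimates.

    The weights are assumed values and must sum to 1.
    """
    uni, bi, tri, ctx1, ctx2, total = counts
    l1, l2, l3 = lambdas
    p_uni = uni[t3] / total if total else 0.0
    p_bi = bi[(t2, t3)] / ctx1[t2] if ctx1[t2] else 0.0
    p_tri = tri[(t1, t2, t3)] / ctx2[(t1, t2)] if ctx2[(t1, t2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


# Toy tag sequences (invented, not taken from the paper's corpus).
toy_corpus = [["N", "V"], ["N", "N", "V"]]
toy_counts = count_ngrams(toy_corpus)
print(interpolated_prob(START, "N", "V", toy_counts))
```

Because the unigram term is non-zero for any tag seen in training, a tri-gram that never occurred in the corpus still receives a small positive probability, which keeps the Viterbi search from discarding a path outright.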

Keywords

hidden Markov models; interpolation; learning (artificial intelligence); natural language processing; speech recognition; HMM; POS tagger; Sinhala language; Viterbi algorithm; hidden Markov model; learning; lexical items; linear interpolation; named entity recognizer; part-of-speech tagger; stochastic approach; tagging model; training model; tri-gram probabilities; Accuracy; Hidden Markov models; Probability; Speech; Stochastic processes; Tagging; Training; Hidden Markov Model; Linear Interpolation; Part of speech tagging; Sinhala language; Viterbi algorithm;

fLanguage

English

Publisher

IEEE

Conference_Titel

Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on

Conference_Location

Colombo

Print_ISBN

978-1-4799-1275-9

Type

conf

DOI

10.1109/ICTer.2013.6761168

Filename

6761168