Title :
Learning a stochastic part of speech tagger for Sinhala
Author :
Jayasuriya, M. ; Weerasinghe, A.R.
Author_Institution :
Virtusa (Pvt) Ltd., Colombo, Sri Lanka
Abstract :
This paper presents the results of developing a part of speech (POS) tagger for Sinhala. The tagger handles lexical items with multiple POS tags and also predicts POS tags for previously unseen words. A stochastic approach, a Hidden Markov Model (HMM) with tri-gram probabilities, is used as the training and tagging model. Linear interpolation is used to smooth the tri-gram probabilities, and the Viterbi algorithm is used to decode the HMM and select the best POS tag for each word. The tagger learns the lexical items (words and their possible POS tags) and the tri-gram probabilities from a POS-tag-annotated corpus. The tagger achieved an overall accuracy of 62%. Approximately 24% of the errors were for words whose POS tags were unknown in the corpus. The lack of a named entity recognizer also contributed to 10% of the overall error.
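The linear interpolation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (train_counts, interpolated_prob), the start marker "<s>", and the fixed weights (0.1, 0.3, 0.6) are assumptions made for illustration only; the abstract does not state how the weights were chosen. The sketch mixes unigram, bigram and tri-gram tag estimates so that a tag sequence never seen in training still receives a non-zero probability.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker, not taken from the paper


def train_counts(tagged_sentences):
    """Collect tag n-gram counts from sentences given as lists of (word, tag) pairs."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # counts of one-tag and two-tag contexts
    for sentence in tagged_sentences:
        tags = [START, START] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
            uni[t3] += 1
            bi[(t2, t3)] += 1
            tri[(t1, t2, t3)] += 1
            ctx1[t2] += 1
            ctx2[(t1, t2)] += 1
    return {"uni": uni, "bi": bi, "tri": tri, "ctx1": ctx1, "ctx2": ctx2}


def interpolated_prob(t1, t2, t3, counts, lambdas=(0.1, 0.3, 0.6)):
    """Estimate P(t3 | t1, t2) as a weighted sum of unigram, bigram and tri-gram estimates.

    The weights are illustrative assumptions and must sum to 1.
    """
    l1, l2, l3 = lambdas
    total = sum(counts["uni"].values())
    p_uni = counts["uni"][t3] / total if total else 0.0
    c1 = counts["ctx1"][t2]
    p_bi = counts["bi"][(t2, t3)] / c1 if c1 else 0.0
    c2 = counts["ctx2"][(t1, t2)]
    p_tri = counts["tri"][(t1, t2, t3)] / c2 if c2 else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```

In a tagger of the kind described, values like these would serve as the transition scores that the Viterbi algorithm combines with per-word tag probabilities; in practice the weights are usually estimated from held-out data rather than fixed by hand.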
Keywords :
hidden Markov models; interpolation; learning (artificial intelligence); natural language processing; speech recognition; HMM; POS tagger; Sinhala language; Viterbi algorithm; hidden Markov model; learning; lexical items; linear interpolation; named entity recognizer; part-of-speech tagger; stochastic approach; tagging model; training model; tri-gram probabilities; Accuracy; Probability; Speech; Stochastic processes; Tagging; Training; Part of speech tagging
Conference_Title :
Advances in ICT for Emerging Regions (ICTer), 2013 International Conference on
Conference_Location :
Colombo
Print_ISBN :
978-1-4799-1275-9
DOI :
10.1109/ICTer.2013.6761168