NLTK tagger for Albanian using iterative approach

Author

Kadriu, A.

Author_Institution

South East Eur. Univ., Tetove, Macedonia

fYear

2013

fDate

24-27 June 2013

Firstpage

283

Lastpage

288

Abstract

This paper presents research on a tagging model for Albanian texts, built with the NLTK toolkit. The model cascades three taggers with backoff. We use a dictionary of around 32000 words together with their corresponding POS tags, as well as a set of regular-expression rules. A lemmatization module is implemented to convert nouns and verbs to their lemmas. The text is first tagged with a unigram tagger based on the dictionary. This tagger serves as the baseline tagger for a regular-expression tagger. Words that were lemmatized incorrectly are then corrected, which yields a third, lookup tagger. This tagger is used with the first and second taggers as backoff.
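The cascade described in the abstract can be illustrated with a minimal plain-Python sketch. It is not the paper's implementation: the class names, the tag labels (N, V, ADJ, ADV, CONJ, NUM), the toy dictionary entries, the single regular-expression rule, and the order of the chain (corrections first, then dictionary, then rules) are all assumptions made for illustration. In NLTK itself, a comparable chain would typically be built by passing one tagger to another through the `backoff` argument of classes such as `UnigramTagger` and `RegexpTagger`.

```python
import re


class LookupTagger:
    """Hypothetical word -> tag lookup that defers to a backoff tagger."""

    def __init__(self, table, backoff=None):
        self.table = table
        self.backoff = backoff

    def tag_word(self, word):
        tag = self.table.get(word.lower())
        if tag is None and self.backoff is not None:
            return self.backoff.tag_word(word)
        return tag


class RegexTagger:
    """Hypothetical rule tagger: the first matching pattern decides the tag."""

    def __init__(self, patterns, backoff=None):
        self.patterns = [(re.compile(p), t) for p, t in patterns]
        self.backoff = backoff

    def tag_word(self, word):
        for regex, tag in self.patterns:
            if regex.match(word):
                return tag
        if self.backoff is not None:
            return self.backoff.tag_word(word)
        return None


def tag(tagger, tokens):
    """Tag every token; None marks a word that no tagger could handle."""
    return [(token, tagger.tag_word(token)) for token in tokens]


# Toy stand-in for the roughly 32000-word dictionary (entries are illustrative).
dictionary = {"libri": "N", "është": "V", "dhe": "CONJ", "mirë": "ADJ"}

# Toy stand-in for the regular-expression rules (only a number rule here).
patterns = [(r"^\d+$", "NUM")]

# Toy stand-in for the correction table; it overrides the dictionary entry.
corrections = {"mirë": "ADV"}

# Assumed order: corrections -> dictionary -> regular-expression rules.
regex_tagger = RegexTagger(patterns)
dictionary_tagger = LookupTagger(dictionary, backoff=regex_tagger)
cascade = LookupTagger(corrections, backoff=dictionary_tagger)

result = tag(cascade, ["Libri", "është", "mirë", "2013", "xyz"])
print(result)
```

Each tagger answers only for the words it knows and passes everything else down the chain, so the correction table can override a dictionary entry without modifying the dictionary itself.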

Keywords

dictionaries; iterative methods; natural language processing; text analysis; Albanian language; Albanian text; NLTK tagger; NLTK toolkit; POS tags; dictionary; iterative approach; lemmatize module; lemmatized words; lookup tagger; nouns; regular expressions rules; regular expressions tagger; taggers cascading; tagging model; text tagging; unigram tagger; verbs; Accuracy; Dictionaries; Economics; Hidden Markov models; Mood; Tagging; Training; Albanian language; NLTK; POS tagging;

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology Interfaces (ITI), Proceedings of the ITI 2013 35th International Conference on

Conference_Location

Cavtat

ISSN

1334-2762

Print_ISBN

978-953-7138-30-1

Type

conf

DOI

10.2498/iti.2013.0565

Filename

6649039