Title :
A non deterministic Indonesian stemmer
Author :
Purwarianti, Ayu
Author_Institution :
Sch. of Electr. Eng. & Inf., Bandung Inst. of Technol., Bandung, Indonesia
Abstract :
A stemmer is a basic natural language processing tool that is widely used in text-based applications such as information retrieval and question answering engines. Existing Indonesian stemmers return only one result word, a deterministic approach, even though the problem itself is non-deterministic. These algorithms apply only the first matching morphological rule defined in the system. This gives inaccurate results in two cases: words with more than one candidate result word (such as “perbaikan”, with “per - an” or “per - kan”) and words with more than one possible affix combination (such as “beruang” or “mereka”). To handle these cases, this research proposes a more accurate stemmer that employs a non-deterministic algorithm, which produces more than one candidate word and more than one affix combination. The result does not depend on the order of the morphological rules: all rules are checked and the resulting words are kept in a candidate list. To keep the stemmer efficient, two word lists (vocabularies) are used: a list of words that have more than one candidate word, and a list of root words used as a candidate reference. The final results are selected with several heuristic rules. Experiments showed that the proposed approach achieves higher accuracy than the two best-known Indonesian stemmers.
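The candidate-list idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the affix lists, the root dictionary, and the function name candidate_stems are all hypothetical and chosen only to cover the three example words. The heuristic rules that select the final result are not specified in the abstract, so the sketch stops at the candidate list.

```python
# Toy affix lists and root dictionary: illustrative assumptions only,
# not the rule set or vocabulary used in the paper.
PREFIXES = ["", "ber", "me", "per"]
SUFFIXES = ["", "an", "kan"]
ROOTS = {"baik", "uang", "ruang", "beruang", "reka", "mereka"}


def candidate_stems(word):
    """Return every dictionary root reachable by any prefix/suffix pair.

    Every combination is tried, so the result does not depend on the
    order in which the affixes are listed.
    """
    candidates = set()
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in SUFFIXES:
            if suffix and not word.endswith(suffix):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            if stem in ROOTS:
                candidates.add(stem)
            # "ber" + "ruang" is written "beruang": restore the dropped "r".
            if prefix == "ber" and "r" + stem in ROOTS:
                candidates.add("r" + stem)
    return sorted(candidates)


if __name__ == "__main__":
    for w in ("perbaikan", "beruang", "mereka"):
        print(w, candidate_stems(w))
```

With this toy dictionary, “perbaikan” keeps only the “per - an” analysis because “bai” is not a listed root, while “beruang” and “mereka” each keep several candidates. A first-fit stemmer would stop at whichever rule matched first.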
Keywords :
deterministic algorithms; natural language processing; text analysis; vocabulary; affix combination; information retrieval; natural language processing tool; nondeterministic Indonesian stemmer; question answering engine; text based application; word list; Accuracy; Algorithm design and analysis; Complexity theory; Compounds; Dictionaries; Morphology; Indonesian stemmer; morphologically ambiguous word; non deterministic
Conference_Title :
Electrical Engineering and Informatics (ICEEI), 2011 International Conference on
Conference_Location :
Bandung
Print_ISBN :
978-1-4577-0753-7
DOI :
10.1109/ICEEI.2011.6021829