Suffix stripping algorithm for Kannada information retrieval

Author

Hegde, Yashaswini ; Kadambe, Shubha ; Naduthota, Prashantha

Author_Institution

NIEIT, Mysore, India

fYear

2013

Firstpage

527

Lastpage

533

Abstract

With the rapid growth of the internet, a huge amount of data on the web is now available in languages other than English. Hence, it is important to develop Information Retrieval (IR) tools for web search in these languages as well. This is as challenging as developing an IR tool for English, since each language has unique characteristics that must be taken into account. Because of these characteristics, many efficient IR algorithms developed for English cannot be used directly. Here we consider Kannada, a south Indian language, and propose a suffix stripping algorithm for Kannada text available online in Unicode. It is a rule-based approach that strips fourteen major classes of suffixes (pratyaya in Kannada) and some subclasses. It also covers suffixes associated with nouns, verbs, articles, adjectives and stop words. This algorithm will be very useful for ranking Kannada documents (represented as bags of words) by relevance to a query in web searches. It can also be used in Kannada (i) text extraction, (ii) natural language processing tools and (iii) speech recognition engines. We have implemented this suffix stripping stemming algorithm and evaluated retrieval on Kannada documents from “Kendasampige”, a web-based magazine, with and without stemming. We used several metrics for the evaluation. Our results indicate that recall is much better after stemming. These promising preliminary results suggest that the algorithm is applicable to the above-mentioned applications.
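The abstract does not list the actual rules, so the following is only a minimal sketch of what longest-match, rule-based suffix stripping on Unicode Kannada text might look like, and of how recall can be compared with and without stemming. The three suffixes, the `min_stem_len` guard, the function names and the toy documents are all assumptions made for illustration; they are not the fourteen suffix classes or the evaluation setup of the paper.

```python
# Illustrative only: these three suffixes are assumptions chosen for the demo,
# not the fourteen suffix classes defined in the paper.
SUFFIXES = ["ಗಳು", "ಯಲ್ಲಿ", "ಗೆ"]


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len code points."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


def search(query, docs, stem=False):
    """Return ids of documents that share at least one (optionally stemmed) term with the query."""
    norm = strip_suffix if stem else (lambda w: w)
    query_terms = {norm(w) for w in query.split()}
    return {
        doc_id
        for doc_id, text in docs.items()
        if query_terms & {norm(w) for w in text.split()}
    }


def recall(relevant, retrieved):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & retrieved) / len(relevant)


if __name__ == "__main__":
    # Hypothetical one-word documents: three inflected forms of the same noun.
    docs = {1: "ಮನೆಗಳು", 2: "ಮನೆಗೆ", 3: "ಮನೆ"}
    relevant = {1, 2, 3}
    for use_stemming in (False, True):
        hits = search("ಮನೆ", docs, stem=use_stemming)
        print("stemming:", use_stemming, "recall:", recall(relevant, hits))
```

In this toy collection, exact matching retrieves only the uninflected form, while matching on stems retrieves all three documents, which is the kind of recall gain the abstract reports. A real stemmer would need the full rule set and would likely need to handle sandhi changes at the stem boundary, which this sketch ignores.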

Keywords

Internet; Web sites; natural language processing; query processing; text analysis; English language; IR tool; Kannada documents; Kannada information retrieval; Kannada text unicode; Kendasampige; Web based magazine; Web search; Web search query; information retrieval tools; natural language processing tools; pratyaya; rule based approach; south Indian language; speech recognition engines; suffix stripping stemming algorithm; text extraction; Foot; Instruments; Information Retrieval; Kannada; Stemming; Suffix Stripping;

fLanguage

English

Publisher

IEEE

Conference_Titel

Advances in Computing, Communications and Informatics (ICACCI), 2013 International Conference on

Conference_Location

Mysore

Print_ISBN

978-1-4799-2432-5

Type

conf

DOI

10.1109/ICACCI.2013.6637227

Filename

6637227