Shallow Parsing for Hindi - An extensive analysis of sequential learning algorithms using a large annotated corpus

Author

Gahlot, Himanshu ; Krishnarao, Awaghad Ashish ; Kushwaha, D.S.

Author_Institution

Motilal Nehru National Institute of Technology, Allahabad

fYear

2009

fDate

6-7 March 2009

Firstpage

1158

Lastpage

1163

Abstract

In this paper, we provide the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi. We analyze the application of three major learning algorithms (maximum entropy models [2] [9], conditional random fields [12], and support vector machines [8]) to part-of-speech tagging and chunking for the Hindi language using datasets of different sizes. The use of language-independent features makes this analysis more general and able to support conclusions for similar South and South East Asian languages. The results show that CRFs outperform SVMs and Maxent in terms of accuracy. We achieve an accuracy of 92.26% for part-of-speech tagging and 93.57% for chunking using the conditional random fields algorithm. The training corpus contains 138177 annotated instances. We report results for the three learning algorithms under varying conditions (clustering, BIEO notation vs. BIES notation, multiclass methods for SVMs, etc.) and present an extensive analysis of the whole process. These results should give future researchers insight into how to shape their research, given the comparative performance of the major algorithms on datasets of various sizes and under various conditions.

Keywords

grammars; learning (artificial intelligence); natural language processing; support vector machines; word processing; Hindi language; South Asian Languages; South East Asian Languages; conditional random fields algorithm; language independent features; large annotated corpus; part-of-speech chunking; part-of-speech tagging; sequential learning algorithms; shallow parsing; Algorithm design and analysis; Clustering algorithms; Entropy; Hidden Markov models; Machine learning; Natural languages; Speech; Stochastic processes; Support vector machines; Tagging

fLanguage

English

Publisher

IEEE

Conference_Title

Advance Computing Conference, 2009. IACC 2009. IEEE International

Conference_Location

Patiala

Print_ISBN

978-1-4244-2927-1

Electronic_ISBN

978-1-4244-2928-8

Type

conf

DOI

10.1109/IADCC.2009.4809178

Filename

4809178