Title :
A novel approach to build Kannada web Corpus
Author :
Parameswarappa, S. ; Narayana, V.N. ; Bharathi, G.N.
Author_Institution :
Dept. of Comput. Sci. & Eng., Malnad Coll. of Eng., Hassan, India
Abstract :
This paper introduces the Kannada Corpus tool, a suite of Perl (Practical Extraction and Report Language) programs implementing an iterative procedure to build Kannada corpora from the web. The procedure first builds a list of "seed" words and then collects a set of "seed" URLs (Uniform Resource Locators) pointing to documents in the Kannada language by sending queries to commercial search engines (Google and Yahoo). The seed URLs are then used to start a crawling job with the open-source, command-line download tool "wget". The downloaded documents are processed in several steps to build raw Kannada corpora: HTML (HyperText Markup Language) code removal, boilerplate stripping, language identification, and duplicate and near-duplicate detection. We evaluated the tool by applying it to the construction of Kannada corpora from domains such as Recent Discussions, Articles, Recent Activities, Proverbs, Recent Feedback, Poems, Fifteen Books, Novels, Newspapers, Dictionary, Blogs, and Informal Chats. The results illustrate the potential usefulness of the tool.
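The post-processing steps named in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the authors' Perl and is not their implementation. Several things here are assumptions made only for illustration: language identification is approximated by the share of characters in the Unicode Kannada block (U+0C80 to U+0CFF), near-duplicate detection by Jaccard similarity over word 3-grams, and the function names and thresholds are hypothetical. Boilerplate stripping is omitted.

```python
import re
from html.parser import HTMLParser

KANNADA_BLOCK = (0x0C80, 0x0CFF)  # Unicode Kannada block


class _TextExtractor(HTMLParser):
    """Collects text that lies outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Remove markup and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def is_kannada(text, threshold=0.5):
    """Assumed heuristic: enough non-space characters are Kannada."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    lo, hi = KANNADA_BLOCK
    return sum(lo <= ord(c) <= hi for c in chars) / len(chars) >= threshold


def shingles(text, n=3):
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_corpus(pages, dup_threshold=0.8):
    """Keep Kannada pages that are not near-duplicates of kept ones."""
    kept, kept_shingles = [], []
    for html in pages:
        text = strip_html(html)
        if not is_kannada(text):
            continue
        s = shingles(text)
        if any(jaccard(s, t) >= dup_threshold for t in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(s)
    return kept
```

Calling `build_corpus` on a list of downloaded HTML strings returns the cleaned texts that pass both filters. The pairwise comparison is quadratic in the number of kept pages, so a larger crawl would need a cheaper scheme such as hashing of shingles.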
Keywords :
Internet; Perl; document handling; hypermedia markup languages; natural language processing; public domain software; query processing; search engines; Google; HTML code removal; Kannada Web corpus; Kannada corpus tool; Kannada language; Perl programs; Yahoo; boilerplate stripping; crawling job; duplicate detection; hyper text markup language; iterative procedure; language identification; near duplicate detection; open-source command-line based downloading tool; program extraction-and-reporting language; seed URL; seed words list; uniform resource locator; wget; Corpora; Kannada corpus; Part-of-Speech (POS) tagging; Tokenizer; World Wide Web
Conference_Title :
Computer Communication and Informatics (ICCCI), 2012 International Conference on
Conference_Location :
Coimbatore
Print_ISBN :
978-1-4577-1580-8
DOI :
10.1109/ICCCI.2012.6158824