Unigram language identifications using adaptive neural network

Author

Selamat, Ali ; Ng, Choon-Ching

Author_Institution

Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310, Johor Bahru, Malaysia

fYear

2008

fDate

26-28 Aug. 2008

Abstract

In general, a web page may contain several scripts, and each script can be used to write different languages. Determining the languages of a document is required in order to apply many search and information retrieval techniques effectively. In this work, we propose a hybrid-gram feature selection method that integrates unigrams and bigrams. The method uses local statistical information within a document to determine its language. In our experiments, hybrid-grams outperformed unigrams and bigrams in language identification for Cyrillic and Indic scripts.

Keywords

Adaptive systems; Computer science; Data mining; Electronic mail; Encoding; Feature extraction; Information retrieval; Internet; Natural languages; Statistics

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology, 2008. ITSim 2008. International Symposium on

Conference_Location

Kuala Lumpur, Malaysia

Print_ISBN

978-1-4244-2327-9

Electronic_ISBN

978-1-4244-2328-6

Type

conf

DOI

10.1109/ITSIM.2008.4631694

Filename

4631694

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=476745