DocumentCode
476745
Title
Unigram language identifications using adaptive neural network
Author
Selamat, Ali ; Ng, Choon-Ching
Author_Institution
Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
Volume
2
fYear
2008
fDate
26-28 Aug. 2008
Firstpage
1
Lastpage
5
Abstract
In general, a web page may contain several scripts, and each script can be used to write different languages. Determining the languages of a document is required in order to apply many search and information retrieval techniques effectively. In this work, we propose a hybrid-grams feature selection method that integrates unigrams and bigrams. The method uses local statistical information within a document to determine its language. In our experiments, hybrid-grams outperformed unigrams and bigrams in identifying languages written in Cyrillic and Indic scripts.
Keywords
Adaptive systems; Computer science; Data mining; Electronic mail; Encoding; Feature extraction; Information retrieval; Internet; Natural languages; Statistics
fLanguage
English
Publisher
ieee
Conference_Titel
Information Technology, 2008. ITSim 2008. International Symposium on
Conference_Location
Kuala Lumpur, Malaysia
Print_ISBN
978-1-4244-2327-9
Electronic_ISBN
978-1-4244-2328-6
Type
conf
DOI
10.1109/ITSIM.2008.4631694
Filename
4631694