DocumentCode :
256462
Title :
Language identification: A new fast algorithm to identify the language of a text in a multilingual corpus
Author :
Gadri, Said ; Moussaoui, Abdelouahab ; Belabdelouahab-Fernini, Linda
Author_Institution :
Dept. of ICST, Univ. of M'sila, M'sila, Algeria
fYear :
2014
fDate :
14-16 April 2014
Firstpage :
321
Lastpage :
326
Abstract :
Identifying the language of a text is an important preliminary step in categorizing multilingual documents and in information retrieval. This step becomes difficult if the word is taken as the only basic unit of information in a text: word-based segmentation is feasible for languages such as French or English, but very difficult for others such as German, Chinese, and Arabic. In this paper, we present the best-known language identification algorithms and propose a new, fast, and effective algorithm based on character n-grams. We also compare the results obtained with those of other algorithms under the two text segmentation approaches: the word-based approach and the n-gram approach.
Keywords :
natural language processing; text analysis; Arabic language; Chinese language; English language; French language; German language; information retrieval; language identification; multilingual corpus; multilingual document categorization; n-grams approach; texts segmentation; words approach; Computers; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-gram; Text Mining; language identification; machine learning; text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Multimedia Computing and Systems (ICMCS), 2014 International Conference on
Conference_Location :
Marrakech
Print_ISBN :
978-1-4799-3823-0
Type :
conf
DOI :
10.1109/ICMCS.2014.6911338
Filename :
6911338