DocumentCode
256462
Title
Language identification: A new fast algorithm to identify the language of a text in a multilingual corpus
Author
Gadri, Said ; Moussaoui, Abdelouahab ; Belabdelouahab-Fernini, Linda
Author_Institution
Dept. of ICST, Univ. of M'sila, M'sila, Algeria
fYear
2014
fDate
14-16 April 2014
Firstpage
321
Lastpage
326
Abstract
Identifying the language of a text is an important preliminary step in categorizing multilingual documents and in information retrieval. This step becomes difficult if the word is taken as the only basic unit of information in a text: word-based segmentation is feasible for languages such as French or English, but very difficult for others such as German, Chinese, and Arabic. In this paper, we present the best-known identification algorithms and propose a new, fast, and effective algorithm based on character n-grams. We also compare its results with those of other algorithms under the two text segmentation approaches: the word-based approach and the n-gram approach.
Keywords
natural language processing; text analysis; Arabic language; Chinese language; English language; French language; German language; information retrieval; language identification; multilingual corpus; multilingual document categorization; n-grams approach; texts segmentation; words approach; Computers; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-gram; Text Mining; machine learning
fLanguage
English
Publisher
IEEE
Conference_Titel
Multimedia Computing and Systems (ICMCS), 2014 International Conference on
Conference_Location
Marrakech
Print_ISBN
978-1-4799-3823-0
Type
conf
DOI
10.1109/ICMCS.2014.6911338
Filename
6911338