Language identification: A new fast algorithm to identify the language of a text in a multilingual corpus

Author

Gadri, Said ; Moussaoui, Abdelouahab ; Belabdelouahab-Fernini, Linda

Author_Institution

Dept. of ICST, Univ. of M'sila, M'sila, Algeria

fYear

2014

fDate

14-16 April 2014

Firstpage

321

Lastpage

326

Abstract

Identifying the language of a text is an important preliminary phase in the categorization of multilingual documents and in information retrieval. This phase becomes difficult if the word is taken as the basic unit of information in texts: word-based segmentation is feasible for languages such as French or English, but very difficult for others such as German, Chinese, and Arabic. In this paper, we present the best-known identification algorithms and propose a new, fast, and effective algorithm based on character n-grams. We also compare the results obtained with those of other algorithms under the two text segmentation approaches: the word approach and the n-gram approach.
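
To make the two segmentation approaches named in the abstract concrete, the following is a minimal Python sketch, not the authors' algorithm (the abstract does not describe it). The function names `word_tokens`, `char_ngrams`, and `identify`, the trigram default `n=3`, and the overlap-count scoring are all assumptions chosen for illustration.

```python
from collections import Counter


def word_tokens(text):
    """Word approach: lowercase the text and split it on whitespace."""
    return text.lower().split()


def char_ngrams(text, n=3):
    """N-gram approach: slide a window of n characters over the text.

    Whitespace runs are collapsed to a single space first, so the
    n-grams also capture word boundaries.
    """
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def identify(text, training, n=3):
    """Hypothetical scorer: return the language whose training text
    shares the most character n-gram occurrences with `text`.

    `training` maps a language label to a sample text in that language.
    """
    query = Counter(char_ngrams(text, n))

    def overlap(lang):
        ref = Counter(char_ngrams(training[lang], n))
        # Count n-gram occurrences common to the query and the reference.
        return sum(min(count, ref[gram]) for gram, count in query.items())

    return max(training, key=overlap)


if __name__ == "__main__":
    # Toy training samples (illustrative only, far too small for real use).
    training = {
        "en": "the quick brown fox jumps over the lazy dog and the cat",
        "fr": "le renard brun saute par dessus le chien paresseux et le chat",
    }
    print(word_tokens("Hello World"))
    print(char_ngrams("abcd", 2))
    print(identify("the dog and the fox", training))
    print(identify("le chien et le chat", training))
```

Character n-grams need no word boundaries, which is why this kind of segmentation also applies to text where splitting on whitespace is unreliable.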

Keywords

natural language processing; text analysis; Arabic language; Chinese language; English language; French language; German language; information retrieval; language identification; multilingual corpus; multilingual document categorization; n-grams approach; texts segmentation; words approach; Computers; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-gram; Text Mining; machine learning

fLanguage

English

Publisher

IEEE

Conference_Titel

Multimedia Computing and Systems (ICMCS), 2014 International Conference on

Conference_Location

Marrakech

Print_ISBN

978-1-4799-3823-0

Type

conf

DOI

10.1109/ICMCS.2014.6911338

Filename

6911338