• DocumentCode
    256462
  • Title
    Language identification: A new fast algorithm to identify the language of a text in a multilingual corpus
  • Author
    Gadri, Said ; Moussaoui, Abdelouahab ; Belabdelouahab-Fernini, Linda
  • Author_Institution
    Dept. of ICST, Univ. of M'sila, M'sila, Algeria
  • fYear
    2014
  • fDate
    14-16 April 2014
  • Firstpage
    321
  • Lastpage
    326
  • Abstract
    Identifying the language of a text is an important preliminary phase in the categorization of multilingual documents and in information retrieval. This phase becomes difficult if only the word is considered as the basic unit of information in texts: word-based segmentation is feasible for languages such as French or English, but very difficult for others such as German, Chinese, and Arabic. In this paper, we present the best-known identification algorithms and propose a new fast and effective algorithm based on character n-grams. We also compare the obtained results with those of other algorithms under the two approaches to text segmentation: the word approach and the n-gram approach.
  • Keywords
    natural language processing; text analysis; Arabic language; Chinese language; English language; French language; German language; information retrieval; language identification; multilingual corpus; multilingual document categorization; n-grams approach; texts segmentation; words approach; Computers; Educational institutions; Pragmatics; Probability; Text categorization; Text recognition; Training; N-gram; Text Mining; language identification; machine learning; text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia Computing and Systems (ICMCS), 2014 International Conference on
  • Conference_Location
    Marrakech
  • Print_ISBN
    978-1-4799-3823-0
  • Type
    conf
  • DOI
    10.1109/ICMCS.2014.6911338
  • Filename
    6911338