• DocumentCode
    3100810
  • Title

    Identification of Document Language is Not yet a Completely Solved Problem

  • Author

    da Silva, Joaquim Ferreira ; Lopes, Gabriel Pereira

  • Author_Institution
    Dept. de Informática, Univ. Nova de Lisboa, Caparica
  • fYear
    2006
  • fDate
    Nov. 28 2006-Dec. 1 2006
  • Firstpage
    212
  • Lastpage
    212
  • Abstract
    Existing Language Identification (LID) approaches reach 100% precision in most common situations, when dealing with documents written in just one language, and when those documents are large enough. However, LID approaches do not provide a reliable solution for some situations: when there is a need to discriminate the correct variant of the language used in a text, for example, European or Brazilian variants of Portuguese, UK or USA English variants, or any other language variants. Another hard context occurs with small touristic advertisements on the web, addressing foreigners but using the local language to name most local entities. In this paper, we present a fully statistics-based LID approach which learns the most discriminant information according to each context, and identifies the correct language or language variant a text is written in. This methodology is shown to be correct for normal texts and maintains its robustness in hard LID contexts.
  • Keywords
    Internet; natural language processing; text analysis; Brazilian variants; European variants; Portuguese variants; UK variants; USA English variants; World Wide Web; discriminant information; document language; language identification; language variants; touristic advertisements; Computational intelligence; Contracts; Europe; Frequency; Natural languages; Probability; Robustness; Statistics; Text categorization; Web pages;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Computational Intelligence for Modelling, Control and Automation, 2006 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    0-7695-2731-0
  • Type

    conf

  • DOI
    10.1109/CIMCA.2006.117
  • Filename
    4052828