DocumentCode :
174895
Title :
Robust Language Identification of Noisy Texts: Proposal of Hybrid Approaches
Author :
Abainia, K. ; Ouamour, S. ; Sayoud, H.
Author_Institution :
USTHB Univ., Algiers, Algeria
fYear :
2014
fDate :
1-5 Sept. 2014
Firstpage :
228
Lastpage :
232
Abstract :
This paper deals with the problem of automatic language identification of noisy texts, which is an important task in natural language processing. Several existing works in this field are based on statistical and machine learning approaches for different categories of texts. Unfortunately, most of the proposed methods work well on clean and/or long texts, but often fail when the text is corrupted or too short. In this research work, we use a dataset of short texts collected from several discussion forums and containing several types of noise. Our dataset covers 32 different languages, some of which are quite different while others are very close to one another. In this investigation, we propose two types of methods to identify the text language: a term-based method and a character-based method. Moreover, we propose two hybrid methods to enhance the performance of those techniques. Experiments show that the proposed hybrid methods achieve good language identification performance on noisy texts.
Keywords :
natural language processing; text analysis; automatic language identification; character-based method; natural language processing; noisy texts; term-based method; Conferences; Databases; Expert systems; Automatic Language Identification; Hybrid Approach; Natural Language Processing; Noisy Text; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
Conference_Location :
Munich
ISSN :
1529-4188
Print_ISBN :
978-1-4799-5721-7
Type :
conf
DOI :
10.1109/DEXA.2014.55
Filename :
6974854