Title :
Semantic Similarity Measurements for Multi-lingual Short Texts Using Wikipedia
Author :
Nakamura, T. ; Shirakawa, Masumi ; Hara, Tenshi ; Nishio, Shojiro
Author_Institution :
Dept. of Multimedia Eng., Osaka Univ., Suita, Japan
Abstract :
In this paper, we propose two methods that use Wikipedia to measure the semantic similarity of multi-lingual short texts. In recent years, people around the world have been continuously generating information about their local areas in their own languages on social networking services. Measuring the similarity between these texts is challenging because they are often short and written in various languages. Our methods address this problem by incorporating the inter-language links of Wikipedia into extended naive Bayes (ENB), a probabilistic method of semantic similarity measurement for short texts. The proposed methods represent a multi-lingual short text as a vector of English Wikipedia articles (entities). We conducted an experiment on clustering tweets written in four languages (English, Spanish, Japanese and Arabic). The experimental results confirmed that our methods outperformed cross-lingual explicit semantic analysis (CL-ESA), a method for measuring the similarity between texts written in two different languages. Moreover, our methods were competitive with ENB applied to texts that had been translated into English using Google Translate. Our methods thus enable similarity measurement for multi-lingual short texts without the cost of machine translation.
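The idea of representing texts from different languages as vectors over English Wikipedia entities can be illustrated with a minimal sketch. This is not the ENB method itself: the entity weights are assumed to be given, the `INTERLANGUAGE_LINKS` table and the helper names `to_english_vector` and `cosine_similarity` are hypothetical, and cosine similarity is used here only as a plausible stand-in for comparing the resulting vectors.

```python
import math
from collections import Counter

# Hypothetical inter-language link table:
# (language code, local article title) -> English article title.
# The entries are illustrative only, not taken from the paper.
INTERLANGUAGE_LINKS = {
    ("es", "Terremoto"): "Earthquake",
    ("ja", "地震"): "Earthquake",
    ("es", "Japón"): "Japan",
    ("ja", "日本"): "Japan",
    ("es", "Fútbol"): "Association football",
}


def to_english_vector(lang, entity_weights):
    """Map a {local entity title: weight} dict onto English entities."""
    vector = Counter()
    for title, weight in entity_weights.items():
        if lang == "en":
            english = title
        else:
            english = INTERLANGUAGE_LINKS.get((lang, title))
        # Entities without an English counterpart are dropped (an assumption).
        if english is not None:
            vector[english] += weight
    return dict(vector)


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Assumed entity weights for three short texts (made up for illustration).
spanish_quake = to_english_vector("es", {"Terremoto": 0.7, "Japón": 0.3})
japanese_quake = to_english_vector("ja", {"地震": 0.7, "日本": 0.3})
spanish_soccer = to_english_vector("es", {"Fútbol": 1.0})

print(cosine_similarity(spanish_quake, japanese_quake))
print(cosine_similarity(spanish_quake, spanish_soccer))
```

Because both earthquake texts map to the same English entities, they become directly comparable even though they share no surface words, while the unrelated text shares no entities with them.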
Keywords :
Bayes methods; natural language processing; social networking (online); text analysis; Arabic language; CL-ESA; ENB; English language; English version; Google Translate; Japanese language; Spanish language; Wikipedia articles; Wikipedia entities; cross-lingual explicit semantic analysis; extended naive Bayes; interlanguage links; multilingual short text; probabilistic method; semantic similarity measurements; social networking services; tweets clustering; vector; Electronic publishing; Encyclopedias; Internet; Probabilistic logic; Semantics; Vectors;
Conference_Title :
Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on
Conference_Location :
Warsaw
DOI :
10.1109/WI-IAT.2014.76