Title :
Wikipedia-Based Semantic Similarity Measurements for Noisy Short Texts Using Extended Naive Bayes
Author :
Shirakawa, Masumi ; Nakayama, Kotaro ; Hara, Takahiro ; Nishio, Shojiro
Author_Institution :
Dept. of Multimedia Eng., Osaka Univ., Suita, Japan
Abstract :
This paper proposes a Wikipedia-based semantic similarity measurement method that is intended for real-world noisy short texts. Our method is a kind of explicit semantic analysis (ESA), which adds a bag of Wikipedia entities (Wikipedia pages) to a text as its semantic representation and uses the vector of entities for computing the semantic similarity. Adding related entities to a text, not a single word or phrase, is a challenging practical problem because it usually consists of several subproblems, e.g., key term extraction from texts, related entity finding for each key term, and weight aggregation of related entities. Our proposed method solves this aggregation problem using extended naive Bayes, a probabilistic weighting mechanism based on Bayes' theorem. Our method is especially effective when short texts are semantically noisy, i.e., when they contain some terms that are meaningless or misleading for estimating their main topic. Experimental results on Twitter message and Web snippet clustering revealed that our method outperformed ESA for noisy short texts. We also found that reducing the dimension of the vector to representative Wikipedia entities scarcely affected the performance while decreasing the vector size and hence the storage space and the processing time of computing the cosine similarity.
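
The final comparison step described in the abstract (cosine similarity over entity vectors, optionally truncated to the highest-weighted entities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `top_k` and `cosine_similarity`, the sparse dictionary representation, and the example entities and weights are all assumptions made for illustration, and the extended naive Bayes weighting that would produce the weights is not reproduced here.

```python
import math
from typing import Dict

# Assumed representation: a sparse vector mapping a Wikipedia entity
# (page title) to a non-negative weight. The weights would come from
# the aggregation step, which this sketch does not implement.
EntityVector = Dict[str, float]


def top_k(vec: EntityVector, k: int) -> EntityVector:
    """Keep only the k highest-weighted entities (hypothetical helper)."""
    ranked = sorted(vec.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def cosine_similarity(a: EntityVector, b: EntityVector) -> float:
    """Cosine similarity between two sparse entity vectors."""
    # Iterate over the smaller vector when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(entity, 0.0) for entity, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up entity vectors for two short texts.
    text_1 = {"Apple Inc.": 0.6, "IPhone": 0.3, "Fruit": 0.1}
    text_2 = {"Apple Inc.": 0.5, "Steve Jobs": 0.4, "Fruit": 0.1}
    print(cosine_similarity(top_k(text_1, 2), top_k(text_2, 2)))
```

Truncating each vector before comparison shrinks both the stored representation and the number of terms in the dot product, which is the storage and processing-time saving the abstract refers to.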
Keywords :
Bayes methods; Web sites; pattern clustering; probability; text analysis; ESA; Twitter message; Web snippet clustering; Wikipedia entities; Wikipedia-based semantic similarity measurements; cosine similarity; explicit semantic analysis; extended naive Bayes; key term extraction; meaningless terms; misleading terms; noisy short texts; probabilistic weighting mechanism; representative Wikipedia entities; semantic representation; storage space; vector size; Electronic publishing; Encyclopedias; Internet; Noise measurement; Semantics; Vectors; Naive Bayes; Semantic similarity; short text clustering
Journal_Title :
Emerging Topics in Computing, IEEE Transactions on
DOI :
10.1109/TETC.2015.2418716