DocumentCode :
1330859
Title :
A Web Search Engine-Based Approach to Measure Semantic Similarity between Words
Author :
Bollegala, Danushka ; Matsuo, Yutaka ; Ishizuka, Mitsuru
Author_Institution :
Dept. of Electron. & Inf., Univ. of Tokyo, Tokyo, Japan
Volume :
23
Issue :
7
fYear :
2011
fDate :
7/1/2011
Firstpage :
977
Lastpage :
990
Abstract :
Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an empirical method to estimate semantic similarity using page counts and text snippets retrieved from a web search engine for two words. Specifically, we define various word co-occurrence measures using page counts and integrate those with lexical patterns extracted from text snippets. To identify the numerous semantic relations that exist between two given words, we propose a novel pattern extraction algorithm and a pattern clustering algorithm. The optimal combination of page counts-based co-occurrence measures and lexical pattern clusters is learned using support vector machines. The proposed method outperforms various baselines and previously proposed web-based semantic similarity measures on three benchmark data sets showing a high correlation with human ratings. Moreover, the proposed method significantly improves the accuracy in a community mining task.
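The abstract mentions co-occurrence measures computed from page counts but does not say which ones are used. As a minimal sketch, assuming the standard Jaccard, Dice, overlap, and pointwise mutual information coefficients are representative, the following shows how such measures could be derived from three hit counts. The function names, the `min_cooccurrence` threshold, and the `total_pages` parameter are illustrative assumptions, not taken from this record.

```python
import math


def jaccard(count_p, count_q, count_pq, min_cooccurrence=0):
    """Jaccard coefficient from page counts for P, Q, and the query "P AND Q"."""
    # Treat rare co-occurrences as noise (threshold is an assumed parameter).
    if count_pq <= min_cooccurrence:
        return 0.0
    return count_pq / (count_p + count_q - count_pq)


def dice(count_p, count_q, count_pq, min_cooccurrence=0):
    """Dice coefficient from page counts."""
    if count_pq <= min_cooccurrence:
        return 0.0
    return 2 * count_pq / (count_p + count_q)


def overlap(count_p, count_q, count_pq, min_cooccurrence=0):
    """Overlap (Simpson) coefficient from page counts."""
    if count_pq <= min_cooccurrence:
        return 0.0
    return count_pq / min(count_p, count_q)


def pmi(count_p, count_q, count_pq, total_pages, min_cooccurrence=0):
    """Pointwise mutual information; total_pages is an assumed index size."""
    if count_pq <= min_cooccurrence:
        return 0.0
    joint = count_pq / total_pages
    independent = (count_p / total_pages) * (count_q / total_pages)
    return math.log2(joint / independent)


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not search engine results.
    p, q, pq, n = 100, 50, 25, 1000
    print(jaccard(p, q, pq), dice(p, q, pq), overlap(p, q, pq), pmi(p, q, pq, n))
```

In the method the abstract describes, scores of this kind would serve as features alongside the lexical pattern clusters, with a support vector machine learning how to combine them.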
Keywords :
Internet; data mining; information retrieval; pattern clustering; search engines; support vector machines; text analysis; Web search engine; Web-based semantic similarity measures; benchmark data sets; community mining task; lexical pattern clusters; lexical patterns; numerous semantic relations; optimal combination; page counts-based co-occurrence measures; pattern clustering algorithm; pattern extraction algorithm; semantic similarity between words; text snippets; word co-occurrence measures; Context; Semantics; Text processing; Web search; Web mining; information extraction; web text analysis
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2010.172
Filename :
5582093