DocumentCode :
1026996
Title :
The Google Similarity Distance
Author :
Cilibrasi, Rudi L. ; Vitányi, Paul M B
Author_Institution :
CWI, Amsterdam
Volume :
19
Issue :
3
fYear :
2007
fDate :
3/1/2007
Firstpage :
370
Lastpage :
383
Abstract :
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "a way to search the database." We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87 percent with the expert-crafted WordNet categories.
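The abstract describes deriving a distance between two terms from search-engine page counts. As a minimal sketch, the commonly cited form of the normalized Google distance can be computed as below. The formula is not given in this record and is reproduced from the standard definition, so treat it as an assumption here. The function name `ngd`, its parameters, and the sample counts are hypothetical; `n` stands for the assumed total number of indexed pages.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance from page counts (illustrative sketch).

    fx, fy -- page counts for each term alone (hypothetical inputs)
    fxy    -- page count for pages containing both terms
    n      -- assumed total number of indexed pages
    """
    # If either term, or the pair, never occurs, the distance is taken
    # to be infinite in this sketch.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    # Assumed definition:
    # (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y)))
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Made-up counts, for illustration only.
d_same = ngd(100, 100, 100, 10_000)      # terms that always co-occur
d_far = ngd(1_000, 100, 10, 1_000_000)   # terms that rarely co-occur
```

Smaller values indicate terms that tend to appear on the same pages. A value of 0 arises when the two terms always co-occur.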
Keywords :
classification; database management systems; search engines; support vector machines; Google similarity distance; Kolmogorov complexity; WordNet database; binary classification; database management system; hierarchical classification; hierarchical clustering; search engine; support vector machine; Books; Data mining; Databases; Earth; Natural languages; Painting; Search engines; Support vector machines; Web sites; World Wide Web; Accuracy comparison with WordNet categories; Google code; Google distribution via page hit counts; Google search; Kolmogorov complexity; automatic classification and clustering; automatic meaning discovery using Google; automatic relative semantics; automatic translation; dissimilarity semantic distance; meaning of words and phrases extracted from the Web; normalized Google distance (ngd); normalized compression distance (ncd); normalized information distance (nid); parameter-free data mining; universal similarity metric
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2007.48
Filename :
4072748