DocumentCode :
2891159
Title :
Effects of Similarity Metrics on Document Clustering
Author :
Taghva, Kazem ; Veni, Rushikesh
Author_Institution :
Dept. of Comput. Sci., Univ. of Nevada, Las Vegas, NV, USA
fYear :
2010
fDate :
12-14 April 2010
Firstpage :
222
Lastpage :
226
Abstract :
Document clustering, or unsupervised document classification, is the automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. Many similarity functions, such as the dot product and cosine measures, have been proposed in the literature for this comparison. In this paper, we evaluate the effects of many similarity functions on the k-means clustering algorithm. Based on our analysis, we conclude that Chi-Square works best for the document collection, with efficiency around 80%, followed by the Canberra and Euclidean distances with 70%. The results also indicate that distance metrics such as the Bray-Curtis, Variational, and Trigonometric functions did not produce good results.
Keywords :
document handling; geometry; pattern clustering; software metrics; Canberra distances; Chi-Square; Euclidean distances; cosine measures; document clustering; dot product; k-mean clustering algorithm; similarity functions; similarity metrics; unsupervised document classification; Clustering algorithms; Computer science; Convergence; Euclidean distance; Information retrieval; Information technology; Partitioning algorithms; Unsupervised learning; Clustering; distance function; k-mean; unsupervised learning;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Technology: New Generations (ITNG), 2010 Seventh International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-6270-4
Type :
conf
DOI :
10.1109/ITNG.2010.65
Filename :
5501469