Title :
Effects of Similarity Metrics on Document Clustering
Author :
Taghva, Kazem ; Veni, Rushikesh
Author_Institution :
Dept. of Comput. Sci., Univ. of Nevada, Las Vegas, NV, USA
Abstract :
Document clustering, or unsupervised document classification, is an automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. In the literature, many similarity functions, such as the dot product and cosine measures, have been proposed for this comparison. In this paper, we evaluate the effects of many similarity functions on the k-means clustering algorithm. Based on our analysis, we conclude that Chi-Square works best for the document collection, with efficiency around 80%, followed by Canberra and Euclidean distances at 70%. The results also indicate that distance metrics such as the Bray-Curtis, Variational, and Trigonometric functions did not produce good results.
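As a rough illustration of how the choice of distance function plugs into the assignment step of k-means, the sketch below implements four of the measures named in the abstract over plain term-frequency vectors. The exact formulas used in the paper are not given in this record, so the definitions here (in particular the symmetric form of the chi-square distance) are common textbook variants assumed for illustration, and the function names and the `assign` helper are hypothetical.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two term-frequency vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def canberra(x, y):
    # Sum of |a - b| / (|a| + |b|); terms where both entries are 0 are skipped.
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)

def chi_square(x, y):
    # Assumed variant: sum of (a - b)^2 / (a + b) over non-negative counts.
    return sum((a - b) ** 2 / (a + b)
               for a, b in zip(x, y) if a + b > 0)

def bray_curtis(x, y):
    # Sum of |a - b| divided by sum of |a + b|; defined as 0 for two zero vectors.
    denom = sum(abs(a + b) for a, b in zip(x, y))
    return sum(abs(a - b) for a, b in zip(x, y)) / denom if denom else 0.0

def assign(doc, centroids, dist):
    # k-means assignment step: index of the centroid nearest to doc under dist.
    return min(range(len(centroids)), key=lambda i: dist(doc, centroids[i]))

if __name__ == "__main__":
    # Toy term-frequency vectors, invented for this example only.
    centroids = [[0, 5, 1], [4, 0, 2]]
    doc = [3, 1, 2]
    for dist in (euclidean, canberra, chi_square, bray_curtis):
        print(dist.__name__, assign(doc, centroids, dist))
```

Passing the distance as a parameter keeps the rest of the clustering loop unchanged, which is one simple way to compare several measures on the same collection.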
Keywords :
document handling; geometry; pattern clustering; software metrics; Canberra distances; Chi-Square; Euclidean distances; cosine measures; document clustering; dot product; k-mean clustering algorithm; similarity functions; similarity metrics; unsupervised document classification; Clustering algorithms; Computer science; Convergence; Euclidean distance; Information retrieval; Information technology; Partitioning algorithms; Unsupervised learning; Clustering; distance function; k-mean; unsupervised learning;
Conference_Title :
Information Technology: New Generations (ITNG), 2010 Seventh International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-6270-4
DOI :
10.1109/ITNG.2010.65