Title :
Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud
Author :
Esteves, Rui Máximo ; Rong, Chunming
Author_Institution :
Dept. of Electr. & Comput. Eng., Univ. of Stavanger, Stavanger, Norway
Date :
Nov. 29 2011-Dec. 1 2011
Abstract :
This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the use of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. In this ongoing research, we found that on a noisy dataset fuzzy c-means can lead to worse cluster quality than k-means. K-means does not always converge faster than fuzzy c-means. We also found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. From our experience, the use of Apache Mahout is premature.
Keywords :
Web sites; cloud computing; document handling; fuzzy set theory; pattern clustering; Apache Mahout; Hadoop; Wikipedia latest article clustering; artificial datasets; cluster quality; free cloud computing solution; fuzzy c-means clustering; k-means clustering; noisy realistic dataset; real document clustering; Clustering algorithms; Convergence; Electronic publishing; Encyclopedias; Internet; Vectors; Mahout; document clustering; fuzzy c-means; k-means;
Conference_Title :
Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on
Conference_Location :
Athens
Print_ISBN :
978-1-4673-0090-2
DOI :
10.1109/CloudCom.2011.86