DocumentCode
3717227
Title
Scalable k-NN based text clustering
Author
Alessandro Lulli;Thibault Debatty;Matteo Dell'Amico;Pietro Michiardi;Laura Ricci
Author_Institution
University of Pisa, Italy
fYear
2015
Firstpage
958
Lastpage
963
Abstract
Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
Keywords
"Approximation algorithms","Clustering algorithms","Measurement","Approximation methods","Algorithm design and analysis","Scalability","Electronic mail"
Publisher
IEEE
Conference_Titel
Big Data (Big Data), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/BigData.2015.7363845
Filename
7363845