• DocumentCode
    3717227
  • Title

    Scalable k-NN based text clustering

  • Author

    Alessandro Lulli;Thibault Debatty;Matteo Dell'Amico;Pietro Michiardi;Laura Ricci

  • Author_Institution
    University of Pisa, Italy
  • fYear
    2015
  • Firstpage
    958
  • Lastpage
    963
  • Abstract
    Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns, as well as identifying common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach for text clustering that builds an approximate k-NN graph, which is then used to compute connected components representing clusters. Our focus is to understand the scalability / accuracy tradeoff that underlies our method: we do so through an extensive experimental campaign, where we use real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
  • Keywords
    "Approximation algorithms","Clustering algorithms","Measurement","Approximation methods","Algorithm design and analysis","Scalability","Electronic mail"
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (Big Data), 2015 IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/BigData.2015.7363845
  • Filename
    7363845