• DocumentCode
    1791602
  • Title

    Bootstrapping K-means for big data analysis

  • Author

    Jungkyu Han ; Min Luo

  • Author_Institution
    Software Innovation Center, Nippon Telegraph & Telephone, Tokyo, Japan
  • fYear
    2014
  • fDate
    27-30 Oct. 2014
  • Firstpage
    591
  • Lastpage
    596
  • Abstract
    In recent years, “Big data” has become a popular term in industry. Distributed data processing middleware such as Hadoop enables companies to extract useful information from their big data. However, retrieving information from newly available big data is difficult even with the aid of distributed data processing, because the lack of prior knowledge about the data means the task requires many cycles of hypothesis formulation and testing. The k-means algorithm is a popular choice for the early stages of data mining because of its speed and unsupervised nature. With big data, however, even the k-means algorithm is not fast enough to produce a desired result within an expected time period. In this paper, we propose a fast k-means method based on the statistical bootstrapping technique. The proposed method achieves a roughly 100-fold speedup with similar accuracy compared to the Lloyd algorithm, which is the most popular k-means algorithm in industry.
  • Keywords
    Big Data; data analysis; data mining; statistical analysis; Big Data analysis; Hadoop; K-means algorithm; Lloyd algorithm; distributed data processing middleware; information extraction; information retrieval; statistical bootstrapping technique; Accuracy; Algorithm design and analysis; Approximation algorithms; Clustering algorithms; Sociology; Statistics; Bootstrapping; Bootstrap; Clustering; k-means;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (Big Data), 2014 IEEE International Conference on
  • Conference_Location
    Washington, DC
  • Type

    conf

  • DOI
    10.1109/BigData.2014.7004279
  • Filename
    7004279