• DocumentCode
    2379421
  • Title
    A quality-threshold data summarization algorithm
  • Author
    Ha-Thuc, Viet ; Nguyen, Duc-Cuong ; Srinivasan, Padmini
  • Author_Institution
    Comput. Sci. Dept., Univ. of Iowa, Iowa City, IA
  • fYear
    2008
  • fDate
    13-17 July 2008
  • Firstpage
    240
  • Lastpage
    246
  • Abstract
    As database sizes increase, semantic data summarization techniques have been developed so that data mining algorithms can be run efficiently on a summarized set instead of the full data. Clustering algorithms such as K-Means are widely used for semantic summarization, with the cluster centers forming the summarized set. The goal of semantic summarization is to provide a summarized view of the original dataset that maximizes the summarization ratio while minimizing the error (i.e., information loss). This paper presents a new clustering-based data summarization algorithm in which the quality of the summarized set can be controlled. The algorithm keeps partitioning a dataset into clusters until the distortion of each cluster falls below a given threshold, thus guaranteeing that the information loss of the summarized set stays below a fixed bound. The number of clusters is determined automatically from the threshold. Unlike traditional K-Means, the proposed algorithm adjusts initial centers based on the information about the data space discovered so far, which significantly alleviates the local optimum effect. Our experiments show that the algorithm generates higher-quality clusters than K-Means and also guarantees an error bound, an essential criterion for data summarization.
  • Keywords
    data handling; data mining; pattern clustering; K-means algorithm; clustering-based data summarization algorithm; data mining algorithm; quality-threshold data summarization; semantic data summarization technique; Automatic control; Cities and towns; Clustering algorithms; Computer science; Data engineering; Data mining; Databases; Information science; Libraries; Partitioning algorithms; Data Summarization (or Compression); K-Means Clustering
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Research, Innovation and Vision for the Future, 2008. RIVF 2008. IEEE International Conference on
  • Conference_Location
    Ho Chi Minh City
  • Print_ISBN
    978-1-4244-2379-8
  • Electronic_ISBN
    978-1-4244-2380-4
  • Type
    conf
  • DOI
    10.1109/RIVF.2008.4586362
  • Filename
    4586362
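
The abstract describes partitioning a dataset until every cluster's distortion falls below a threshold, with the number of clusters following from that threshold. The sketch below illustrates that general idea only and is not the authors' algorithm. Several things are assumptions not stated in the record: distortion is taken to be the mean squared Euclidean distance of a cluster's points to its center, a new center is seeded at the farthest point of the worst cluster (a simple stand-in for the paper's center-adjustment step), and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Group each point with its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def lloyd(points, centers, max_iter=100):
    """Standard K-Means (Lloyd) iterations starting from the given centers."""
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def distortion(cluster, center):
    """ASSUMPTION: distortion = mean squared distance to the center."""
    if not cluster:
        return 0.0
    return sum(sq_dist(p, center) for p in cluster) / len(cluster)


def summarize(points, threshold):
    """Add centers until every cluster's distortion is below the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    centers = [mean_point(points)]
    while True:
        centers, clusters = lloyd(points, centers)
        scores = [distortion(c, m) for c, m in zip(clusters, centers)]
        worst = max(range(len(centers)), key=lambda i: scores[i])
        # Stop when the bound holds, or when there are as many centers as points.
        if scores[worst] < threshold or len(centers) >= len(points):
            return centers, clusters
        # ASSUMPTION: seed the new center at the worst cluster's farthest point.
        far = max(clusters[worst], key=lambda p: sq_dist(p, centers[worst]))
        centers = centers + [far]


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = summarize(points, threshold=1.0)
print(len(centers), sorted(centers))
```

The caller supplies only the threshold, and the cluster count comes out of the loop. The returned centers play the role of the summarized set.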