• DocumentCode
    1330866
  • Title

    Reducing the Loss of Information through Annealing Text Distortion

  • Author

    Granados, Ana ; Cebrian, Manuel ; Camacho, David ; de Borja Rodriguez, Francisco

  • Author_Institution
    Escuela Politécnica Super., Univ. Autonoma de Madrid, Madrid, Spain
  • Volume
    23
  • Issue
    7
  • fYear
    2011
  • fDate
    7/1/2011 12:00:00 AM
  • Firstpage
    1090
  • Lastpage
    1102
  • Abstract
    Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show how progressively removing words in such a way that the complexity of a document is slowly reduced helps the compression-based text clustering and improves its accuracy. In fact, we show how the nondistorted text clustering can be improved by means of annealing text distortion. The experimental results shown in this paper are consistent using different data sets, and different compression algorithms belonging to the most important compression families: Lempel-Ziv, Statistical and Block-Sorting.
  • Keywords
    data mining; pattern clustering; text analysis; Lempel-Ziv compression; annealing text distortion; block-sorting compression; compression distances; compression families; compression-based text clustering; data sets; information distortion; knowledge discovery; nondistorted text clustering; statistical compression; Clustering algorithms; Complexity theory; Compression algorithms; Data compression; Distortion measurement; Information analysis; Upper bound; Information distortion; Kolmogorov complexity; clustering by compression; data compression; normalized compression distance
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2010.173
  • Filename
    5582094