• DocumentCode
    2651791
  • Title
    Similarity Calculation with Length Delimiting Dictionary Distance
  • Author
    Burkovski, Andre ; Klenk, Sebastian ; Heidemann, Gunther
  • Author_Institution
    Dept. for Intell. Syst., Univ. of Stuttgart, Stuttgart, Germany
  • fYear
    2011
  • fDate
    7-9 Nov. 2011
  • Firstpage
    856
  • Lastpage
    864
  • Abstract
    The Normalized Compression Distance (NCD) has gained considerable interest in pattern recognition as a similarity measure applicable to unstructured data of very different domains, such as text, DNA sequences, or images. NCD uses existing compression programs such as gzip to compute similarity between objects. NCD has unique features: It does not require any prior knowledge, data preprocessing, feature extraction, domain adaptation or any parameter settings. Further, the NCD can be applied to symbolic data and raw signals alike. In this paper we decompose the NCD and introduce a method to measure compression-based similarity without the need to use compression. The Length Delimiting Dictionary Distance (LD3) takes the one component essential in compression methods, the dictionary generation, and strips the NCD of all dispensable components. The LD3 performs "compression based pattern recognition without compression", keeping all of the above benefits of the NCD while achieving better speed and recognition rates. We first review the NCD, introduce LD3 as the "essence" of NCD, and evaluate the LD3 based on language tree experiments, authorship recognition, and genome phylogeny data.
  • Keywords
    data mining; dictionaries; pattern recognition; trees (mathematics); NCD; compression-based similarity; feature extraction; genome phylogeny data; language tree experiments; length delimiting dictionary distance; normalized compression distance; parameter-free data mining; Complexity theory; Compression algorithms; Compressors; Image coding; Measurement; dictionary-based compression; similarity metric
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on
  • Conference_Location
    Boca Raton, FL
  • ISSN
    1082-3409
  • Print_ISBN
    978-1-4577-2068-0
  • Electronic_ISBN
    1082-3409
  • Type
    conf
  • DOI
    10.1109/ICTAI.2011.133
  • Filename
    6103424