Author :
Cernian, Alexandra ; Carstoiu, Dorin ; Sgarciu, Valentin
Author_Institution :
Fac. of Autom. Control & Comput. Sci., Politeh. Univ. of Bucharest, Bucharest, Romania
Abstract :
Notice of Violation of IEEE Publication Principles
"Improving Heterogeneous Data Clustering by Using Metadata and Compression Algorithms"
by Alexandra Cernian, Dorin Carstoiu, Valentin Sgarciu,
in the Proceedings of the 2010 Roedunet International Conference (RoEduNet), June 2010, pp. 169-173
After careful and considered review of the content and authorship of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE's Publication Principles.
This paper contains portions of text from the paper(s) cited below. A credit notice is used, but due to the absence of quotation marks or offset text, copied material is not clearly referenced or specifically identified.
"Etude des Methodes de Classification par Compression"
by Tudor Basarab IONESCU,
published in Rapport interne 2005-06-28-DI-FB
http://wwwdi.supelec.fr/fb/download/Articles/Rapport_2005-06-28-DI-FB.pdf
Nowadays, we have to deal with a large quantity of unstructured, heterogeneous data produced by an increasing number of sources. Clustering heterogeneous data is essential to getting structured information in response to user queries. In this paper, we assess the results of a new clustering technique, clustering by compression, when applied to metadata associated with heterogeneous sets of data. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). Experimental results show that using metadata could improve average clustering performance by about 20% over clustering the same sample data set without using metadata.
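As a minimal sketch of the distance described in the abstract, the NCD of two byte strings x and y can be computed from compressed lengths as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length and xy is the concatenation. The record does not say which compressor the paper used; zlib is an assumption here, and the helper names and sample metadata strings are hypothetical.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance from single and pairwise-concatenated lengths."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical metadata records: two similar ones and one unrelated byte pattern.
meta_a = b"title: annual report; author: smith; type: pdf; " * 20
meta_b = b"title: annual review; author: smith; type: pdf; " * 20
unrelated = bytes(range(256)) * 4

if __name__ == "__main__":
    print("similar pair:  ", ncd(meta_a, meta_b))
    print("unrelated pair:", ncd(meta_a, unrelated))
```

Smaller values indicate more similar inputs. With a real compressor the result is only approximately bounded by 0 and 1, and a pairwise matrix of such distances is what a clustering step would then consume.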
Keywords :
data compression; meta data; pattern clustering; compression algorithms; heterogeneous data clustering; metadata; normalized compression distance; sample data set; Automatic control; Clustering algorithms; Compression algorithms; Data mining; Internet; Keyword search; clustering by compression; heterogeneous data