Title :
On Using Metadata and Compression Algorithms to Cluster Heterogeneous Documents from a Semantic Point of View
Author :
Cernian, Alexandra ; Carstoiu, Dorin ; Sgarciu, Valentin
Author_Institution :
Fac. of Autom. Control & Comput. Sci., Politeh. Univ. of Bucharest, Bucharest, Romania
Abstract :
Since data is becoming increasingly unstructured, clustering heterogeneous data is essential to returning structured information in response to user queries. In this paper, we test and validate the results of a new clustering technique - clustering by compression - when applied to metadata associated with heterogeneous sets of documents. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). Experimental results show that using metadata could improve average clustering performance by about 10% over clustering the same sample data set without metadata.
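The abstract describes the NCD only in words. A minimal sketch of the standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length, might look as follows. It assumes zlib as the compressor, since the abstract does not name the one the authors used, and the helper names and the sample metadata strings are hypothetical illustrations, not data from the paper.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses the compressed lengths of x and y singly and of their
    concatenation, as described in the abstract.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values, the usual input to a clustering step."""
    return [[ncd(a, b) for b in items] for a in items]


if __name__ == "__main__":
    # Hypothetical metadata records, made up for illustration only.
    records = [
        b"title: clustering by compression; keywords: metadata, NCD",
        b"title: compression-based clustering; keywords: NCD, metadata",
        b"title: holiday photo album; keywords: beach, summer, family",
    ]
    for row in distance_matrix(records):
        print(["%.3f" % d for d in row])
```

Smaller values indicate more similar inputs. With a real compressor the distance of an item to itself is close to, but generally not exactly, zero, and very short inputs give noisy values because of fixed compressor overhead.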
Keywords :
document handling; meta data; pattern clustering; NCD; cluster heterogeneous documents; clustering technique; compression algorithms; compression procedure; metadata; normalized compression distance; pairwise concatenation; semantic point of view; structured information; user queries; clustering by compression; heterogeneous data
Conference_Title :
Software Engineering Advances (ICSEA), 2010 Fifth International Conference on
Conference_Location :
Nice
Print_ISBN :
978-1-4244-7788-3
Electronic_ISBN :
978-0-7695-4144-0
DOI :
10.1109/ICSEA.2010.36