DocumentCode :
2759921
Title :
On Using Metadata and Compression Algorithms to Cluster Heterogeneous Documents from a Semantic Point of View
Author :
Cernian, Alexandra ; Carstoiu, Dorin ; Sgarciu, Valentin
Author_Institution :
Fac. of Autom. Control & Comput. Sci., Politeh. Univ. of Bucharest, Bucharest, Romania
fYear :
2010
fDate :
22-27 Aug. 2010
Firstpage :
190
Lastpage :
195
Abstract :
Since data is becoming more and more unstructured, clustering heterogeneous data is essential to getting structured information in response to user queries. In this paper, we test and validate the results of a new clustering technique, clustering by compression, when applied to metadata associated with heterogeneous sets of documents. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). Experimental results show that using metadata could improve average clustering performance by about 10% compared with clustering the same sample data set without metadata.
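As an illustration of the distance named in the abstract, the following is a minimal sketch of the standard NCD definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The choice of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for this example; the record does not say which compressors or tooling the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses the compressed lengths of x and y singly and of their
    concatenation, as described in the abstract.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample inputs: two similar metadata strings and one unrelated blob.
meta_a = b"title: clustering by compression; keywords: NCD, metadata; " * 30
meta_b = b"title: clustering with metadata; keywords: NCD, compression; " * 30
other = bytes(range(256)) * 8

print("similar pair:  ", ncd(meta_a, meta_b))
print("unrelated pair:", ncd(meta_a, other))
```

Smaller values indicate more similar inputs. A clustering step would typically compute this distance for every pair of documents and pass the resulting matrix to a hierarchical clustering routine; that step is omitted here because the record gives no detail on it.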
Keywords :
document handling; meta data; pattern clustering; NCD; cluster heterogeneous documents; clustering technique; compression algorithms; compression procedure; metadata; normalized compression distance; pairwise concatenation; semantic point of view; structured information; user queries; clustering by compression; heterogeneous data
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Engineering Advances (ICSEA), 2010 Fifth International Conference on
Conference_Location :
Nice
Print_ISBN :
978-1-4244-7788-3
Electronic_ISBN :
978-0-7695-4144-0
Type :
conf
DOI :
10.1109/ICSEA.2010.36
Filename :
5615737