Title :
Document vector compression and its application in document clustering
Author_Institution :
Intelligent Engines, Calgary Univ., Alta.
Abstract :
Document clustering organizes documents into groups such that each group contains documents with similar content. The majority of document clustering algorithms require a vector representation for each document, and each vector has well over 10,000 elements. Consequently, the memory required during clustering can be extremely high when clustering hundreds of thousands of documents. This paper introduces document vector compression, which is based on the discrete cosine transform (DCT). Document vector compression reduces the run-time memory requirements by as much as 60%. Unlike other document vector reduction techniques, document vector compression does not degrade the final cluster quality (total F-measure).
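The abstract does not describe how the DCT is applied, so the following is only a minimal sketch of one plausible reading: transform a document's term-weight vector with an orthonormal DCT-II and store only the leading coefficients. The function names (`dct2`, `idct2`, `compress`), the `keep_ratio` parameter, and the choice to keep the first 40% of coefficients (to mirror the reported 60% memory saving) are all assumptions made for illustration, not the paper's method.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a list of floats (O(n^2), for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(coeffs, n):
    """Inverse of dct2; coefficients missing from a truncated list count as zero."""
    out = []
    for i in range(n):
        s = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out


def compress(vec, keep_ratio=0.4):
    """Keep only the leading DCT coefficients (assumed truncation scheme)."""
    k = max(1, int(round(len(vec) * keep_ratio)))
    return dct2(vec)[:k]


# Hypothetical 10-term document vector (e.g. term frequencies).
doc = [3.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
packed = compress(doc, keep_ratio=0.4)   # 4 coefficients stored instead of 10
approx = idct2(packed, len(doc))         # lossy reconstruction of the vector
```

Because the transform is orthonormal, distances computed on full coefficient vectors equal distances on the original vectors; truncation makes the representation lossy, and whether cluster quality survives depends on how much of each vector's energy falls in the retained coefficients.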
Keywords :
data compression; discrete cosine transforms; document image processing; image coding; image representation; DCT; discrete cosine transform; document clustering algorithms; document vector compression; run-time memory requirements; vector representation; Arithmetic; Clustering algorithms; Compaction; Data mining; Degradation; Discrete cosine transforms; Engines; Frequency; Information retrieval; Runtime;
Conference_Title :
Electrical and Computer Engineering, 2005. Canadian Conference on
Conference_Location :
Saskatoon, Sask.
Print_ISBN :
0-7803-8885-2
DOI :
10.1109/CCECE.2005.1557384