Title :
Investigation of Latent Semantic Analysis for Clustering of Czech News Articles
Author :
Rott, Michal ; Cerva, Petr
Author_Institution :
Inst. of Inf. Technol. & Electron., Tech. Univ. of Liberec, Liberec, Czech Republic
Abstract :
This paper studies the use of Latent Semantic Analysis (LSA) for automatic clustering of Czech news articles. We show that LSA is capable of yielding good results in this task as it allows us to reduce the problem of synonymy. This is a very important factor particularly for Czech, which belongs to a group of highly inflective and morphologically rich languages. The experimental evaluation of our clustering scheme and the investigation of LSA are performed on query- and category-based test sets. The obtained results demonstrate that the automatic system yields values of the Rand index that are lower, by 20% absolute, than the accuracy of human cluster annotations. We also show which similarity metric should be used for cluster merging and the effect of dimension reduction on clustering accuracy.
Keywords :
electronic publishing; merging; natural language processing; pattern clustering; query processing; LSA; Rand index; automatic Czech news article clustering; category-based test sets; cluster merging; clustering accuracy; dimension reduction; latent semantic analysis; morphologically-rich languages; query-based test sets; similarity metric; synonymy; Accuracy; Indexes; Matrix decomposition; Measurement; Semantics; Silicon; Vectors; clustering
Conference_Title :
Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
Conference_Location :
Munich
Print_ISBN :
978-1-4799-5721-7
DOI :
10.1109/DEXA.2014.54