DocumentCode
174894
Title
Investigation of Latent Semantic Analysis for Clustering of Czech News Articles
Author
Rott, Michal ; Cerva, Petr
Author_Institution
Inst. of Inf. Technol. & Electron., Tech. Univ. of Liberec, Liberec, Czech Republic
fYear
2014
fDate
1-5 Sept. 2014
Firstpage
223
Lastpage
227
Abstract
This paper studies the use of Latent Semantic Analysis (LSA) for automatic clustering of Czech news articles. We show that LSA is capable of yielding good results in this task as it allows us to reduce the problem of synonymy. This is a very important factor particularly for Czech, which belongs to a group of highly inflective and morphologically rich languages. The experimental evaluation of our clustering scheme and the investigation of LSA are performed on query- and category-based test sets. The obtained results demonstrate that the automatic system yields Rand index values that are 20% lower in absolute terms than the accuracy of human cluster annotations. We also show which similarity metric should be used for cluster merging and the effect of dimension reduction on clustering accuracy.
Keywords
electronic publishing; merging; natural language processing; pattern clustering; query processing; LSA; Rand index; automatic Czech news article clustering; category-based test sets; cluster merging; clustering accuracy; dimension reduction; latent semantic analysis; morphologically-rich languages; query-based test sets; similarity metric; synonymy; Accuracy; Indexes; Matrix decomposition; Measurement; Semantics; Silicon; Vectors; clustering; latent semantic analysis
fLanguage
English
Publisher
IEEE
Conference_Titel
Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
Conference_Location
Munich
ISSN
1529-4188
Print_ISBN
978-1-4799-5721-7
Type
conf
DOI
10.1109/DEXA.2014.54
Filename
6974853