• DocumentCode
    174894
  • Title

    Investigation of Latent Semantic Analysis for Clustering of Czech News Articles

  • Author

    Rott, Michal ; Cerva, Petr

  • Author_Institution
    Inst. of Inf. Technol. & Electron., Tech. Univ. of Liberec, Liberec, Czech Republic
  • fYear
    2014
  • fDate
    1-5 Sept. 2014
  • Firstpage
    223
  • Lastpage
    227
  • Abstract
    This paper studies the use of Latent Semantic Analysis (LSA) for automatic clustering of Czech news articles. We show that LSA is capable of yielding good results in this task, as it allows us to reduce the problem of synonymy. This is particularly important for Czech, which belongs to a group of highly inflective and morphologically rich languages. The experimental evaluation of our clustering scheme and the investigation of LSA are performed on query- and category-based test sets. The results show that the automatic system yields Rand index values that are 20% lower in absolute terms than the accuracy of human cluster annotations. We also show which similarity metric should be used for cluster merging and how dimension reduction affects clustering accuracy.
  • Keywords
    electronic publishing; merging; natural language processing; pattern clustering; query processing; LSA; Rand index; automatic Czech news article clustering; category-based test sets; cluster merging; clustering accuracy; dimension reduction; latent semantic analysis; morphologically-rich languages; query-based test sets; similarity metric; synonymy; Accuracy; Indexes; Matrix decomposition; Measurement; Semantics; Silicon; Vectors; clustering; latent semantic analysis
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Database and Expert Systems Applications (DEXA), 2014 25th International Workshop on
  • Conference_Location
    Munich
  • ISSN
    1529-4188
  • Print_ISBN
    978-1-4799-5721-7
  • Type

    conf

  • DOI
    10.1109/DEXA.2014.54
  • Filename
    6974853