Title :
Extracting semantic prototypes and factual information from a large scale corpus using variable size window topic modelling
Author :
Korzycki, Michal ; Korczynski, Wojciech
Author_Institution :
AGH Univ. of Sci. & Technol. in Krakow, Kraków, Poland
Abstract :
This paper proposes a model of textual events as a mixture of semantic stereotypes and factual information. It introduces a method that automatically distinguishes semantic prototypes, which describe general categories of events, from factual elements specific to a given event. The paper then presents the results of an unsupervised topic extraction experiment performed on documents from a large-scale corpus with an additional temporal structure. The experiment compares the nature of the information provided by Latent Dirichlet Allocation and by Vector Space modelling based on Log-Entropy weights. The impact of different corpus time windows on the topic modelling results is also presented. Finally, the paper discusses whether unsupervised topic modelling can capture deeper semantic information, such as elements describing a given event or its causes and results, and distinguish it from purely factual data.
Keywords :
entropy; information retrieval; text analysis; vectors; factual elements; factual information extraction; large scale corpus; latent Dirichlet allocation; log-entropy weights; semantic prototype extraction; semantic prototypes; semantic stereotypes; temporal structure; textual events; time windows; unsupervised topic extraction; unsupervised topic modelling; variable size window topic modelling; vector space modelling; Accidents; Analytical models; Inductors; Prototypes; Semantics; Underwater vehicles;
Conference_Title :
Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on
Conference_Location :
Warsaw
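
Note on Log-Entropy weights :
The abstract refers to Vector Space modelling based on Log-Entropy weights but does not state the formula. As a minimal sketch, assuming the common textbook formulation (local weight log(1 + tf), global weight 1 + sum(p * log p) / log(N)), the weighting could look as follows. The function name `log_entropy_weights` and the toy documents are hypothetical and are not taken from the paper; the authors' exact variant and tooling may differ.

```python
import math
from collections import Counter, defaultdict


def log_entropy_weights(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. Assumes the common log-entropy scheme,
    which may differ from the variant used in the paper.
    """
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")

    # Term frequencies per document and over the whole collection.
    counts = [Counter(doc) for doc in docs]
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    # Accumulate sum_j p_ij * log(p_ij) for every term i.
    entropy_sum = defaultdict(float)
    for c in counts:
        for term, tf in c.items():
            p = tf / global_freq[term]
            entropy_sum[term] += p * math.log(p)

    # Global weight: 1 for a term confined to one document,
    # 0 for a term spread evenly over all documents.
    global_weight = {
        term: 1.0 + entropy_sum[term] / math.log(n_docs)
        for term in global_freq
    }

    # Final weight = local log weight * global entropy weight.
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]


# Hypothetical toy documents, for illustration only.
docs = [["crash", "plane", "paris"], ["crash", "plane", "oslo"]]
weights = log_entropy_weights(docs)
```

In this toy input, "crash" occurs equally in both documents and receives weight 0, while "paris" occurs in only one document and receives log(2), about 0.693. Terms shared evenly across documents are thus suppressed and document-specific terms are emphasised, which is the property that makes this weighting a natural point of comparison with topic models.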