Title of article :
A Probabilistic Model for Latent Semantic Indexing
Author/Authors :
Chris H.Q. Ding, author
Issue Information :
Monthly, serial issue, 2005
Abstract :
Latent Semantic Indexing (LSI), when applied to semantic
space built on text collections, improves information
retrieval, information filtering, and word sense disambiguation.
A new dual probability model based on the
similarity concepts is introduced to provide deeper understanding
of LSI. Semantic associations can be quantitatively
characterized by their statistical significance,
the likelihood. Semantic dimensions containing redundant
and noisy information can be separated out and
should be ignored because of their negative contribution to
the overall statistical significance. LSI is the optimal
solution of the model. The peak in the likelihood curve
indicates the existence of an intrinsic semantic dimension.
The importance of LSI dimensions follows the
Zipf distribution, indicating that LSI dimensions represent
latent concepts. Document frequency of words follows
the Zipf distribution, and the number of distinct
words follows a log-normal distribution. Experiments on
five standard document collections confirm and illustrate
the analysis.
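
For readers unfamiliar with LSI, the following minimal sketch shows the conventional way it is computed: a truncated singular value decomposition of a term-document matrix, keeping only the k largest semantic dimensions. It does not reproduce the dual probability model or the likelihood curve described in the abstract. The toy matrix, the term list, and the helper names `lsi` and `cosine` are assumptions made for illustration only.

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents).
# The counts are invented for illustration, not taken from the paper.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 1],
], dtype=float)


def lsi(X, k):
    """Return the rank-k truncated SVD factors of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


U_k, s_k, Vt_k = lsi(X, k=2)

# Each row is one document expressed in the k-dimensional semantic space.
doc_coords = (s_k[:, None] * Vt_k).T

# Document 0 uses "car" and document 1 uses "automobile"; they overlap only
# on "engine", so their similarity in the raw term space is low.
raw_sim = cosine(X[:, 0], X[:, 1])
lsi_sim = cosine(doc_coords[0], doc_coords[1])
print(f"raw cosine: {raw_sim:.2f}, LSI cosine: {lsi_sim:.2f}")
```

In this toy case the two vehicle documents become nearly identical once the smaller dimensions are dropped, while the flower document stays orthogonal to both. Choosing k is the question that the abstract addresses through the peak in the likelihood curve; here k=2 is simply picked by hand.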
Journal title :
Journal of the American Society for Information Science and Technology