DocumentCode
2508093
Title
The organisation and visualisation of document corpora: a probabilistic approach
Author
Girolami, Mark ; Vinokourov, Alexei ; Kabán, Ata
Author_Institution
Dept. of Comput. & Inf. Syst., Paisley Univ., UK
fYear
2000
fDate
2000
Firstpage
558
Lastpage
564
Abstract
A generic probabilistic framework for the unsupervised organisation and visualisation of document collections is presented. The probabilistic hierarchical clustering of large-scale sparse and high-dimensional data collections is achieved by the development of a family of latent class models which are parameterized using the expectation maximisation algorithm. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. The subsequent visualisation of document collections is achieved by exploiting the topographic relations between similar documents. A latent trait model is developed which provides the means of viewing vector space document representations on a 2D grid and thereby visualising the inherent structure of the document collection. A number of experiments are provided to demonstrate the technique and a concluding discussion on the proposed models is given.
Keywords
data visualisation; document handling; optimisation; probability; topology; 2D grid; asymmetric models; binomial distributions; document collections; document corpora visualisation; expectation maximisation algorithm; generic probabilistic framework; hierarchical probabilistic mixture methodology; high-dimensional data collections; latent class models; latent trait model; multinomial distribution; probabilistic approach; probabilistic hierarchical clustering; similar documents; symmetric models; text data; topographic relations; unsupervised organisation; vector space document representations; Clustering algorithms; Computational intelligence; Costs; Databases; Information retrieval; Information systems; Internet; Large-scale systems; Parameter estimation; Visualization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Database and Expert Systems Applications, 2000. Proceedings. 11th International Workshop on
Conference_Location
London
ISSN
1529-4188
Print_ISBN
0-7695-0680-1
Type
conf
DOI
10.1109/DEXA.2000.875081
Filename
875081