Title :
The organisation and visualisation of document corpora: a probabilistic approach
Author :
Girolami, Mark ; Vinokourov, Alexei ; Kabán, Ata
Author_Institution :
Dept. of Comput. & Inf. Syst., Paisley Univ., UK
Abstract :
A generic probabilistic framework for the unsupervised organisation and visualisation of document collections is presented. The probabilistic hierarchical clustering of large-scale, sparse and high-dimensional data collections is achieved by the development of a family of latent class models which are parameterised using the expectation maximisation algorithm. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, and these have been termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. The subsequent visualisation of document collections is achieved by exploiting the topographic relations between similar documents. A latent trait model is developed which provides the means of viewing vector space document representations on a 2D grid and thereby visualising the inherent structure of the document collection. A number of experiments are provided to demonstrate the technique, and a concluding discussion on the proposed models is given.
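As a rough illustration of the kind of model the abstract refers to, the sketch below fits a flat (non-hierarchical) mixture of multinomials to term-count vectors using expectation maximisation. It is not the authors' method: the hierarchical structure, the symmetric/asymmetric distinction and the latent trait visualisation are not reproduced, and the function name `em_multinomial_mixture`, the `smooth` parameter and the toy counts are all assumptions made for this example.

```python
import math
import random


def em_multinomial_mixture(counts, n_classes, n_iter=50, seed=0, smooth=1e-2):
    """Fit a flat mixture of multinomials to term-count vectors with EM.

    Hypothetical helper for illustration only; `smooth` is an assumed
    additive constant that keeps every probability strictly positive.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    # Mixing weights start uniform; term distributions start random.
    pi = [1.0 / n_classes] * n_classes
    theta = []
    for _ in range(n_classes):
        row = [rng.random() + 0.5 for _ in range(n_terms)]
        total = sum(row)
        theta.append([v / total for v in row])
    resp = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities per document (log space).
        resp = []
        for doc in counts:
            logp = [
                math.log(pi[k])
                + sum(c * math.log(theta[k][t]) for t, c in enumerate(doc) if c)
                for k in range(n_classes)
            ]
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            norm = sum(weights)
            resp.append([w / norm for w in weights])
        # M-step: re-estimate smoothed mixing weights and term distributions.
        for k in range(n_classes):
            nk = sum(r[k] for r in resp)
            pi[k] = (nk + smooth) / (n_docs + n_classes * smooth)
            row = [
                smooth + sum(resp[d][k] * counts[d][t] for d in range(n_docs))
                for t in range(n_terms)
            ]
            total = sum(row)
            theta[k] = [v / total for v in row]
    return pi, theta, resp


# Toy term counts (made up): four documents over a four-term vocabulary.
docs = [[5, 4, 0, 0], [6, 3, 0, 1], [0, 0, 4, 5], [0, 1, 5, 4]]
pi, theta, resp = em_multinomial_mixture(docs, n_classes=2)
for row in resp:
    print([round(r, 3) for r in row])
```

Each printed row holds one document's posterior class probabilities. EM only finds a local optimum, so the resulting assignment can depend on `seed`.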
Keywords :
data visualisation; document handling; optimisation; probability; topology; 2D grid; asymmetric models; binomial distributions; document collections; document corpora visualisation; expectation maximisation algorithm; generic probabilistic framework; hierarchical probabilistic mixture methodology; high-dimensional data collections; latent class models; latent trait model; multinomial distribution; probabilistic approach; probabilistic hierarchical clustering; similar documents; symmetric models; text data; topographic relations; unsupervised organisation; vector space document representations; Clustering algorithms; Computational intelligence; Costs; Databases; Information retrieval; Information systems; Internet; Large-scale systems; Parameter estimation; Visualization
Conference_Title :
Database and Expert Systems Applications, 2000. Proceedings. 11th International Workshop on
Conference_Location :
London
Print_ISBN :
0-7695-0680-1
DOI :
10.1109/DEXA.2000.875081