Title of article :
Long distance bigram models applied to word clustering
Author/Authors :
Bassiou, Nikoletta and Kotropoulos, Constantine
Issue Information :
Journal with serial issue number, year 2011
Pages :
14
From page :
145
To page :
158
Abstract :
Two novel word clustering techniques that employ long distance bigram language models are proposed. The first technique is built on a hierarchical clustering algorithm and minimizes the sum of the Mahalanobis distances between all words and the centroid of the class created by each cluster merger. The second technique resorts to probabilistic latent semantic analysis (PLSA). Next, interpolated long distance bigrams are considered in the context of the aforementioned clustering techniques. Experiments conducted on the English Gigaword corpus (second edition) demonstrate that: (1) the long distance bigrams, when employed in the two clustering techniques under study, yield word clusters of better quality than the baseline bigrams; (2) the interpolated long distance bigrams outperform the long distance bigrams in the same respect; (3) the long distance bigrams perform better than bigrams that incorporate trigger-pairs selected at various distances; and (4) the best word clustering is achieved by the PLSA that employs interpolated long distance bigrams. Both proposed techniques outperform spectral clustering based on k-means. To assess the quality of the created clusters objectively, relative cluster validity indices are estimated, and the average cluster sense precision, the average cluster sense recall, and the F-measure are computed by exploiting ground truth extracted from WordNet.
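As an illustration of the term used above, a long distance bigram can be read as a word pair whose members lie a fixed number of positions d apart, and an interpolated model as a weighted combination of the estimates at several distances. The following is a minimal sketch under those assumptions only: the function names, the unsmoothed maximum-likelihood estimate, and the linear interpolation with hand-picked weights are hypothetical choices for illustration, not the estimators used in the article.

```python
from collections import Counter


def distance_bigram_counts(tokens, d):
    """Count pairs (tokens[i], tokens[i + d]) for a fixed distance d >= 1."""
    if d < 1:
        raise ValueError("d must be >= 1")
    # zip stops at the shorter sequence, so only complete pairs are counted.
    return Counter(zip(tokens, tokens[d:]))


def distance_bigram_prob(counts, history, word):
    """Unsmoothed estimate of P_d(word | history); an assumption, not the paper's."""
    total = sum(c for (h, _), c in counts.items() if h == history)
    return counts[(history, word)] / total if total else 0.0


def interpolated_prob(tokens, history, word, weights):
    """Linear interpolation over distances; weights[k] applies to d = k + 1."""
    return sum(
        lam * distance_bigram_prob(distance_bigram_counts(tokens, d), history, word)
        for d, lam in enumerate(weights, start=1)
    )


tokens = "the cat sat on the mat".split()
d1 = distance_bigram_counts(tokens, 1)  # ordinary bigrams
d2 = distance_bigram_counts(tokens, 2)  # pairs two positions apart
print(distance_bigram_prob(d1, "the", "cat"))
print(distance_bigram_prob(d2, "the", "sat"))
print(interpolated_prob(tokens, "the", "cat", [0.5, 0.5]))
```

With d = 1 the counts reduce to ordinary bigrams, which corresponds to the baseline against which the abstract compares the long distance models.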
Keywords :
Language modeling , Distance bigrams , Trigger-pairs , Cluster dispersion , Probabilistic latent semantic analysis , Spectral clustering , Cluster sense recall , WordNet , Word clustering , Relative cluster validity indices , Cluster sense precision
Journal title :
PATTERN RECOGNITION
Serial Year :
2011
Record number :
1733886