Title :
Adaptive context trees and text clustering
Author :
Vert, Jean-Philippe
Author_Institution :
Dept. of Math. & Applications, Ecole Normale Superieure, Paris, France
fDate :
7/1/2001 12:00:00 AM
Abstract :
In the finite-alphabet context, we propose four alternatives to fixed-order Markov models to estimate a conditional distribution. They consist in working with a large class of variable-length Markov models represented by context trees, and building an estimator of the conditional distribution with a risk of the same order as the risk of the best estimator for every model simultaneously, in a conditional Kullback-Leibler sense. Such estimators can be used to model complex objects like texts written in natural language and to define a notion of similarity between them. This idea is illustrated by experimental results of unsupervised text clustering.
Keywords :
Markov processes; adaptive estimation; natural languages; pattern clustering; text analysis; trees (mathematics); adaptive context trees; adaptive estimation; conditional Kullback-Leibler estimator; conditional distribution estimation; finite-alphabet context; fixed-order Markov models; natural language; similarity; strings; unsupervised text clustering; variable-length Markov models; Biomedical optical imaging; Context modeling; DNA; Indexing; Minimax techniques; Natural languages; Optical character recognition software; Sequences; Speech; Statistical distributions
Journal_Title :
Information Theory, IEEE Transactions on