DocumentCode
2477831
Title
A source coding approach to classification by vector quantization and the principle of minimum description length
Author
Li, Jia
Author_Institution
Dept. of Stat., Pennsylvania State Univ., University Park, PA, USA
fYear
2002
fDate
2002
Firstpage
382
Lastpage
391
Abstract
An algorithm for supervised classification using vector quantization and entropy coding is presented. The classification rule is formed from a set of training data {(X_i, Y_i)}_{i=1}^{n}, which are independent samples from a joint distribution P_{XY}. By the principle of minimum description length (MDL), a statistical model that approximates the distribution P_{XY} ought to enable efficient coding of X and Y. Conversely, we expect a system that encodes (X, Y) efficiently to provide ample information on the distribution P_{XY}. This information can then be used to classify X, i.e., to predict the corresponding Y based on X. To encode both X and Y, a two-stage vector quantizer is applied to X and a Huffman code is formed for Y conditioned on each quantized value of X. The optimization of the encoder is equivalent to the design of a vector quantizer with an objective function reflecting the joint penalty of quantization error and misclassification rate. This vector quantizer provides an estimate of the conditional distribution of Y given X, which in turn yields an approximation to the Bayes classification rule. The algorithm, called discriminant vector quantization (DVQ), is compared with learning vector quantization (LVQ) and CART® on a number of data sets. DVQ outperforms the other two on several data sets. The relation between DVQ, density estimation, and regression is also discussed.
Keywords
Bayes methods; Huffman codes; entropy codes; optimisation; pattern classification; sampling methods; source coding; vector quantisation; Bayes classification rule; DVQ; Huffman code; MDL; conditional distribution; density estimation; discriminant vector quantization; encoder optimization; entropy coding; independent samples; joint distribution; minimum description length; misclassification rate; quantization error; regression; source coding; statistical model; supervised classification; training data; two-stage vector quantizer; Clustering algorithms; Data compression; Probability; Prototypes; Random variables; Source coding; Statistics; Testing; Training data; Vector quantization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Compression Conference, 2002. Proceedings. DCC 2002
ISSN
1068-0314
Print_ISBN
0-7695-1477-4
Type
conf
DOI
10.1109/DCC.2002.999978
Filename
999978