Title :
Text Classification via iVector Based Feature Representation
Author :
Shengxin Zha ; Xujun Peng ; Huaigu Cao ; Xiandan Zhuang ; Natarajan, Prem ; Natarajan, Prem
Author_Institution :
Dept. of EECS, Northwestern Univ., Evanston, IL, USA
Abstract :
In this paper, we address the problem of text classification: distinguishing modern machine-printed text, handwritten text, and historical typewritten text in degraded, noisy documents. We propose a novel text classification approach based on the iVector, a recently developed concept from speaker verification. For a given text line, the iVector is a fixed-length feature vector obtained by transforming a high-dimensional super vector built from the means of a Gaussian mixture model (GMM); the text-dependent component is separated from a universal background model (UBM) and represented by a low-dimensional set of factors. We classify the text lines in iVector space with a discriminative classifier, a support vector machine (SVM). A baseline approach that classifies text using GMMs in feature space is also presented for comparison. Experimental results on an Arabic document database show an accuracy of 92.04% for text line classification using the proposed method. Furthermore, coupling optical character recognition (OCR) with the proposed iVector-SVM classifier yields a relative word error rate (WER) reduction of 9.6%. The proposed iVector-SVM approach is language independent and can therefore be applied to other scripts as well.
Keywords :
Gaussian processes; document image processing; feature extraction; handwritten character recognition; image classification; image representation; mixture models; optical character recognition; support vector machines; text detection; Arabic document database; GMM; Gaussian mixture model; OCR; UBM; WER; degraded noisy documents; discriminative classifier; feature space; fixed-length feature vector representation; handwritten text classification; high-dimensional super vector; historical typewritten text classification; iVector based feature representation; iVector space; iVector-SVM classifier; modern machine-printed text classification; optical character recognition; relative word error rate; support vector machine; text dependent component; text line classification; universal background model; Feature extraction; Gaussian mixture model; Hidden Markov models; Optical character recognition software; Support vector machines; Training; Vectors;
Conference_Title :
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
DOI :
10.1109/DAS.2014.10