Title :
Semi-Supervised Learning for Part-of-Speech Tagging of Mandarin Transcribed Speech
Author :
Wang, Wen ; Huang, Zhongqiang ; Harper, Mary
Author_Institution :
SRI International, Menlo Park, CA 94025, USA. wwang@speech.sri.com
Abstract :
In this paper, we investigate bootstrapping part-of-speech (POS) taggers for Mandarin broadcast news (BN) transcripts using co-training, which iteratively retrains two competitive POS taggers from a small set of labeled training data and a large set of unlabeled data. We compare co-training with self-training. Our results show that co-training performs significantly better than self-training, and that both semi-supervised learning methods significantly improve tagging accuracy over training only on the small labeled seed corpus. We also investigate a variety of example selection approaches for co-training and find that the computationally expensive agreement-based selection approach and a more efficient selection approach based on maximizing training utility yield POS taggers with comparable tagging performance. By applying co-training, we are able to build effective POS taggers for Mandarin transcribed speech with tagging accuracy comparable to that obtained on newswire text.
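The co-training loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two toy taggers (`UnigramTagger`, `SuffixTagger`), the `co_train` function, its `rounds` and `batch` parameters, and the English toy sentences are hypothetical and are not the taggers, data, or example selection methods used in the paper. The sketch shows only the naive form of co-training, in which each tagger labels a batch of unlabeled sentences and those labels are added to the other tagger's training set before both are retrained.

```python
from collections import Counter, defaultdict


class UnigramTagger:
    """Toy tagger: most frequent tag per word, else the most frequent tag overall."""

    def fit(self, sents):
        counts = defaultdict(Counter)
        overall = Counter()
        for sent in sents:
            for word, tag in sent:
                counts[word][tag] += 1
                overall[tag] += 1
        self.best = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.default = overall.most_common(1)[0][0]
        return self

    def tag(self, words):
        return [(w, self.best.get(w, self.default)) for w in words]


class SuffixTagger(UnigramTagger):
    """Toy tagger with a different view of the data: keys on each word's last character."""

    def fit(self, sents):
        return super().fit([[(w[-1:], t) for w, t in s] for s in sents])

    def tag(self, words):
        return [(w, self.best.get(w[-1:], self.default)) for w in words]


def co_train(seed, unlabeled, rounds=2, batch=2):
    """Naive co-training: each tagger labels a batch for the other, then both retrain."""
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabeled)
    a, b = UnigramTagger().fit(train_a), SuffixTagger().fit(train_b)
    for _ in range(rounds):
        if not pool:
            break
        chunk, pool = pool[:batch], pool[batch:]
        labels_a = [a.tag(s) for s in chunk]
        labels_b = [b.tag(s) for s in chunk]
        train_a += labels_b  # tagger A learns from tagger B's labels
        train_b += labels_a  # tagger B learns from tagger A's labels
        a, b = UnigramTagger().fit(train_a), SuffixTagger().fit(train_b)
    return a, b


# Hypothetical toy data: a small labeled seed set and an unlabeled pool.
seed = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]
unlabeled = [["the", "cat", "runs"], ["a", "dog", "runs"]]

baseline = UnigramTagger().fit(seed)
tagger_a, tagger_b = co_train(seed, unlabeled)
```

In this toy run, the seed-only unigram tagger has never seen "runs" and falls back to its default tag, while the co-trained unigram tagger picks up the verb tag from the suffix tagger's labels. The same loop also passes the unigram tagger's wrong guess for "runs" into the suffix tagger's training data, which is the kind of noise that example selection is meant to limit.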
Keywords :
learning (artificial intelligence); natural language processing; speech recognition; Mandarin broadcast news; Mandarin transcribed speech; agreement-based selection approach; bootstrapping part-of-speech; newswire text; part-of-speech tagging; semi-supervised learning; Automatic speech recognition; Broadcasting; Hydrogen; Natural language processing; Natural languages; Semisupervised learning; Speech analysis; Speech recognition; Tagging; Training data; Active learning; Co-training; Mandarin speech recognition; POS tagging; Self-training;
Conference_Title :
Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on
Conference_Location :
Honolulu, HI
Print_ISBN :
1-4244-0727-3
DOI :
10.1109/ICASSP.2007.367182