Title :
Semi-supervised topic classification for low resource languages
Author :
Liu, Daben ; McVeety, Sam ; Prasad, Rohit ; Natarajan, Prem
Author_Institution :
BBN Technol., Cambridge, MA
fDate :
March 31 2008-April 4 2008
Abstract :
In this paper, we present a novel methodology for rapidly developing a topic-based document classification system for a language that has limited resources. Our hybrid approach combines supervised and unsupervised topic classification techniques. Because access to native speakers is fairly limited for low-resource languages, our approach requires annotating only a few broad "root" topics in the corpus. Next, an unsupervised topic discovery (UTD) technique is used to automatically determine finer topics within the root topics. Lastly, we use the recently developed unsupervised topic clustering technique to organize the corpus into a hierarchical structure that enables browsing documents at multiple levels of granularity. Recognizing the need to reduce false alarms at runtime, we also describe rejection techniques for discarding off-topic documents.
Keywords :
classification; document handling; hidden Markov models; natural language processing; unsupervised learning; hidden Markov model; low resource language; semi unsupervised topic-based document classification; unsupervised topic discovery; Broadcasting; Hidden Markov models; Humans; Internet; Natural languages; Runtime; Search engines; Testing; Topology; Web sites; Hidden Markov Model; Malay; off-topic rejection; topic clustering
Conference_Titel :
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-1483-3
Electronic_ISBN :
1520-6149
DOI :
10.1109/ICASSP.2008.4518804