DocumentCode :
3818181
Title :
Turkish Broadcast News Transcription and Retrieval
Author :
Ebru Arisoy;Dogan Can;Siddika Parlak;Hasim Sak;Murat Saraclar
Author_Institution :
Dept. of Electr. & Electron. Eng., Bogazici Univ., Istanbul
Volume :
17
Issue :
5
fYear :
2009
fDate :
7/1/2009 12:00:00 AM
Firstpage :
874
Lastpage :
883
Abstract :
This paper summarizes our recent efforts to build a Turkish Broadcast News transcription and retrieval system. The agglutinative nature of Turkish leads to a high number of out-of-vocabulary (OOV) words, which in turn lower automatic speech recognition (ASR) accuracy. This compromises the performance of speech retrieval systems that rely on ASR output. A word-based ASR system is therefore not adequate for transcribing Turkish speech. To alleviate this problem, various sub-word-based recognition units are utilized. These units solve the OOV problem with moderate-size vocabularies and yield better recognition accuracy than even a 500K-word vocabulary. As a novel approach, the interaction between the choice of recognition unit (words or sub-words) and discriminative training is explored. Sub-word models benefit from discriminative training more than word models do, especially in the discriminative language modeling framework. For speech retrieval, a spoken term detection system based on automata indexation is utilized. As with transcription, retrieval performance is measured under various schemes incorporating words and sub-words. The best results are obtained using a cascade of word and sub-word indexes together with term-specific thresholding.
Keywords :
"Broadcasting","Automatic speech recognition","Vocabulary","Natural languages","Speech recognition","Statistical analysis","Information retrieval","Automata","Morphology","Councils"
Journal_Title :
IEEE Transactions on Audio, Speech, and Language Processing
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2008.2012313
Filename :
5071138