DocumentCode :
1297281
Title :
Performance Analysis and Improvement of Turkish Broadcast News Retrieval
Author :
Parlak, Siddika ; Saraçlar, Murat
Author_Institution :
Dept. of Electr. & Comput. Eng., Rutgers Univ., Piscataway, NJ, USA
Volume :
20
Issue :
3
fYear :
2012
fDate :
3/1/2012 12:00:00 AM
Firstpage :
731
Lastpage :
741
Abstract :
This paper presents our work on the retrieval of spoken information in Turkish. Traditional speech retrieval systems perform indexing and retrieval over automatic speech recognition (ASR) transcripts, which include errors either because of out-of-vocabulary (OOV) words or ASR inaccuracy. We use subword units as recognition and indexing units to reduce the OOV rate and index alternative recognition hypotheses to handle ASR errors. Performance of such methods is evaluated on our Turkish Broadcast News Corpus with two types of speech retrieval systems: a spoken term detection (STD) and a spoken document retrieval (SDR) system. To evaluate the SDR system, we also build a spoken information retrieval (IR) collection, which is the first for Turkish. Experiments showed that word segmentation algorithms are quite useful for both tasks. SDR performance is observed to be less dependent on the ASR component, whereas any performance change in ASR directly affects STD. We also present extensive analysis of retrieval performance depending on query length, and propose length-based index combination and thresholding strategies for the STD task. Finally, a new approach, which depends on the detection of stems instead of complete terms, is tried for STD and observed to give promising results. Although evaluations were performed in Turkish, we expect the proposed methods to be effective for similar languages as well.
Keywords :
indexing; information resources; information retrieval; speech recognition; Turkish broadcast news corpus; Turkish broadcast news retrieval; automatic speech recognition transcripts; index alternative recognition hypothesis; indexing unit; length-based index combination; out-of-vocabulary words; performance analysis; query length; speech retrieval system; spoken document retrieval system; spoken information retrieval collection; spoken term detection; word segmentation algorithm; Indexing; Lattices; Materials; Speech; Speech recognition; Automatic speech recognition (ASR); speech retrieval; spoken document retrieval; spoken term detection;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2011.2164531
Filename :
5983479