Incorporating information from syllable-length time scales into automatic speech recognition

Author

Wu, Su-Lin ; Kingsbury, Brian E D ; Morgan, Nelson ; Greenberg, Steven

Author_Institution

Int. Comput. Sci. Inst., Berkeley, CA, USA

Volume

2

fYear

1998

fDate

12-15 May 1998

Firstpage

721

Abstract

Including information distributed over intervals of syllabic duration (100-250 ms) may greatly improve the performance of automatic speech recognition (ASR) systems. ASR systems primarily use representations and recognition units covering phonetic durations (40-100 ms). Humans certainly use information at phonetic time scales, but results from psychoacoustics and psycholinguistics highlight the crucial role of the syllable, and syllable-length intervals, in speech perception. We compare the performance of three ASR systems: a baseline system that uses phone-scale representations and units, an experimental system that uses a syllable-oriented front-end representation and syllabic units for recognition, and a third system that combines the phone-scale and syllable-scale recognizers by merging and rescoring N-best lists. Using the combined recognition system, we observed an improvement in word error rate for telephone-bandwidth, continuous numbers from 6.8% to 5.5% on a clean test set, and from 27.8% to 19.6% on a reverberant test set, over the baseline phone-based system.
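
The abstract says only that the third system merges and rescores N-best lists; it does not give the scoring rule. The Python sketch below is one plausible reading, not the authors' method. It assumes each recognizer emits (hypothesis, log-score) pairs, that the combined score is a weighted sum of the two log-scores, and that a hypothesis missing from one list is assigned that list's worst score minus a fixed penalty. The names `merge_and_rescore`, `phone_weight`, and `missing_penalty` are hypothetical.

```python
from typing import List, Tuple

# One N-best list: (hypothesis text, log-score); higher score is better.
NBest = List[Tuple[str, float]]


def merge_and_rescore(
    phone_nbest: NBest,
    syllable_nbest: NBest,
    phone_weight: float = 0.5,     # assumed interpolation weight
    missing_penalty: float = 5.0,  # assumed penalty for an absent hypothesis
) -> NBest:
    """Union two N-best lists and rank by a weighted sum of log-scores."""
    phone_scores = dict(phone_nbest)
    syllable_scores = dict(syllable_nbest)

    # Back-off score for a hypothesis that one recognizer did not propose:
    # that recognizer's worst listed score minus a fixed penalty.
    phone_floor = min(phone_scores.values(), default=0.0) - missing_penalty
    syllable_floor = min(syllable_scores.values(), default=0.0) - missing_penalty

    merged = []
    for hyp in set(phone_scores) | set(syllable_scores):
        p = phone_scores.get(hyp, phone_floor)
        s = syllable_scores.get(hyp, syllable_floor)
        merged.append((hyp, phone_weight * p + (1.0 - phone_weight) * s))

    # Best combined score first; ties are broken alphabetically so the
    # output order is deterministic.
    merged.sort(key=lambda item: (-item[1], item[0]))
    return merged


# Toy lists with made-up scores, for illustration only.
phone_list = [("one two three", -10.0), ("one two tree", -11.0)]
syllable_list = [("one two three", -12.0), ("one to three", -11.5)]
combined = merge_and_rescore(phone_list, syllable_list)
best_hypothesis = combined[0][0]
```

In this toy case the hypothesis proposed by both recognizers ranks first, because neither of its scores falls back to the penalized floor. In practice the weight and penalty would need tuning on held-out data.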

Keywords

decoding; error statistics; feature extraction; pattern classification; signal representation; speech intelligibility; speech processing; speech recognition; 100 to 250 ms; 40 to 100 ms; ASR systems; N-best lists; automatic speech recognition; baseline phone-based system; clean test set; combined recognition system; continuous numbers; experimental system; performance; phone-scale representations; phonetic time scales; psychoacoustics; psycholinguistics; recognition units; reverberant test set; speech decoding; speech perception; speech unit classification; syllabic duration; syllable-length time scales; syllable-oriented front-end representation; syllable-scale recognizers; telephone-bandwidth; word error rate; Automatic speech recognition; Computer science; Error analysis; Humans; Merging; Psychoacoustics; Psychology; Speech processing; Speech recognition; System testing

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on

Conference_Location

Seattle, WA

ISSN

1520-6149

Print_ISBN

0-7803-4428-6

Type

conf

DOI

10.1109/ICASSP.1998.675366

Filename

675366