Title :
Speaker-invariant and rhythm-sensitive representation of spoken words
Author :
Minematsu, Nobuaki ; Ozaki, Yoshito ; Hirose, Keikichi ; Erickson, David
Author_Institution :
Grad. Sch. of Eng., Univ. of Tokyo, Tokyo, Japan
fDate :
Oct. 29 2013-Nov. 1 2013
Abstract :
It is well known that human speech recognition (HSR) is much more robust than automatic speech recognition (ASR) [1], [2]. Given HSR's very high robustness to large acoustic variability, it is reasonable to assume that humans are able to extract invariant patterns underlying input utterances [3]. Recent work in developmental psychology has found that infants are very sensitive to distributional properties of the sounds of a language [4], [5]. Following this finding, the first author proposed a speaker-independent, or invariant, representation of each utterance, formed from the distributional properties of the sounds in that utterance [6], [7], [8]. This representation is called speech structure and was tested in isolated word recognition experiments [7], [8]. This paper introduces another kind of sensitivity into speech structure, namely sensitivity to language rhythm. Sonority-based syllable nucleus detection is implemented, and local, syllable-based structures are extracted in addition to the conventional global, holistic structures. Isolated word recognition experiments show that recognition performance improves with rhythm-sensitive and local speech structures.
Keywords :
speech processing; speech recognition; ASR; HSR robustness; acoustic variability; automatic speech recognition; developmental psychology; holistic structures; human speech recognition; invariant speech representation; isolated word recognition experiments; language rhythm; local speech structures; rhythm sensitive representation; sonority based syllable nucleus detection; speaker independent; speaker invariant; spoken words; syllable based structures; Acoustics; Adaptation models; Computational modeling; Hidden Markov models; Speech; Speech recognition; Vectors;
Conference_Title :
Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific
Conference_Location :
Kaohsiung
DOI :
10.1109/APSIPA.2013.6694162