Author_Institution :
SpeechLab, Tech. Univ. of Liberec, Liberec, Czech Republic
Abstract :
This paper focuses on the task of detecting words of interest in an audio scene (a room, a lab or a workshop) or in a continuously recorded stream of speech, music and other sounds. Solving this task is important in many applications, e.g. command control in houses for handicapped persons, automation of some manufacturing and logistical operations, or information retrieval from large audio archives. We investigate three keyword spotting techniques and compare them with a classic large vocabulary speech recognition system. To evaluate their performance, we specified and studied two model applications: 1) search in a large audio broadcast archive; 2) voice control of an interactive system. The investigated techniques were evaluated from several points of view, namely their speed (real-time factor), accuracy (equal error rate, figure of merit, receiver operating characteristics), their demands for training data, and the impact of different types of noise.
Keywords :
audio streaming; speech recognition; audio broadcast archive; audio scene; audio streams; equal error rate; handicapped persons; information retrieval; interactive system; keywords detection; logistical operations; manufacturing operations; performance comparison; real-time factor; receiver operating characteristics; vocabulary speech recognition system; voice control; Hidden Markov models; Signal to noise ratio; Speech; Speech recognition; Testing; Vocabulary; Comparison; Filler Model; Keyword Spotting; Speech Processing; Speech Recognition; System Control