Title :
The development of file formats for very large speech corpora: SPHERE and SHORTEN
Author :
Garofolo, J.S.; Robinson, Tony; Fiscus, Jonathan G.
Author_Institution :
Nat. Inst. of Stand. & Technol., USA
Abstract :
The performance of large vocabulary speech recognition systems is currently thought to be limited by the size of the corpus used to train the recognition system. Hence several very large speech corpora have been created recently and many more are planned. A significant problem in the generation of these corpora is the definition of their format to minimize distribution costs and maximize ease of use. This paper describes the development of a “standard” lossless compressed waveform file format which minimizes the media required for corpora distribution while maximizing accessibility. This paper contains two primary contributions: 1) The use of a “standard” file format for speech corpora which supports embedded compression and the development of a software interface toolkit which supports automatic waveform compression/decompression; 2) The use of lossless data compression for speech corpora. This task differs from mainstream speech coding in that the compression must be fast and lossless. Fast approximations to the standard techniques of linear prediction and residual coding have been developed and are employed.
Keywords :
data compression; data structures; linear predictive coding; speech coding; speech recognition; SHORTEN; SPHERE; decompression; distribution costs; ease of use; embedded compression; file formats; large vocabulary speech recognition systems; linear prediction; lossless compressed waveform file format; lossless data compression; performance; residual coding; software interface toolkit; very large speech corpora; Costs; Data compression; Ear; Embedded software; NIST; Random media; Software tools; Speech; Standards development; Vocabulary
Conference_Title :
Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on
Conference_Location :
Adelaide, SA
Print_ISBN :
0-7803-1775-0
DOI :
10.1109/ICASSP.1994.389342