LZW Based Distance Measures for Spoken Language Identification

Author

Basavaraja, S.V. ; Sreenivas, T.V.

Author_Institution

Dept. of Electr. Commun. Eng., Indian Inst. of Sci., Bangalore

fYear

2006

fDate

28-30 June 2006

Firstpage

1

Lastpage

6

Abstract

We present a new approach to spoken language modeling for language identification (LID) using the Lempel-Ziv-Welch (LZW) algorithm. The LZW technique is applicable to any kind of tokenization of the speech signal. Because the LZW algorithm efficiently extracts variable-length symbol strings from the training data, the LZW codebook captures the essentials of a language effectively. We develop two new deterministic measures for LID based on the LZW algorithm, namely (i) the compression ratio score (LZW-CR) and (ii) the weighted discriminant score (LZW-WDS). To assess these measures, we consider error-free tokenization of speech as well as artificially induced noise in the tokenization. For a six-language LID task on the OGI-TS database with clean tokenization, the new model (LZW-WDS) performs slightly better than the conventional bigram model. For noisy tokenization, which is the more realistic case, LZW-WDS significantly outperforms the bigram technique.

Keywords

natural languages; speech processing; speech recognition; LZW-WDS technique; Lempel-Ziv-Welch algorithm; OGI-TS database; error-free tokenization; speech signal; spoken language identification; weighted discriminant score; Databases; Electric variables measurement; Maximum likelihood estimation; Natural languages; Neural networks; Noise measurement; Speech enhancement; Stochastic processes; TV; Training data

fLanguage

English

Publisher

IEEE

Conference_Titel

Speaker and Language Recognition Workshop, 2006. IEEE Odyssey 2006: The

Conference_Location

San Juan

Print_ISBN

1-4244-0471-1

Electronic_ISBN

1-4244-0472-X

Type

conf

DOI

10.1109/ODYSSEY.2006.248103

Filename

4013520