Title :
Learning a subword vocabulary based on unigram likelihood
Author :
Varjokallio, Matti ; Kurimo, Mikko ; Virpioja, Sami
Author_Institution :
Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
Abstract :
Using words as vocabulary units for tasks like speech recognition is infeasible for many morphologically rich languages, including Finnish. Thus, subword units are commonly used for language modeling. This work presents a novel algorithm for creating a subword vocabulary, based on the unigram likelihood of a text corpus. The method is evaluated with an entropy measure and on a Finnish large-vocabulary continuous speech recognition (LVCSR) task. The unigram entropy of the text corpus is shown to be a good indicator of the quality of higher-order n-gram models, and the method also results in high speech recognition accuracy.
Keywords :
natural language processing; speech recognition; text analysis; vocabulary; Finnish; subword vocabulary; text corpus; unigram likelihood; vocabulary units; Entropy; Hidden Markov models; Speech; Training; Viterbi algorithm; Large Vocabulary Continuous Speech Recognition; Subword Modeling; Vocabulary Selection
Conference_Title :
2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
Conference_Location :
Olomouc
DOI :
10.1109/ASRU.2013.6707697