DocumentCode :
672320
Title :
Learning a subword vocabulary based on unigram likelihood
Author :
Varjokallio, Matti ; Kurimo, Mikko ; Virpioja, Sami
Author_Institution :
Dept. of Signal Process. & Acoust., Aalto Univ., Espoo, Finland
fYear :
2013
fDate :
8-12 Dec. 2013
Firstpage :
7
Lastpage :
12
Abstract :
Using words as vocabulary units for tasks like speech recognition is infeasible for many morphologically rich languages, including Finnish. Thus, subword units are commonly used for language modeling. This work presents a novel algorithm for creating a subword vocabulary, based on the unigram likelihood of a text corpus. The method is evaluated with an entropy measure and on a Finnish large-vocabulary continuous speech recognition (LVCSR) task. The unigram entropy of the text corpus is shown to be a good indicator of the quality of higher-order n-gram models, and the method also achieves high speech recognition accuracy.
Keywords :
natural language processing; speech recognition; text analysis; vocabulary; Finnish; speech recognition; subword vocabulary; text corpus; unigram likelihood; vocabulary units; Entropy; Hidden Markov models; Speech; Speech recognition; Training; Viterbi algorithm; Vocabulary; Large Vocabulary Continuous Speech Recognition; Subword Modeling; Vocabulary Selection;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on
Conference_Location :
Olomouc
Type :
conf
DOI :
10.1109/ASRU.2013.6707697
Filename :
6707697