DocumentCode :
3570556
Title :
Utilizing social media data through similarity-based text normalization for LVCSR language modeling
Author :
Chotimongkol, Ananlada ; Thangthai, Kwanchiva ; Wutiwiwatchai, Chai
Author_Institution :
National Electronics and Computer Technology Center (NECTEC), Pathum Thani, Thailand
fYear :
2014
Firstpage :
1
Lastpage :
6
Abstract :
In this paper, we explore the use of social media data to compensate for the lack of large prepared text corpora for large-vocabulary continuous speech recognition (LVCSR) language modeling. Extensive normalization is required to handle the informal and noisy nature of social media text. We propose a similarity-based text normalization approach that considers similarity in terms of spelling, pronunciation and context. Similarity between a source (nonstandard) word and a target (normalized) word is measured by edit distance and Kullback-Leibler distance. The proposed normalization method can handle homophones, spelling errors and insertions (repeated characters), which occur quite often in Twitter texts. We then trained n-gram language models on the normalized texts and achieved up to 60% relative improvement in perplexity and 9% in WER on a mobile speech-to-speech translation task. Because the approach is unsupervised, it is applicable to other types of social media text.
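The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulation, so everything here is an assumption for illustration: the function names `edit_distance` and `kl_distance`, the use of plain Levenshtein distance on characters, the representation of a word's context as counts of neighboring words, and the additive smoothing constant `alpha`.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (assumed spelling measure)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def kl_distance(p_counts: dict, q_counts: dict, alpha: float = 0.5) -> float:
    """KL divergence D(P || Q) between two context-word count tables.

    Additive smoothing (alpha) is an assumption made here so that no
    probability is zero; the paper may smooth differently.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + alpha) / p_total
        q = (q_counts.get(w, 0) + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical example: a nonstandard word with repeated characters
# and a candidate normalized form, with made-up context counts.
source_context = {"very": 3, "good": 2}
target_context = {"very": 5, "good": 4, "much": 1}
print(edit_distance("soooo", "so"))
print(kl_distance(source_context, target_context))
```

In a sketch like this, a lower edit distance and a lower Kullback-Leibler distance would both indicate that the candidate is a more plausible normalization of the source word. How the paper combines the two scores, and how it measures pronunciation similarity, is not stated in the abstract.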
Keywords :
social networking (online); speech recognition; text analysis; vocabulary; Kullback-Leibler distance; LVCSR language modeling; Twitter text; edit distance; homophonic case; mobile speech-to-speech translation task; n-gram language model; nonstandard word; normalization method; normalized text; normalized word; pronunciation; repeated character; similarity-based text normalization; social media data; spelling error case; text corpora; Accuracy; Context; Data models; Media; Mobile communication; Speech; Twitter; LVCSR; language modeling; social media; text normalization
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
2014 17th Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA)
Type :
conf
DOI :
10.1109/ICSDA.2014.7051432
Filename :
7051432