Investigation of using different Chinese word segmentation standards and algorithms for automatic speech recognition

Author

Chongjia Ni ; Cheung-Chi Leung

Author_Institution

Inst. for Infocomm Res. (I2R), A*STAR, Singapore, Singapore

fYear

2014

fDate

12-14 Sept. 2014

Firstpage

44

Lastpage

48

Abstract

Chinese word segmentation (CWS) is a necessary step in Mandarin Chinese automatic speech recognition (ASR), and it affects ASR results. However, few studies have examined the relationship between CWS and ASR. Building a segmenter involves choosing CWS settings, including a segmentation standard and an algorithm. In this paper, four CWS standards and three CWS algorithms (maximum matching, a term frequency based algorithm, and a conditional random field (CRF) based algorithm) are investigated with respect to ASR performance. Our experiments on the second Sighan Bakeoff data and Mandarin Chinese conversational telephone speech show that better segmentation performance does not necessarily lead to better ASR performance. Maximum matching and the term frequency based algorithm, which are classified as lexicon-based algorithms, make it easier to update the vocabulary inventory according to the needs of the application. We find that these two algorithms provide ASR performance similar to that of the CRF-based algorithm. Motivated by the availability of huge amounts of web text data, we investigate whether such data can improve the term frequency based algorithm and thus ASR performance. Lastly, we find that combining the two lexicon-based algorithms through language model interpolation can further improve ASR performance.
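
To make the lexicon-based idea in the abstract concrete, the following is a minimal sketch of forward maximum matching, one common variant of the technique. The abstract does not say which variant the authors used, so the function name `forward_maximum_matching`, the `max_word_len` parameter, the single-character fallback, and the toy lexicon are all assumptions made for illustration, not the paper's implementation.

```python
def forward_maximum_matching(text, lexicon, max_word_len=None):
    """Greedy left-to-right segmentation (illustrative sketch only).

    At each position, take the longest substring found in `lexicon`;
    if nothing matches, fall back to a single character.
    """
    if max_word_len is None:
        # Assumption: cap the search window at the longest lexicon entry.
        max_word_len = max((len(w) for w in lexicon), default=1)
    max_word_len = max(1, max_word_len)  # guard against a zero-width window

    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon, not taken from the paper's data.
toy_lexicon = {"语音", "识别", "语音识别", "自动", "中文", "分词"}
print(forward_maximum_matching("中文自动语音识别", toy_lexicon))
# → ['中文', '自动', '语音识别']
```

Because the segmenter depends only on the lexicon, adding or removing an entry changes the output without retraining anything. For example, dropping "语音识别" from the toy lexicon would split that span into "语音" and "识别", which illustrates the kind of vocabulary flexibility the abstract attributes to lexicon-based algorithms.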

Keywords

natural language processing; speech recognition; ASR performance; CRF-based algorithm; CWS algorithms; CWS settings; CWS standards; Chinese word segmentation standards; Mandarin Chinese automatic speech recognition; Mandarin Chinese conversational telephone speech; Sighan Bakeoff data; Web text data; conditional random field; language model interpolation; lexicon-based algorithms; maximum matching; segmenter; term frequency based algorithm; vocabulary inventories; Classification algorithms; Computational modeling; Data models; Speech; Standards; Training; Training data; Chinese word segmentation; Chinese word segmentation combination; automatic speech recognition;

fLanguage

English

Publisher

IEEE

Conference_Titel

Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on

Conference_Location

Singapore

Type

conf

DOI

10.1109/ISCSLP.2014.6936684

Filename

6936684