  • DocumentCode
    134291
  • Title
    Investigation of using different Chinese word segmentation standards and algorithms for automatic speech recognition
  • Author
    Chongjia Ni; Cheung-Chi Leung
  • Author_Institution
    Institute for Infocomm Research (I2R), A*STAR, Singapore, Singapore
  • fYear
    2014
  • fDate
    12-14 Sept. 2014
  • Firstpage
    44
  • Lastpage
    48
  • Abstract
    Chinese word segmentation (CWS) is a necessary step in Mandarin Chinese automatic speech recognition (ASR), and it affects ASR results. However, few works have studied the relationship between CWS and ASR. Building a segmenter involves choosing CWS settings, namely a segmentation standard and an algorithm. In this paper, four CWS standards and three CWS algorithms (maximum matching, term frequency based, and conditional random field (CRF) based) are investigated for their effect on ASR performance. Our experiments on the second Sighan Bakeoff data and Mandarin Chinese conversational telephone speech show that better segmentation performance does not necessarily lead to better ASR performance. Maximum matching and the term frequency based algorithm, both classified as lexicon-based algorithms, make it easier to update their vocabulary inventories according to application needs. We find that these two algorithms provide ASR performance similar to that of the CRF-based algorithm. Motivated by the availability of huge amounts of web text data, we investigate whether such data can improve the term frequency based algorithm and thus ASR performance. Lastly, we find that combining the two lexicon-based algorithms through language model interpolation further improves ASR performance.
  • Keywords
    natural language processing; speech recognition; ASR performance; CRF-based algorithm; CWS algorithms; CWS settings; CWS standards; Chinese word segmentation standards; Mandarin Chinese automatic speech recognition; Mandarin Chinese conversational telephone speech; Sighan Bakeoff data; Web text data; conditional random field; language model interpolation; lexicon-based algorithms; maximum matching; segmenter; term frequency based algorithm; vocabulary inventories; Classification algorithms; Computational modeling; Data models; Speech; Standards; Training; Training data; Chinese word segmentation; Chinese word segmentation combination; automatic speech recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2014 9th International Symposium on Chinese Spoken Language Processing (ISCSLP)
  • Conference_Location
    Singapore
  • Type
    conf
  • DOI
    10.1109/ISCSLP.2014.6936684
  • Filename
    6936684