  • DocumentCode
    177440
  • Title
    Coping with language data sparsity: Semantic head mapping of compound words
  • Author
    Pelemans, Joris ; Demuynck, Kris ; Van hamme, Hugo ; Wambacq, Piet
  • Author_Institution
    Dept. ESAT, Katholieke Univ. Leuven, Leuven, Belgium
  • fYear
    2014
  • fDate
    4-9 May 2014
  • Firstpage
    141
  • Lastpage
    145
  • Abstract
    In this paper, we present a novel clustering technique for compound words. By mapping compounds onto their semantic heads, the technique is able to estimate n-gram probabilities for unseen compounds. We argue that compounds are well represented by their heads, which allows the clustering of rare words and reduces the risk of over-generalization. The semantic heads are obtained by a two-step process that consists of constituent generation and best head selection based on corpus statistics. Experiments on Dutch read speech show that our technique is capable of correctly identifying compounds and their semantic heads with a precision of 80.25% and a recall of 85.97%. A class-based language model with compound-head clusters achieves a significant reduction in both perplexity and word error rate (WER).
  • Keywords
    pattern clustering; probability; speech processing; speech recognition; statistics; Dutch read speech; WER; automatic speech recognition; class-based language model; clustering technique; compound word; compound-head clustering; corpus statistics; language data sparsity; n-gram probability estimation; semantic head mapping; Acoustics; Compounds; Conferences; Decision support systems; Speech; Speech processing; OOV; clustering; compounds; n-grams; sparsity;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on
  • Conference_Location
    Florence
  • Type
    conf
  • DOI
    10.1109/ICASSP.2014.6853574
  • Filename
    6853574
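
The two-step process named in the abstract (constituent generation, then head selection from corpus statistics) might be sketched roughly as below. This is an illustration only, not the authors' implementation: the function names, the minimum constituent length `MIN_LEN`, the optional linking "s", the choice of the rightmost constituent as head, and ranking candidates by unigram count are all assumptions made for the example.

```python
from collections import Counter

MIN_LEN = 3  # assumed minimum length of a constituent (hypothetical setting)


def candidate_splits(word, vocab):
    """Step 1 (assumed form): list (modifier, head) splits of `word`
    in which both parts are known words in `vocab`."""
    splits = []
    for i in range(MIN_LEN, len(word) - MIN_LEN + 1):
        modifier, head = word[:i], word[i:]
        # Accept the modifier as-is, or without a trailing linking "s"
        # (an assumption about Dutch compounding, kept deliberately simple).
        stems = {modifier}
        if modifier.endswith("s"):
            stems.add(modifier[:-1])
        known_modifier = any(len(m) >= MIN_LEN and m in vocab for m in stems)
        if known_modifier and head in vocab:
            splits.append((modifier, head))
    return splits


def best_head(word, counts):
    """Step 2 (assumed form): among the candidate splits, return the head
    with the highest corpus count, or None if the word has no valid split."""
    splits = candidate_splits(word, counts)
    if not splits:
        return None
    return max(splits, key=lambda split: counts[split[1]])[1]


# Toy unigram counts, invented for the example (not data from the paper).
counts = Counter({"voetbal": 50, "wedstrijd": 40, "stad": 30, "bus": 25})

for w in ["voetbalwedstrijd", "stadsbus", "tafel"]:
    print(w, "->", best_head(w, counts))
```

In a class-based language model, a compound for which `best_head` returns a head could then share that head's class, so an unseen compound inherits n-gram statistics from a word that was observed in training.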