• DocumentCode
    1958400
  • Title
    Spectral mismatch as the index of quality of naturalness in synthetic speech
  • Author
    Kawachale, S.P.; Gengaje, S.R.; Chitode, J.S.
  • Author_Institution
    Dept. of E&TC, M.I.T., Pune, India
  • fYear
    2009
  • fDate
    23-26 Aug. 2009
  • Firstpage
    808
  • Lastpage
    813
  • Abstract
    It is extremely difficult to build a machine that sounds identical to a human. Hence even the best text-to-speech (TTS) algorithm sounds robotic unless human speech itself is used in it. However, it is not possible to create a database of every possible word in a language. Syllable-based concatenative speech synthesis (CSS) forms new words from existing words in the database. Improper concatenation with respect to the position of the syllable leads to spectral mismatch. A first step toward overcoming this is to estimate the spectral mismatch with respect to the position of the syllable. We propose a method based on power spectral density (PSD) to estimate position-dependent spectral mismatch. The PSD is plotted for 10-millisecond samples of original, properly concatenated (PC), and improperly concatenated (IC) words. These samples are then made noise-free so that their low-amplitude peaks can be neglected. Analysis of the PSD locates the formants in the given samples. The formants of the original, properly concatenated, and improperly concatenated words are then plotted. The formant plots for original and properly concatenated words are very similar for all formants, whereas improper concatenation produces extra peaks in all formants. These extra peaks can be taken as an estimate of spectral mismatch. The results are validated using Marathi text-to-speech synthesis.
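The PSD comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`periodogram`, `prominent_peaks`, `extra_peak_count`), the Hann window, the -30 dB floor used to discard low-amplitude peaks, and the 150 Hz matching tolerance are all assumptions made for illustration. Picking local maxima of the raw PSD is only a crude stand-in for formant location.

```python
import cmath
import math


def periodogram(frame, fs):
    """Return (freqs_hz, psd) for one real-valued frame.

    Uses a periodic Hann window and a direct DFT, which is fast enough
    for a 10 ms frame (160 samples at an assumed 16 kHz sampling rate).
    """
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    xw = [x * w for x, w in zip(frame, window)]
    norm = fs * sum(w * w for w in window)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        acc = sum(xw[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        freqs.append(k * fs / n)
        psd.append(abs(acc) ** 2 / norm)
    return freqs, psd


def prominent_peaks(freqs, psd, floor_db=-30.0):
    """Frequencies of local PSD maxima within floor_db of the strongest bin.

    The floor is an assumed way of neglecting low-amplitude peaks.
    """
    threshold = max(psd) * 10 ** (floor_db / 10)
    return [
        freqs[k]
        for k in range(1, len(psd) - 1)
        if psd[k] > psd[k - 1] and psd[k] >= psd[k + 1] and psd[k] >= threshold
    ]


def extra_peak_count(reference_peaks, test_peaks, tol_hz=150.0):
    """Count test peaks with no reference peak within tol_hz (assumed tolerance)."""
    return sum(
        1 for f in test_peaks
        if all(abs(f - r) > tol_hz for r in reference_peaks)
    )
```

Under these assumptions, the count returned by `extra_peak_count` for a frame of an IC word measured against the matching frame of the original word gives a rough per-frame figure for the extra peaks the abstract associates with spectral mismatch.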
  • Keywords
    speech synthesis; concatenative speech synthesis; power spectral density; spectral mismatch; synthetic speech; text-to-speech algorithm; Acoustic noise; Cascading style sheets; Concatenated codes; Databases; Frequency; Humans; Magnetic heads; Robots; Speech analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communications, Computers and Signal Processing, 2009. PacRim 2009. IEEE Pacific Rim Conference on
  • Conference_Location
    Victoria, BC
  • Print_ISBN
    978-1-4244-4560-8
  • Electronic_ISBN
    978-1-4244-4561-5
  • Type
    conf
  • DOI
    10.1109/PACRIM.2009.5291267
  • Filename
    5291267