Title :
Algorithms for speech segmentation at syllable-level for text-to-speech synthesis system in Gujarati
Author :
Patil, Hemant A. ; Patel, T. ; Talesara, Swati ; Shah, Neil ; Sailor, Hardik ; Vachhani, Bhavik ; Akhani, Janki ; Kanakiya, Bhargav ; Gaur, Yashesh ; Prajapati, V.
Author_Institution :
Dhirubhai Ambani Inst. of Inf. & Commun. Technol. (DA-IICT), Gandhinagar, India
Abstract :
Text-to-speech (TTS) synthesizers have been an effective tool for many visually challenged people, allowing them to read through auditory feedback. TTS synthesizers built with the Festival framework require a large speech corpus, and this corpus needs to be labeled. The labeling can be done at the phoneme level or at the syllable level. TTS systems are mostly available in English; however, it has been observed that people feel more comfortable hearing their own native language. With this in mind, a Gujarati TTS synthesizer has been built. As Indian languages are syllabic in nature, the syllable is taken as the basic speech sound unit. Building a unit selection-based Gujarati TTS system requires a large labeled Gujarati corpus. Manual labeling is time-consuming and tedious. Therefore, in this work, an attempt has been made to reduce this effort by automatically generating a nearly accurate labeled speech corpus at the syllable level. To that effect, group delay-based, spectral transition measure (STM)-based, and Gaussian filter-based segmentation methods are presented and their performances are compared. The percentage of correctly labeled data is around 83 % for the Gaussian filter-based method for both male and female voices, as compared to 70 % for group delay-based labeling and 78 % for STM-based labeling. In addition, the TTS systems built from the label files generated by the above methods were evaluated by a visually challenged subject. The word correctness rate of the Gaussian filter-based TTS system is higher by 5 % (3 %) and 10 % (12 %) than that of the group delay-based and STM-based systems, respectively, built on the female (male) voice. Similarly, there is an overall reduction in the word error rate (WER) of the Gaussian filter-based approach of 8 % (2 %) and 6 % (-5 %) relative to the group delay-based and STM-based systems, respectively, built on the female (male) voice.
Keywords :
Gaussian processes; filtering theory; natural language processing; speech synthesis; English; Gaussian filter-based TTS system; Gaussian filter-based methods; Gaussian-based approach-based system; Gujarati TTS synthesizer; festival framework; phoneme-level; spectral transition measure; spectral transition measure-based system; speech corpus; speech segmentation; syllable-level; text-to-speech synthesis system; word error rate; Auditory system; Buildings; Delays; Labeling; Manuals; Speech; Synthesizers; Gaussian filter; TTS; group delay; labeling; spectral transition measure; syllable
Conference_Title :
Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference
Conference_Location :
Gurgaon
DOI :
10.1109/ICSDA.2013.6709852