District names speech corpus for Pakistani Languages

Author

Sahar Rauf;Asima Hameed;Tania Habib;Sarmad Hussain

Author_Institution

Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan

fYear

2015

Firstpage

207

Lastpage

211

Abstract

This paper presents a speech corpus developed for an Urdu automatic speech recognition (ASR) system. The corpus comprises single-word utterances from a fixed vocabulary consisting of the district names of Pakistan. The data was recorded over a telephone channel from all over Pakistan to cover six major accents: Punjabi, Urdu, Saraiki, Pashto, Sindhi, and Balochi. The data was collected in challenging acoustic environments; the major issues were silence, background noise, and alternate pronunciations, all of which can affect the performance of the system. To address these issues, comprehensive data verification and cleaning guidelines are presented. The proposed process serves as a data preprocessing step for the development of the ASR system, which has been successfully integrated into an Urdu dialog system that provides weather information for Pakistan.

Keywords

Meteorology

Publisher

IEEE

Conference_Title

2015 International Conference Oriental COCOSDA held jointly with 2015 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE)

Type

conf

DOI

10.1109/ICSDA.2015.7357893

Filename

7357893

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3713065