• DocumentCode
    2406620
  • Title
    Development of Hindi mobile communication text and speech corpus
  • Author
    Sinha, Shweta ; Agrawal, S.S. ; Olsen, Jesper
  • Author_Institution
    KIIT Coll. of Eng., Gurgaon, India
  • fYear
    2011
  • fDate
    26-28 Oct. 2011
  • Firstpage
    30
  • Lastpage
    35
  • Abstract
    This paper describes the collection of a text and audio corpus for mobile personal communication in Hindi. Hindi is the largest of the Indian languages and is the first language of more than 200 million people, who use it not only for spoken mobile communication but also for sending text messages to each other. The main script for Hindi is Devanagari, but it is not well supported by the current generation of mobile devices. The Devanagari alphabet is twice as large as the English alphabet, which makes it difficult to fit onto the small keypad of a mobile device. The aim of this project is to collect text and speech resources that can be used for training spoken language systems that aid text messaging on mobile devices, i.e. to train a speech recogniser for the mobile personal communication domain so that text can be input through dictation rather than by typing. In total we collected a text corpus of 2 million words of natural messages in 12 different domains, and a spoken corpus of 100 speakers who each spoke 630 phonetically rich sentences (about 4 hours of speech). The speech utterances were recorded at 16 kHz through 3 recording channels: a mobile phone, a headset and a desktop mounted microphone. The data sets were annotated and are available for the development of speech recognition / synthesis systems in the mobile domain.
  • Keywords
    mobile handsets; natural languages; speech recognition; speech synthesis; text analysis; Devanagari alphabet; Hindi mobile communication speech corpus; Hindi mobile communication text; Indian languages; desktop mounted microphone; headset; mobile phone; speech recogniser; speech recognition; speech synthesis systems; spoken language system training; text messaging; Databases; Educational institutions; Mobile communication; Mobile handsets; Speech; Speech recognition; Tagging; Hindi Speech; Speech data base; Text analysis; mobile communication
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Speech Database and Assessments (Oriental COCOSDA), 2011 International Conference on
  • Conference_Location
    Hsinchu
  • Print_ISBN
    978-1-4577-0930-2
  • Type
    conf
  • DOI
    10.1109/ICSDA.2011.6085975
  • Filename
    6085975