Sintéiseoir 1.0: a multidialectical TTS application for Irish

Abstract:

This paper details the development of a multidialectical text-to-speech (TTS) application, Sintéiseoir, for the Irish language. This work is being carried out in the context of Irish as a lesser-used language, where learners and other L2 speakers have limited direct exposure to L1 speakers and speech communities, and where native sound systems and vocabularies can be seen to be receding even among L1 speakers, particularly the young.

Sintéiseoir essentially implements the diphone concatenation model, albeit augmented to include phones, half-phones and, potentially, other phonic units. It is based on a platform-independent framework comprising a user interface, a set of dialect-specific tokenisation engines, a concatenation engine and a playback device. The tokenisation strategy is entirely rule-based and does not refer to dictionary look-ups. Provision has been made for prosodic processing in the framework, but this has not yet been implemented. Concatenation units are stored in the form of WAV files on the local file system.

Sintéiseoir’s user interface (UI) provides a text field that allows the user to submit a grapheme string for synthesis and a prompt to select a dialect. It also filters input to reject graphotactically invalid strings, restrict input to alphabetic characters and certain punctuation marks found in Irish orthography, and ensure that a dialect has, indeed, been selected. The UI forwards the filtered grapheme string to the appropriate tokenisation engine. This searches for specified substrings and maps them to corresponding tokens that themselves correspond to concatenation units. The resultant token string is then forwarded to the concatenation engine, which retrieves the relevant concatenation units, extracts their audio data and combines them in a new unit. This is then forwarded to the playback device.

The terms of reference for the initial development of Sintéiseoir specified that it should be capable of uttering, individually, the 99 most common Irish lemmata in the dialects of An Spidéal, Músgraí Uí Fhloínn and Gort a’ Choirce, which are internally consistent dialects within the Connacht, Munster and Ulster regions, respectively, of the dialect continuum. Audio assets to satisfy this requirement have already been prepared, and have been found to produce reasonably accurate output. The tokenisation engine is, however, capable of processing a wider range of input strings and, when required concatenation units are found to be unavailable, returns a report via the user interface.
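
The processing chain described in the abstract (input filter, rule-based tokenisation, WAV concatenation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The rule table, the function names, the choice of longest-match-first substring matching, and the `<token>.wav` file naming scheme are all assumptions made for illustration; graphotactic validation and prosodic processing are omitted.

```python
import re
import wave
from pathlib import Path

# Dialect names taken from the abstract.
DIALECTS = {"An Spidéal", "Músgraí Uí Fhloínn", "Gort a' Choirce"}

# Hypothetical rule table for one dialect: grapheme substring -> token.
# These entries are placeholders, not the application's actual rules.
RULES = {
    "ch": "x",
    "c": "k",
    "a": "a",
    "t": "t",
}

# Assumed character set: letters, acute-accented vowels, apostrophe, hyphen.
VALID_INPUT = re.compile(r"[a-záéíóú'-]+")


def filter_input(text, dialect, dialects=DIALECTS):
    """Normalise the input and reject it if no dialect is selected
    or if it contains characters outside the assumed set."""
    text = text.strip().lower()
    if dialect not in dialects:
        raise ValueError("no valid dialect selected")
    if not VALID_INPUT.fullmatch(text):
        raise ValueError("input contains characters outside the allowed set")
    return text


def tokenise(text, rules):
    """Map substrings to tokens, trying the longest rule first at each
    position. No dictionary look-up is involved."""
    keys = sorted(rules, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                tokens.append(rules[key])
                i += len(key)
                break
        else:
            raise ValueError(f"no rule matches at position {i}: {text[i:]!r}")
    return tokens


def concatenate(tokens, unit_dir, out_path):
    """Join the audio data of each token's WAV unit into one new WAV file.
    Returns the list of tokens whose units are missing (empty on success)."""
    unit_dir = Path(unit_dir)
    missing = [t for t in tokens if not (unit_dir / f"{t}.wav").exists()]
    if missing:
        return missing  # the caller can report these via the UI
    params, frames = None, []
    for t in tokens:
        with wave.open(str(unit_dir / f"{t}.wav"), "rb") as unit:
            if params is None:
                # Assumes every unit shares the first unit's audio format.
                params = unit.getparams()
            frames.append(unit.readframes(unit.getnframes()))
    with wave.open(str(out_path), "wb") as out:
        out.setparams(params)
        out.writeframes(b"".join(frames))
    return []
```

Under these assumptions, a caller would pass the result of `filter_input` to `tokenise`, then hand the token list to `concatenate` together with the directory holding the selected dialect's units. A non-empty return value from `concatenate` corresponds to the report of unavailable concatenation units mentioned in the abstract.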