DocumentCode :
2330818
Title :
Lightly supervised learning of text normalization: Russian number names
Author :
Sproat, Richard
Author_Institution :
Google, Inc, United States
fYear :
2010
fDate :
12-15 Dec. 2010
Firstpage :
436
Lastpage :
441
Abstract :
Most areas of natural language processing today make heavy use of automatic inference from large corpora. One exception is text normalization for such applications as text-to-speech synthesis, where it is still the norm to build grammars by hand for such tasks as handling abbreviations or the expansion of digit sequences into number names. One reason for this, apart from the general lack of interest in text normalization, has been the lack of annotated data. For many languages, however, there is abundant unannotated data that can be brought to bear on these problems. This paper reports on the inference of number-name expansion in Russian, a particularly difficult language due to its complex inflectional system. A database of several million spelled-out number names was collected from the web and mapped to digit strings using an overgenerating number-name grammar. The same overgenerating number-name grammar can be used to produce candidate expansions into number names, which are then scored using a language model trained on the web data. Our results suggest that it is possible to infer expansion modules for very complex number-name systems from unannotated data, using a minimum of hand-compiled seed data.
Keywords :
inference mechanisms; learning (artificial intelligence); natural language processing; speech synthesis; Russian number names; automatic inference; digit strings; hand compiled seed data; lightly supervised learning; overgenerating number name grammar; text normalization; text-to-speech synthesis
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Spoken Language Technology Workshop (SLT), 2010 IEEE
Conference_Location :
Berkeley, CA
Print_ISBN :
978-1-4244-7904-7
Electronic_ISBN :
978-1-4244-7902-3
Type :
conf
DOI :
10.1109/SLT.2010.5700892
Filename :
5700892