Title :
Towards Learning Morphology for Under-Resourced Fusional and Agglutinating Languages
Author :
Shalonova, Ksenia ; Golénia, Bruno ; Flach, Peter
Author_Institution :
Dept. of Comput. Sci., Univ. of Bristol, Bristol
fDate :
7/1/2009 12:00:00 AM
Abstract :
In this paper, we describe a novel and effective approach for automatically decomposing a word into a stem and suffixes. Russian and Turkish are used as exemplars of fusional and agglutinating languages, respectively. Rather than relying on corpus counts, we use a small number of word-pairs as training data, which can make the approach particularly suitable for under-resourced languages. For fusional languages, we initially learn a tree of aligned suffix rules (TASR) from word-pairs. The tree is built top-down, from general to specific rules, using suffix rule frequency and rule subsumption, and is executed bottom-up, i.e., the most specific rule that fires is chosen. TASR is used to segment a word form into a stem and a suffix sequence. For fusional languages, learning through generation (using TASR) is essential for proper stem extraction. Subsequently, an unsupervised segmentation algorithm, graph-based unsupervised suffix segmentation (GBUSS), is used to segment the suffix sequence. GBUSS employs a suffix graph in which node merging, guided by an information-theoretic measure, generates suffix sequences. The approach, experimentally validated on Russian, is shown to be highly effective. For agglutinating languages, only GBUSS is needed for word decomposition. Promising experimental results are obtained for Turkish.
Keywords :
information theory; natural language processing; tree data structures; unsupervised learning; word processing; agglutinating language; fusional language; graph-based unsupervised suffix segmentation; rule subsumption; suffix rule frequency; tree of aligned suffix rules; word-pairs; Data analysis; Frequency; Fusion power generation; Learning automata; Morphology; Natural languages; Spatial databases; Speech analysis; Speech synthesis; Training data; Fusional and agglutinating languages; morphology learning; under-resourced languages; weakly supervised learning
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
DOI :
10.1109/TASL.2009.2015694