Title :
Kanji-to-Hiragana conversion based on a length-constrained n-gram analysis
Author :
Picone, Joseph ; Staples, Tom ; Kondo, Kazuhiro ; Arai, Nozomi
Author_Institution :
Res. & Dev. Centre, Texas Instrum., Tsukuba, Japan
Date :
11/1/1999
Abstract :
A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: Kanji, Hiragana, and Katakana. We describe an algorithm for converting conventional Japanese orthography to a Hiragana-like symbol set that closely approximates the most common pronunciation of the text. The algorithm is based on two hypotheses: (1) the correct reading of a Kanji character can be determined by examining a small number of adjacent characters, and (2) the number of such combinations required in a dictionary is manageable. The algorithm converts the input text by selecting the most probable sequence of orthographic units (n-grams) that can be concatenated to form the input text. In closed-set testing, the n-gram algorithm performed better than several public domain algorithms, achieving a sentence error rate of 3% on a wide range of text material. Though the focus of this paper is written Japanese, the pattern matching algorithm described here has applications to similar problems in other languages.
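The selection step described in the abstract can be illustrated with a small dynamic programming sketch. This is not the authors' implementation: the toy dictionary `NGRAMS`, its probabilities, the function name `convert`, and the `max_len` parameter are all assumptions made for illustration, and the score used here (the product of per-entry probabilities) is only one plausible reading of "most probable sequence".

```python
import math

# Hypothetical toy dictionary: orthographic n-gram -> (Hiragana reading, probability).
# The entries and probabilities are illustrative only, not taken from the paper.
NGRAMS = {
    "今日": ("きょう", 0.9),
    "今": ("いま", 0.6),
    "日": ("ひ", 0.4),
    "日本": ("にほん", 0.9),
    "日本語": ("にほんご", 0.95),
    "本": ("ほん", 0.5),
    "語": ("ご", 0.8),
    "の": ("の", 1.0),
}


def convert(text, ngrams=NGRAMS, max_len=3):
    """Return the reading of the most probable n-gram segmentation of text.

    Only n-grams of at most max_len characters are considered (the length
    constraint). Returns None if the dictionary cannot cover the text.
    """
    n = len(text)
    # best[i] holds (log probability, reading) for the best cover of text[:i],
    # or None if text[:i] cannot be formed from dictionary entries.
    best = [None] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue
            entry = ngrams.get(text[start:end])
            if entry is None:
                continue
            reading, prob = entry
            score = best[start][0] + math.log(prob)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + reading)
    return None if best[n] is None else best[n][1]


print(convert("今日"))             # two-character entry wins over 今 + 日
print(convert("今日", max_len=1))  # forced to read each character in isolation
```

In this sketch, the longer entry for 今日 supplies the adjacent-character context that selects the reading きょう, while restricting `max_len` to 1 yields the character-by-character reading いまひ. A real dictionary would need many more entries, and estimated rather than invented probabilities.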
Keywords :
dynamic programming; grammars; natural languages; pattern matching; speech processing; speech synthesis; Hiragana-like symbol set; Japanese language; Japanese orthography; Kanji character; Kanji-to-Hiragana conversion; Katakana; closed-set testing; dictionary; input text conversion; length-constrained n-gram analysis; n-gram algorithm; orthographic units; pattern matching algorithm; performance; phonetic symbols; pronunciation; public domain algorithms; sentence error rate; text material; written Japanese; written form; Databases; Dictionaries; Instruments; Research and development; Speech analysis; Speech recognition; Writing
Journal_Title :
Speech and Audio Processing, IEEE Transactions on