Author :
Jiang, Wei ; Guan, Yi ; Wang, Xiao-long
Abstract :
Unknown word recognition (UWR) is a difficult and foundational task in lexical processing and content-based understanding, and it can improve many text-processing applications, such as information extraction, question answering systems, and electronic meeting systems. However, a unified approach has difficulty exploiting additional domain knowledge features, so its performance cannot easily be improved further, since UWR has been proved to be an NP-hard problem. This paper presents a novel method for the UWR task that divides UWR into several hard sub-tasks, each of which typically poses different difficulties; accordingly, several language models are adopted to solve the specific sub-tasks, so that each model is applied to the problems it handles best. First, a class-based trigram model is used for basic word segmentation, aided by an absolute smoothing algorithm to overcome data sparseness; a maximum entropy (ME) model is used to recognize named entities; and new word detection uses variance and a conditional random fields algorithm. Second, multi-knowledge features are effectively extracted and utilized throughout the processing. Our system participated in the Second International Chinese Word Segmentation Bakeoff (SIGHAN2005) and achieved an overall F-measure of 97.2% in the MSRA open test.
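To make the smoothing step concrete, the following is a minimal sketch of a trigram model with interpolated absolute discounting. It is not the authors' implementation: it assumes that "absolute smoothing" refers to absolute discounting, and the class name `AbsoluteDiscountTrigram`, the discount value 0.5, the uniform base distribution, and the toy class-label sequences are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


class AbsoluteDiscountTrigram:
    """Trigram model with interpolated absolute discounting (illustrative sketch)."""

    def __init__(self, sequences, discount=0.5):
        self.d = discount                   # assumed discount, 0 < d <= 1
        self.ngram = Counter()              # counts of (history + token) tuples
        self.context = Counter()            # how often each history was seen
        self.followers = defaultdict(set)   # distinct tokens seen after a history
        for seq in sequences:
            toks = ["<s>", "<s>"] + list(seq) + ["</s>"]
            for i in range(2, len(toks)):
                for n in (1, 2, 3):         # unigram, bigram, trigram events
                    hist = tuple(toks[i - n + 1:i])
                    self.ngram[hist + (toks[i],)] += 1
                    self.context[hist] += 1
                    self.followers[hist].add(toks[i])
        # Reserve one extra slot so an unseen token keeps non-zero mass.
        self.base_size = len(self.followers[()]) + 1

    def prob(self, word, history=()):
        """P(word | last two tokens of history)."""
        return self._prob(word, tuple(history)[-2:])

    def _prob(self, word, hist):
        # Lower-order estimate: shorter history, or uniform at the bottom.
        lower = self._prob(word, hist[1:]) if hist else 1.0 / self.base_size
        c_hist = self.context[hist]
        if c_hist == 0:                     # unseen history: use lower order only
            return lower
        count = self.ngram[hist + (word,)]
        # Mass removed by discounting is redistributed via the lower order.
        lam = self.d * len(self.followers[hist]) / c_hist
        return max(count - self.d, 0.0) / c_hist + lam * lower


if __name__ == "__main__":
    # Toy sequences of word-class labels (hypothetical data).
    data = [["PER", "said", "LOC"], ["PER", "visited", "LOC"], ["ORG", "said", "LOC"]]
    lm = AbsoluteDiscountTrigram(data)
    print(lm.prob("LOC", ("PER", "said")))  # trigram seen in training
    print(lm.prob("ORG", ("PER", "said")))  # unseen trigram, still non-zero
```

Because the model operates on arbitrary token sequences, a class-based variant can be approximated by mapping each word to its class label before training, as the toy data does.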
Keywords :
context-free grammars; maximum entropy methods; natural language processing; smoothing methods; text analysis; word processing; NP-hard problem; class-based trigram; conditional random fields algorithm; content-based understanding; data sparseness; language models; lexical processing; maximum entropy model; multiknowledge source method; out-of-vocabulary word recognition; question answer system; smoothing algorithm; text-based processing; unknown word recognition model; word detection; word segmentation; Application software; Computer science; Data mining; Entropy; Feature extraction; Hidden Markov models; Information retrieval; System testing; Conditional Random Fields; Unknown word recognition