Title :
A Research on Multi-feature Word-Level Paraphrase Extracting System Based on Context
Author :
He Xian-Jiang ; Yu Zhong-hua
Author_Institution :
Coll. of Comput. Sci., Sichuan Univ., Chengdu, China
Abstract :
The essence of paraphrasing lies in retrieving correct paraphrases. Word-level paraphrasing is sensitive to context, and its critical indicator is interchangeability. This paper presents a two-stage, multi-feature method for extracting word-level Chinese paraphrases. In stage one, data mining techniques are used to extract the target word and its candidate paraphrases from large corpora and the Internet. In stage two, a stratified probability statistical model is established, and seven similarity feature values are calculated and later used to train a binary classifier. Finally, candidate paraphrases with high similarity values are selected. Experimental results show that (1) retrieving candidate paraphrases from large corpora through data mining has practical value, yielding on average 3.1 correct paraphrases per word; (2) the binary classifier is effective in selecting the correct paraphrases, with an accuracy of 0.676; and (3) 34.32% of the retrieved paraphrases cannot be found in the Chinese Expanded Synonym Dictionary, which shows that the paraphrase retrieval method presented in this paper extends traditional paraphrase extraction methods.
Keywords :
data mining; dictionaries; feature extraction; information filtering; natural language processing; pattern classification; probability; statistical analysis; text analysis; Chinese expanded synonym dictionary; binary classifier training; context; data mining technology; interchangeability; large-size corpuses; multifeature word-level paraphrase extracting system; paraphrase retrieving method; paraphrases filtering; similarity feature values; stratified probability statistical model; target word; two-stage multifeature word-level Chinese paraphrase extracting method; Accuracy; Context; Data mining; Feature extraction; Semantics; Testing; Training; binary classifier; corpuses; multi-feature; paraphrase;
Conference_Title :
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-3093-0
DOI :
10.1109/MINES.2012.43