Regional Information Center for Science and Technology - Automated Extraction of Lexicon Applied both to Chinese and Japanese Corpora

DocumentCode :

2168386

Title :

Automated Extraction of Lexicon Applied both to Chinese and Japanese Corpora

Author :

Shujing Ke ; Shiu, Simon C. K. ; Goertzel, Ben ; Yu, Guanding ; Xiaodong Shi ; Changle Zhou

Author_Institution :

Dept. of Comput., Hong Kong Polytech. Univ., Hong Kong, China

fYear :

2012

fDate :

26-28 Nov. 2012

Firstpage :

Lastpage :

Abstract :

A novel statistical approach is described that enables the automated extraction of large word lists from unsegmented corpora without reliance on existing dictionaries. The approach makes two main contributions. First, it is very generic and has been successfully applied separately to both Chinese and Japanese. Second, it makes no use of punctuation information, so unlike most existing methods it does not need to pre-process the corpora to remove punctuation marks or to pre-segment the corpora at punctuation marks. Our experiment resulted in the extraction of 14,087 Chinese words and 15,553 Japanese words. The precision achieved is over 80% for two-character Chinese words, over 90% for one-character Japanese words, and over 70% for two-character Japanese words. We have also successfully extracted most single-character words, including common functional characters in Chinese (those meaning in, and, or, 's, also, and a family name), hiragana such as "?", "?", "?" in Japanese, and punctuation marks such as ",", "", "?".
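
The abstract does not spell out the statistic used, so the following is only a minimal sketch of the general idea of dictionary-free extraction from unsegmented text. It assumes pointwise mutual information (PMI) over adjacent character pairs as a stand-in cohesion score; this is a swapped-in, generic technique, not the paper's measure. The function names `extract_candidates` and `precision` and the thresholds `min_count` and `min_pmi` are hypothetical. As in the abstract, punctuation is not stripped and is counted like any other character.

```python
import math
from collections import Counter


def extract_candidates(text, min_count=2, min_pmi=1.0):
    """Return two-character strings whose characters co-occur more than chance.

    PMI is an assumption of this sketch, not the measure used in the paper.
    """
    # Whitespace is dropped only to join lines; punctuation stays in the stream.
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return {}

    unigrams = Counter(chars)
    bigrams = Counter(a + b for a, b in zip(chars, chars[1:]))
    n_uni = len(chars)
    n_bi = len(chars) - 1

    scores = {}
    for pair, count in bigrams.items():
        if count < min_count:
            continue  # too rare to estimate reliably
        p_pair = count / n_bi
        p_first = unigrams[pair[0]] / n_uni
        p_second = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_pair / (p_first * p_second))
        if pmi >= min_pmi:
            scores[pair] = pmi
    return scores


def precision(extracted, gold):
    """Fraction of extracted items found in a manually verified word set."""
    extracted = set(extracted)
    if not extracted:
        return 0.0
    return len(extracted & set(gold)) / len(extracted)
```

Precision figures like those reported above would correspond to calling `precision` on the extracted list together with a hand-checked set of valid words; single-character words would need a separate criterion that this sketch does not cover.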

Keywords :

natural language processing; statistical analysis; Chinese corpora; Chinese word extraction; Japanese corpora; Japanese word extraction; lexicon extraction; statistical approach; unsegmented corpora; Combination Degree; Punctuation; Statistics; Word extraction;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Advanced Computer Science Applications and Technologies (ACSAT), 2012 International Conference on

Conference_Location :

Kuala Lumpur

Print_ISBN :

978-1-4673-5832-3

Type :

conf

DOI :

10.1109/ACSAT.2012.15

Filename :

6516318

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2168386