DocumentCode :
476253
Title :
Research on automatic acquisition of domain terms
Author :
Liu, Juan ; Liu, Yuan-Chao ; Jiang, Wei ; Wang, Xiao-long
Author_Institution :
Dept. of Comput. Sci. & Technol., Harbin Inst. of Technol., Harbin
Volume :
5
fYear :
2008
fDate :
12-15 July 2008
Firstpage :
3026
Lastpage :
3031
Abstract :
To address natural language processing tasks more precisely, it is important to build a system for the automatic acquisition of domain terms. This paper presents a method for acquiring domain terms automatically from raw, unsegmented text. The raw domain corpus is first pre-processed. Candidate words are then extracted automatically using information entropy and the log-likelihood ratio, and an open-domain lexicon is used to remove general words so that only domain terms are retained. Finally, a confidence measure filters non-meaningful words out of the candidate term set to improve acquisition accuracy, and the domain-specific lexicon is constructed. Experimental results show that this simple method extracts most of the domain terms efficiently. The extracted terms have been applied effectively in a personalized Chinese word segmentation system.
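The abstract names information entropy and the log-likelihood ratio as the two statistics behind candidate extraction, but this record does not give the formulas. The sketch below is therefore only an illustration under stated assumptions: it uses the Shannon entropy of the characters adjacent to a candidate string (a common boundary test for unsegmented text) and Dunning's log-likelihood ratio for a two-part candidate. All function names are hypothetical and may not match the paper's actual formulation.

```python
import math
from collections import Counter


def neighbor_counts(text, cand):
    """Count the characters immediately left and right of each occurrence of cand."""
    left, right = Counter(), Counter()
    start = text.find(cand)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(cand)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(cand, start + 1)
    return left, right


def branching_entropy(neighbors):
    """Shannon entropy (in nats) of a neighbor-character distribution.

    Assumption: high entropy on both sides suggests cand is a free-standing unit.
    """
    total = sum(neighbors.values())
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def _binom_ll(k, n, p):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), treating 0*log(0) as 0."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1 - p)
    return ll


def log_likelihood_ratio(c12, c1, c2, n):
    """Dunning's -2 log(lambda) for a pair (w1, w2).

    c12: count of w1 followed by w2; c1, c2: counts of w1 and w2;
    n: total number of adjacent pairs observed.
    """
    p = c2 / n                                        # pooled P(w2)
    p1 = c12 / c1 if c1 > 0 else 0.0                  # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1) if n > c1 else 0.0     # P(w2 | not w1)
    return 2 * (
        _binom_ll(c12, c1, p1) + _binom_ll(c2 - c12, n - c1, p2)
        - _binom_ll(c12, c1, p) - _binom_ll(c2 - c12, n - c1, p)
    )
```

In a pipeline of the kind the abstract outlines, a string would presumably be kept as a candidate only when its entropy and log-likelihood scores exceed chosen thresholds. The thresholds, the open-domain lexicon filter, and the confidence measure are not specified in this record and are left out of the sketch.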
Keywords :
information retrieval; natural language processing; text analysis; automatic domain term acquisition; candidate word extraction; domain candidate term set; domain corpus; information entropy; log-likelihood ratio; open-domain lexicon; personalized Chinese word segmentation system; Computer science; Cybernetics; Data mining; Machine learning; Materials science and technology; Raw materials; Statistics; Tagging; Automatic Term Extraction; Domain Terms
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Cybernetics, 2008 International Conference on
Conference_Location :
Kunming
Print_ISBN :
978-1-4244-2095-7
Electronic_ISBN :
978-1-4244-2096-4
Type :
conf
DOI :
10.1109/ICMLC.2008.4620926
Filename :
4620926