Keyword extraction of web pages based on domain thesaurus

Author

Guowan He ; Jie Wang ; Yafeng Zhang ; Yan Peng

Author_Institution

Sch. of Manage., Capital Normal Univ., Beijing, China

fYear

2014

Firstpage

310

Lastpage

314

Abstract

This paper presents a method for extracting keywords from web pages based on a domain thesaurus. The method scores candidate keywords using traditional statistical features, such as frequency and location, and also adjusts the weight of each candidate according to its relations in the domain thesaurus. It can therefore identify domain keywords that occur with low frequency but carry more information in a specific area. Taking keyword extraction from web pages in the environment domain as an example, the paper introduces the framework and algorithm of the method. Experimental results show that, compared with the traditional TF-IDF method, this method extracts keywords from environment-related web pages more effectively, with an average gain of 20% in recall rate and 15% in accuracy rate.
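The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: a location-weighted frequency score that is boosted when a candidate term appears in the domain thesaurus and is related to other terms on the same page. The function name `score_keywords`, the weight constants, and the dictionary-of-sets thesaurus layout are all assumptions made for illustration, not the authors' algorithm.

```python
from collections import Counter

# Hypothetical weights; the abstract does not state the values used in the paper.
TITLE_WEIGHT = 3.0      # a hit in the title counts more than a hit in the body
BODY_WEIGHT = 1.0
THESAURUS_BOOST = 0.5   # extra weight per thesaurus signal


def score_keywords(title_tokens, body_tokens, thesaurus):
    """Rank candidate terms, best first.

    thesaurus maps a domain term to the set of terms it is related to
    (an assumed layout for this sketch).
    """
    # Statistical part: frequency weighted by where the term occurs.
    weighted = Counter()
    for tok in title_tokens:
        weighted[tok] += TITLE_WEIGHT
    for tok in body_tokens:
        weighted[tok] += BODY_WEIGHT

    total = sum(weighted.values())
    candidates = set(weighted)

    scores = {}
    for term, weight in weighted.items():
        # Thesaurus part: reward domain terms, and reward them further
        # for each related term that also appears on the page.
        in_domain = 1 if term in thesaurus else 0
        related_on_page = thesaurus.get(term, set()) & candidates
        boost = 1.0 + THESAURUS_BOOST * (in_domain + len(related_on_page))
        scores[term] = (weight / total) * boost

    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    toy_thesaurus = {
        "eutrophication": {"pollution", "water"},
        "pollution": {"eutrophication"},
    }
    ranked = score_keywords(
        ["river", "pollution"],
        ["the", "the", "the", "the", "eutrophication", "water", "river"],
        toy_thesaurus,
    )
    print(ranked)
```

In this toy page, "eutrophication" and "water" each occur once in the body, but only "eutrophication" is a thesaurus term with related terms on the page, so it ranks higher. This mirrors the abstract's claim about low-frequency domain terms without reproducing the paper's actual weighting.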

Keywords

Internet; statistical analysis; Web pages; domain thesaurus; keyword extraction method; Accuracy; Feature extraction; Support vector machines; Thesauri; Keyword extraction; Keyword of web pages; Keyword weight

fLanguage

English

Publisher

IEEE

Conference_Titel

Cloud Computing and Intelligence Systems (CCIS), 2014 IEEE 3rd International Conference on

Print_ISBN

978-1-4799-4720-1

Type

conf

DOI

10.1109/CCIS.2014.7175749

Filename

7175749