DocumentCode
3258989
Title
Automatic Keyword Extraction Using Linguistic Features
Author
Hu, Xinghua ; Wu, Bin
Author_Institution
Baskin Sch. of Eng., California Univ., Santa Cruz, CA
fYear
2006
fDate
Dec. 2006
Firstpage
19
Lastpage
23
Abstract
This paper describes a novel keyword extraction algorithm, position weight (PW), that uses linguistic features to represent the importance of a word's position in a document. Topical terms and their previous-term and next-term co-occurrence collections are extracted. Three methods are employed to measure the degree of correlation between a topical term and its co-occurrence terms: term frequency inverse term frequency (TFITF), position weight inverse position weight (PWIPW), and chi-square (χ²). The co-occurrence terms that have the highest degree of correlation and exceed a co-occurrence frequency threshold are combined with the original topical term to form a final keyword. Because the algorithm has linear computational complexity, the vector space of documents in a large corpus or on the boundless Web can be quickly represented by sets of keywords, which makes it possible to retrieve large-scale information quickly and effectively.
Keywords
computational complexity; data mining; document handling; information retrieval; CHI-square; automatic keyword extraction; boundless Web; cooccurrence collections; cooccurrence frequency threshold; cooccurrence terms; large corpus; linear computational complexity; linguistic features; position weight inverse position weight; term frequency inverse term frequency; topical terms; vector space; word position; Computational complexity; Content based retrieval; Data mining; Feature extraction; Frequency measurement; Information retrieval; Large-scale systems; Position measurement; Vectors; World Wide Web
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference on
Conference_Location
Hong Kong
Print_ISBN
0-7695-2702-7
Type
conf
DOI
10.1109/ICDMW.2006.36
Filename
4063591