Title :
Newly-born keyword extraction under limited knowledge resources based on sentence similarity verification
Author :
Kongachandra, R. ; Kimpant, Chom ; Suwanapong, T. ; Chamnongthai, Kosin
Author_Institution :
King Mongkut's Univ. of Technol. Thonburi, Bangkok, Thailand
Abstract :
Most keyword extraction systems rely on external knowledge bases such as lexicons, ontological databases, keyword lists and training corpora to extract keywords. The drawbacks of using external knowledge bases are formatting incompatibility, domain constraints, and difficulty of updating. This work proposes a keyword extraction system that uses only the knowledge within the document, namely its title and content, to extract keywords. A candidate is accepted as a keyword if it has the same meaning as a seed keyword. The seed keywords are extracted beforehand from the document title: we syntactically parse the title and use all noun phrase combinations as seed keywords. These seed keywords serve as an index to pick the relevant sentences in the document, which are then automatically converted into a knowledge representation called semantic graphs. Simultaneously, the candidate keywords are processed with the same steps as in seed keyword extraction, except that the candidates are taken from all sentences in the document rather than from the document title. The similarity scores between keyword semantic graphs and candidate semantic graphs are calculated based on conceptual graph operations. The experiments use 100 documents in domains such as business, computer and psychology. We compared the experimental results to the author-determined keywords and to two practical keyword extraction systems, i.e. EXTRACTOR and KEA. The experimental results show acceptable performance, with 54.81%, 73.87% and 77.94% when compared to the keywords determined by the author, EXTRACTOR and KEA, respectively.
Keywords :
indexing; knowledge representation; natural languages; semantic networks; text analysis; conceptual graph operations; document content; document title; index; keyword extraction systems; knowledge representation; limited knowledge resources; noun phrase combinations; performance; seed keywords; semantic graphs; sentence similarity verification; syntactic parsing; Art; Business; Computer languages; Data mining; Databases; Information retrieval; Knowledge representation; Ontologies; Psychology; Statistics;
Conference_Title :
Communications and Information Technology, 2004. ISCIT 2004. IEEE International Symposium on
Print_ISBN :
0-7803-8593-4
DOI :
10.1109/ISCIT.2004.1413905