Annotating Documents by Wikipedia Concepts

Author

Schonhofen, P.

Author_Institution

Comput. & Autom. Res. Inst., Hungarian Acad. of Sci., Budapest

Volume

1

fYear

2008

fDate

9-12 Dec. 2008

Firstpage

461

Lastpage

467

Abstract

We present a technique that reliably labels words or phrases of an arbitrary document with the Wikipedia articles (concepts) that best describe their meaning. First, it scans the document content, and when it finds a word sequence matching the title of a Wikipedia article, it attaches the article to the constituent word(s). The collected articles are then scored on three factors: (1) how many other detected articles they are semantically related to, according to the Wikipedia link structure; (2) how specific the concept they represent is; and (3) how similar the title by which they were detected is to their "official" title. If a text location refers to multiple Wikipedia articles, only the one with the highest score is retained. Experiments on 24,000 randomly selected Wikipedia article bodies showed that 81% of the phrases annotated by article authors were correctly identified. Moreover, of the 5 concepts that our algorithm deemed most important in a final ranking, on average 72% were indeed marked in the original text.
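
The pipeline described in the abstract (title matching, three-factor scoring, keeping the best article per location) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the three factors are computed or combined, so the following are assumptions made for illustration: the product used to combine the factors, the `difflib` string ratio used for title similarity, the caller-supplied `specificity` values, and the names `find_candidates`, `score`, `annotate`, `title_index`, and `links`. The sketch only disambiguates articles competing for the same span and leaves overlapping spans untouched.

```python
from difflib import SequenceMatcher


def find_candidates(tokens, title_index, max_len=3):
    """Map each (start, end) token span that matches a known title to its articles."""
    candidates = {}
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            surface = " ".join(tokens[start:end]).lower()
            if surface in title_index:
                candidates[(start, end)] = list(title_index[surface])
    return candidates


def score(article, surface, detected, links, specificity):
    """Combine the three factors; the product is an assumption, not the paper's formula."""
    # Factor 1: number of other detected articles linked to or from this one.
    related = sum(
        1
        for other in detected
        if other != article
        and (other in links.get(article, ()) or article in links.get(other, ()))
    )
    # Factor 2: specificity of the concept (hypothetical, supplied by the caller).
    spec = specificity.get(article, 1.0)
    # Factor 3: similarity between the matched text and the official title.
    similarity = SequenceMatcher(None, surface.lower(), article.lower()).ratio()
    return (1 + related) * spec * similarity


def annotate(tokens, title_index, links, specificity):
    """Keep only the highest-scoring article for each matched span."""
    candidates = find_candidates(tokens, title_index)
    detected = {a for articles in candidates.values() for a in articles}
    result = {}
    for (start, end), articles in candidates.items():
        surface = " ".join(tokens[start:end])
        result[(start, end)] = max(
            articles,
            key=lambda a: score(a, surface, detected, links, specificity),
        )
    return result


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    tokens = "the jaguar is a big cat".split()
    title_index = {"jaguar": ["Jaguar", "Jaguar Cars"], "big cat": ["Big cat"]}
    links = {"Jaguar": {"Big cat"}, "Big cat": {"Jaguar"}}
    print(annotate(tokens, title_index, links, specificity={}))
```

In the toy data, the ambiguous word "jaguar" resolves to the animal because that article is linked to another detected article ("Big cat") and its official title matches the surface text exactly.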

Keywords

Web sites; document handling; information retrieval; Wikipedia articles; Wikipedia concepts; arbitrary document; document content; word sequence matching; Automation; Books; Collaboration; IP networks; Intelligent agent; Joining processes; Natural language processing; Ontologies; Text recognition; Wikipedia; annotation; labeling; link prediction;

fLanguage

English

Publisher

IEEE

Conference_Titel

Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on

Conference_Location

Sydney, NSW

Print_ISBN

978-0-7695-3496-1

Type

conf

DOI

10.1109/WIIAT.2008.56

Filename

4740493