Regional Information Center for Science and Technology (RICEST) - Bootstrapping Semantic Annotation for Content-Rich HTML Documents

DocumentCode :

2848141

Title :

Bootstrapping Semantic Annotation for Content-Rich HTML Documents

Author :

Mukherjee, Saikat ; Ramakrishnan, I.V. ; Singh, Amarjeet

Author_Institution :

Dept. of Comput. Sci., Stony Brook Univ., NY, USA

fYear :

2005

fDate :

05-08 April 2005

Firstpage :

583

Lastpage :

593

Abstract :

An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.

Keywords :

data integrity; hypermedia markup languages; ontologies (artificial intelligence); semantic Web; statistical analysis; HTML documents; Web sources; bootstrapping technique; content-rich documents; data consistency; semantic Web processing; statistical model; Computer science; HTML; Labeling; Next generation networking; Ontologies; Pricing; Resource description framework; Semantic Web; Vehicles; XML

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on

ISSN :

1084-4627

Print_ISBN :

0-7695-2285-8

Type :

conf

DOI :

10.1109/ICDE.2005.28

Filename :

1410176

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2848141