Regional Information Center for Science and Technology (RICEST) - Extracting the semantic content of web pages via repeated structures

DocumentCode :

639084

Title :

Extracting the semantic content of web pages via repeated structures

Author :

Zheng He ; Hangzai Luo ; Jianping Fan ; Xiao Liu

Author_Institution :

East China Normal Univ., Shanghai, China

fYear :

2013

fDate :

15-19 July 2013

Firstpage :

Lastpage :

Abstract :

Web pages may carry semantics that are very important to the authors and the readers. For many reasons, authors may insert content that is irrelevant to the underlying semantics of the page at various positions, such as advertisements, guide bars, and links. As a result, using all the data of a web page to model its semantics may not give good results. In this paper, we propose a framework that extracts the real semantic content from web pages via repeated structures in the HTML data. Our algorithm first detects the real semantic blocks in web pages via repeated structure segmentation, then extracts the real semantic content of the pages from those blocks.
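
The two-step idea in the abstract (find repeated structures, then read the content out of them) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the record does not describe in detail. It assumes that a "repeated structure" means sibling elements whose subtrees have the same tag shape, and that the group carrying the most text is the semantic one. The names `Node`, `TreeBuilder`, `repeated_groups`, `extract_content`, and the `min_repeat` threshold are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A very small DOM node: a tag plus mixed text/element content."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.content = tag, parent, []

    def elements(self):
        return [c for c in self.content if isinstance(c, Node)]

    def signature(self):
        # Structural shape only: tag names, nested. Text and attributes are ignored.
        return f"{self.tag}({','.join(c.signature() for c in self.elements())})"

    def text(self):
        parts = (c.text() if isinstance(c, Node) else c for c in self.content)
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Assumption: script and style bodies are never semantic content.
        if data.strip() and self.current.tag not in ("script", "style"):
            self.current.content.append(data.strip())


def repeated_groups(root, min_repeat=3):
    """Return lists of sibling nodes that share the same structural signature."""
    groups, stack = [], [root]
    while stack:
        node = stack.pop()
        by_shape = defaultdict(list)
        for child in node.elements():
            by_shape[child.signature()].append(child)
            stack.append(child)
        groups.extend(g for g in by_shape.values() if len(g) >= min_repeat)
    return groups


def extract_content(html, min_repeat=3):
    """Return the text of each block in the repeated group with the most text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    groups = repeated_groups(builder.root, min_repeat)
    if not groups:
        return []
    best = max(groups, key=lambda g: sum(len(n.text()) for n in g))
    return [n.text() for n in best]
```

Calling `extract_content(page_html)` on a page that has a short navigation list and a longer list of article blocks would return the article texts, because the text-length heuristic favours them. Real pages are much messier, so the exact-signature match would likely need to be relaxed to a similarity measure.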

Keywords :

Web sites; hypermedia markup languages; information retrieval; HTML data; Web page semantics model; repeated structure segmentation; semantic block detection; semantic content extraction; Data mining; Feature extraction; HTML; Nickel; Semantics; Visualization; Web pages; Repeated Structure; Semantic modeling; Web page;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Multimedia and Expo Workshops (ICMEW), 2013 IEEE International Conference on

Conference_Location :

San Jose, CA

Type :

conf

DOI :

10.1109/ICMEW.2013.6618450

Filename :

6618450

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=639084