Title :
Extracting the semantic content of web pages via repeated structures
Author :
Zheng He ; Hangzai Luo ; Jianping Fan ; Xiao Liu
Author_Institution :
East China Normal Univ., Shanghai, China
Abstract :
Web pages carry semantics that are important to both their authors and their readers. For various reasons, authors may also insert content that is irrelevant to the underlying semantics of the page, such as advertisements, guide bars, and links, at different positions in the page. As a result, using all the data of a web page to model its semantics may not give good results. In this paper, we propose a framework that extracts the real semantic content from web pages via repeated structures in the HTML data. Our algorithm first detects the real semantic blocks in web pages via repeated structure segmentation, then extracts the real semantic content of the pages from those blocks.
Keywords :
Web sites; hypermedia markup languages; information retrieval; HTML data; Web page semantics model; repeated structure segmentation; semantic block detection; semantic content extraction; Data mining; Feature extraction; HTML; Nickel; Semantics; Visualization; Web pages; Repeated Structure; Semantic modeling; Web page;
Conference_Title :
Multimedia and Expo Workshops (ICMEW), 2013 IEEE International Conference on
Conference_Location :
San Jose, CA
DOI :
10.1109/ICMEW.2013.6618450
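
The abstract describes a two-stage pipeline: find blocks built from repeated structures, then pull the content out of those blocks. The record gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a "repeated structure" is a run of sibling elements with the same tag and the same sequence of child tags, and it uses a link-density cutoff as a stand-in for deciding which repeated blocks are irrelevant. The names `extract_semantic_content`, `min_repeat`, and `max_link_ratio` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not page content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # element children only
        self.content = []   # text strings and element children, in document order


class TreeBuilder(HTMLParser):
    """Builds a simple element tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.content.append(data.strip())


def signature(node):
    # Assumption: two siblings "repeat" if tag and child-tag sequence match.
    return (node.tag, tuple(child.tag for child in node.children))


def full_text(node):
    if node.tag in SKIP_TAGS:
        return ""
    parts = [item if isinstance(item, str) else full_text(item)
             for item in node.content]
    return " ".join(p for p in parts if p)


def link_text(node):
    if node.tag == "a":
        return full_text(node)
    return " ".join(t for t in (link_text(c) for c in node.children) if t)


def repeated_groups(root, min_repeat=3):
    """Returns lists of sibling nodes that share the most common signature."""
    groups, stack = [], [root]
    while stack:
        node = stack.pop()
        stack.extend(reversed(node.children))  # pre-order, document order
        counts = Counter(signature(c) for c in node.children)
        if counts:
            sig, n = counts.most_common(1)[0]
            if n >= min_repeat:
                groups.append([c for c in node.children if signature(c) == sig])
    return groups


def extract_semantic_content(html, min_repeat=3, max_link_ratio=0.5):
    """Returns the text of repeated records, skipping link-dominated groups."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    records = []
    for group in repeated_groups(builder.root, min_repeat):
        texts = [full_text(n) for n in group]
        total = sum(len(t) for t in texts)
        linked = sum(len(link_text(n)) for n in group)
        # Hypothetical heuristic: mostly-link groups are treated as guide bars.
        if total and linked / total <= max_link_ratio:
            records.extend(t for t in texts if t)
    return records
```

Calling `extract_semantic_content(html)` on a page with a link-only menu and a list of article teasers would, under these assumptions, return only the teaser texts. Both thresholds are illustrative defaults and would need tuning on real pages.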