DocumentCode
2698219
Title
Web informative content block detecting based on entropy and parent-child relationship in DOM
Author
Ding, Yanhui ; Li, Qingzhong ; Yan, Zhongmin ; Dong, Yongquan
Author_Institution
Sch. of Comput. Sci. & Technol., Shandong Univ., Jinan
fYear
2008
fDate
20-23 June 2008
Firstpage
175
Lastpage
178
Abstract
To increase the commercial value and accessibility of their pages, most sites publish them with redundant information such as navigation panels, advertisements, and copyright announcements. This redundant information appears in almost all pages of a website, which increases the index size of general search engines and causes page topics to drift. In this paper, we propose an informative content block detection system called WICBDPCR (Web Informative Content Block Detecting based on Parent-Child Relationship in the document object model), which applies information theory to the DOM tree in order to detect the informative structure. Experiments on several real commercial Web sites show high precision and recall rates, which validate WICBDPCR's practical applicability.
Keywords
Web sites; document handling; entropy; information retrieval; search engines; tree data structures; Web informative content block detection system; document object model tree; entropy method; information theory; parent-child relationship; search engine; Automation; Computer science; Data mining; Entropy; IEEE news; Information theory; Navigation; Object detection; Search engines; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Information and Automation, 2008. ICIA 2008. International Conference on
Conference_Location
Changsha
Print_ISBN
978-1-4244-2183-1
Electronic_ISBN
978-1-4244-2184-8
Type
conf
DOI
10.1109/ICINFA.2008.4607991
Filename
4607991