DocumentCode :
2631386
Title :
Primary Content Block Detection from Web Page Clusters through Entropy and Semantic Distance
Author :
Meng, Jun ; Liu, Qiu-shui ; Li, Ke-qiu
Author_Institution :
Dept. of Comput. Sci. & Eng., Dalian Univ. of Technol., Dalian
fYear :
2008
fDate :
18-20 June 2008
Firstpage :
5
Lastpage :
5
Abstract :
This paper proposes a new method, the ENP-DOM tree, which extends the document object model (DOM) tree by adding two properties, entropy and relativity, to some nodes. Semantic distance is used to extract the primary content accurately from pages of the same source, based on three observations: noise blocks always have high entropy within a given Web site; primary content blocks usually contain few link words and many text words; and useful links are contained in useful content blocks and have a close semantic distance to the page title. The proposed method identifies primary content blocks with higher precision and recall and reduces the storage requirement for search engines, resulting in smaller indexes, faster search times, and better user satisfaction. Extensive experiments compare the proposed method with existing methods, and the results show that it outperforms them in both recall and precision.
Keywords :
Web sites; document handling; entropy; search engines; trees (mathematics); ENP-DOM tree; Web page clusters; Web site; document object module tree; entropy; noise blocks; primary content block detection; primary content blocks; search engines; semantic distance; storage requirement; user satisfaction; Computer science; Data mining; Entropy; Filters; HTML; Object detection; Partitioning algorithms; Robustness; Search engines; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Innovative Computing Information and Control, 2008. ICICIC '08. 3rd International Conference on
Conference_Location :
Dalian, Liaoning
Print_ISBN :
978-0-7695-3161-8
Electronic_ISBN :
978-0-7695-3161-8
Type :
conf
DOI :
10.1109/ICICIC.2008.430
Filename :
4603194