DocumentCode
3113185
Title
Basic semantic units based web page content extraction
Author
Wang, Jingqi ; Chen, Qingcai ; Wang, Xiaolong ; Guo, Hongzhi
Author_Institution
Shenzhen Grad. Sch., Intell. Comput. Res. Center, Harbin Inst. of Technol., Harbin
fYear
2008
fDate
12-15 Oct. 2008
Firstpage
1489
Lastpage
1494
Abstract
Web page content extraction can be achieved by node-based or segmentation-based algorithms built on top of the document object model (DOM). However, node-based algorithms often remove content embedded as anchor text, while segmentation-based algorithms cannot distinguish irrelevant text from content text when both fall into the same segment. Neither kind of algorithm preserves the paragraph information of the original page. In this paper, a new basic semantic unit (BSU), with granularity between that of a DOM tree node and that of a content block, is defined. Two methods based on the BSU, one using clustering and one using heuristic rules, are developed to extract page content. The clustering method achieves the best precision, 96.88%, while the heuristic rules obtain the best F1-value, 95.28%. Compared with the baseline method, which uses text blocks segmented by <table> and <div> as Web page content, the F1-values are improved by 8.92% and 9.42% respectively.
Keywords
content management; information retrieval; pattern clustering; semantic Web; text analysis; tree data structures; Web page content extraction; anchor text; clustering method; document object model tree; heuristic rule; node-based algorithm; segmentation-based algorithm; semantic unit; Clustering algorithms; Clustering methods; Data mining; Displays; Explosions; HTML; Size measurement; Sliding mode control; Testing; Web pages; basic semantic unit; content extraction; line break tag; page segmentation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on
Conference_Location
Singapore
ISSN
1062-922X
Print_ISBN
978-1-4244-2383-5
Electronic_ISBN
1062-922X
Type
conf
DOI
10.1109/ICSMC.2008.4811496
Filename
4811496