DocumentCode
2120002
Title
Extracting Web News Using Tag Path Patterns
Author
Gongqing Wu ; Xindong Wu
Author_Institution
Sch. of Comput. Sci. & Inf. Eng., Hefei Univ. of Technol., Hefei, China
Volume
1
fYear
2012
fDate
4-7 Dec. 2012
Firstpage
588
Lastpage
595
Abstract
Accurately extracting the content of Web news is a popular and significant problem in Web Intelligence. Many Web news sites have similar structures and layout styles, and there are potential correlations between Web content layouts and tag path patterns. Compared with other extraction features, such as HTML tags, literal words and visual features, a tag path pattern not only addresses content segments well but also generalizes better. However, can we accurately extract Web news using only tag path patterns? Motivated by this question, we propose the PPWIE extraction model. We design an extraction algorithm, WEtr, that uses self-defined tag path patterns, and then define a special tag path pattern called the distinguishing tag path pattern. In addition, to tackle the NPC-hard problem in path pattern mining, we propose a polynomial-time (ln|n|+1)-approximation algorithm, MPM, in which n indicates the scale of positive samples. Our experiments show that our integrated method WEtr+MPM in PPWIE achieves more than 98% precision, recall and F-score on real-world datasets.
Keywords
Web sites; approximation theory; computational complexity; data mining; information retrieval; F-score value; NPC-hard problem; PPWIE extraction model; WEtr extraction algorithm design; WEtr+MPM integration method; Web Intelligence; Web content layouts; Web news content extraction features; Web news sites; content segments; distinguishing tag path pattern; path pattern mining; polynomial-time approximation algorithm; precision value; recall value; self-defined tag path patterns; Distinguishing Tag Path Pattern; Pattern Mining; Web Information Extraction; Web News;
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on
Conference_Location
Macau
Print_ISBN
978-1-4673-6057-9
Type
conf
DOI
10.1109/WI-IAT.2012.107
Filename
6511946