DocumentCode
525671
Title
Web wrapper generation using tree alignment and transfer learning
Author
Xia, YingJu ; Zhang, Shu ; Yu, Hao
Author_Institution
Fujitsu R&D Center Co., Ltd., Beijing, China
fYear
2010
fDate
23-25 June 2010
Firstpage
410
Lastpage
415
Abstract
This paper studies web wrapper generation for pages from forum, blog, and news web sites. More and more web pages are dynamically generated from a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate a wrapper from such pages. We present a new tree alignment algorithm to find the best matching structure of the input web pages. A form of linear regression is employed to obtain the weights of the different tag matchings. Based on the alignment, we merge the trees into one union tree whose nodes record statistical information gathered from multiple web pages. We use a transfer learning method to find the most likely content block and use the alignment algorithm to detect repeated patterns on the union tree. We then generate a wrapper to extract data from web pages. Experimental results show that the method achieves high extraction accuracy and steady performance.
Keywords
Internet; information retrieval; learning (artificial intelligence); Web blog; Web pages; Web sites; Web wrapper generation; data extraction; databases; input Web page matching structure; transfer learning method; tree alignment algorithm; union tree; Clustering algorithms; Data mining; Databases; Information services; Internet; Learning systems; Linear regression; Web pages; Web sites; Writing; alignment (key words); tree; wrapper;
fLanguage
English
Publisher
IEEE
Conference_Titel
Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on
Conference_Location
Chengdu
Print_ISBN
978-1-4244-7324-3
Electronic_ISBN
978-89-88678-22-0
Type
conf
Filename
5542885