DocumentCode :
525671
Title :
Web wrapper generation using tree alignment and transfer learning
Author :
Xia, YingJu ; Zhang, Shu ; Yu, Hao
Author_Institution :
Fujitsu R&D Center Co., Ltd., Beijing, China
fYear :
2010
fDate :
23-25 June 2010
Firstpage :
410
Lastpage :
415
Abstract :
This paper studies web wrapper generation for pages from forum, blog, and news web sites. More and more web pages are dynamically generated from a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate a wrapper from such pages. We present a new tree alignment algorithm that finds the best matching structure of the input web pages. A form of linear regression is employed to obtain the weights of the different tag matchings. Based on the alignment, we merge the trees into one union tree whose nodes record statistical information gathered from multiple web pages. We use a transfer learning method to find the most likely content block and use the alignment algorithm to detect repeated patterns on the union tree. We then generate a wrapper to extract data from web pages. Experimental results show that the method achieves high extraction accuracy and steady performance.
Keywords :
Internet; information retrieval; learning (artificial intelligence); Web blog; Web pages; Web sites; Web wrapper generation; data extraction; databases; input Web page matching structure; transfer learning method; tree alignment algorithm; union tree; Clustering algorithms; Data mining; Databases; Information services; Internet; Learning systems; Linear regression; Web pages; Web sites; Writing; alignment (key words); tree; wrapper;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-7324-3
Electronic_ISBN :
978-89-88678-22-0
Type :
conf
Filename :
5542885