DocumentCode :
1654847
Title :
Layered and Weighted Tree Matching Algorithm for Automatic Web Data Records Recognition
Author :
Shengsheng Shi ; Fuliang Quan ; Tao Xie ; Chunfeng Yuan ; Yihua Huang
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
fYear :
2013
Firstpage :
55
Lastpage :
60
Abstract :
Automated analysis and recognition of web data records is important for improving the automation of web information extraction. Several typical methods analyze similar data records using the Simple Tree Matching (STM) algorithm. However, STM assigns every element the same weight and ignores the differing impact of different element types. This introduces considerable noise into data record analysis and ultimately lowers the precision of data record recognition. This paper proposes a Layered and Weighted Tree Matching (LWTM) algorithm. First, we propose a layered filtering strategy to filter out potential noise elements. Then, we propose a weighted tree matching algorithm that assigns different weights to different types of HTML elements according to their importance for data record analysis. Finally, we combine the layered filtering strategy with the weighted tree matching algorithm to further improve the analysis results. Experimental results show that the proposed LWTM algorithm outperforms STM-based methods.
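The contrast the abstract draws between STM and a weighted variant can be illustrated with a minimal sketch. This is not the paper's LWTM implementation: the `Node` class, the weight table `HYPOTHETICAL_WEIGHTS`, and the tag-based `drop_tags` filter are assumptions made for illustration only, and the actual weights and layered filtering rules are not given in this record. With the default weight of 1 per node, `tree_match` reduces to the standard Simple Tree Matching recurrence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tag-tree node (not from the paper)."""
    tag: str
    children: list = field(default_factory=list)


def tree_match(a, b, weight=lambda tag: 1.0):
    """Score the maximum order-preserving match between two tag trees.

    With the default weight (1.0 per node) this is Simple Tree Matching.
    Passing another weight function gives a weighted variant in which
    each matched node contributes weight(tag) instead of 1.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best score using the first i children of a
    # and the first j children of b.
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + tree_match(a.children[i - 1], b.children[j - 1], weight),
            )
    return table[m][n] + weight(a.tag)


# Assumed weights for illustration; the paper's values are not in this record.
HYPOTHETICAL_WEIGHTS = {"a": 2.0, "img": 2.0, "br": 0.5}


def example_weight(tag):
    return HYPOTHETICAL_WEIGHTS.get(tag, 1.0)


def drop_tags(node, noise_tags):
    """Stand-in for a noise filter: remove subtrees rooted at noise tags."""
    kept = [drop_tags(c, noise_tags) for c in node.children
            if c.tag not in noise_tags]
    return Node(node.tag, kept)


record_1 = Node("div", [Node("a"), Node("br"), Node("span")])
record_2 = Node("div", [Node("a"), Node("span")])

unweighted = tree_match(record_1, record_2)
weighted = tree_match(record_1, record_2, example_weight)
filtered = drop_tags(record_1, {"br"})
```

In this toy example both records match on `div`, `a`, and `span`; the weighted score differs from the unweighted one only because the assumed table gives `a` a larger weight.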
Keywords :
Internet; data analysis; information filtering; trees (mathematics); LWTM algorithm; STM algorithm; Web information extraction; automated Web data record analysis; automatic Web data records recognition; data record recognition; layered filtering strategy; layered tree matching algorithm; simple tree matching; weighted tree matching algorithm; Algorithm design and analysis; Data mining; Filtering; HTML; Vegetation; Web pages; HTML tag tree; data record recognition; layered and weighted tree matching; layered filtering strategy; simple tree matching; weighted tree matching;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Web Information System and Application Conference (WISA), 2013 10th
Conference_Location :
Yangzhou
Print_ISBN :
978-1-4799-3218-4
Type :
conf
DOI :
10.1109/WISA.2013.19
Filename :
6778610