DocumentCode
583146
Title
Deep Web Repeated Pattern Discovering Based on the Largest Block Strategy
Author
Ye, Feiyue ; Tang, Haibo ; Luo, Xiangfeng
Author_Institution
Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China
fYear
2012
fDate
27-29 Oct. 2012
Firstpage
1082
Lastpage
1086
Abstract
Repeated patterns are common in the query result pages of deep web sites, and mining them gives access to the deep web's back-end data. So far, most algorithms for discovering repeated patterns rely on traditional web information extraction methods, whose recall and accuracy are not high, so obtaining the repeated pattern accurately and completely remains difficult. We propose a method based on the largest block strategy to discover such patterns. The core of the method is to use the largest block strategy to discover the repeated pattern layer: we quickly navigate to the region containing the entity data, analyze the subtree in that region, and finally obtain the simplified repeated pattern of the deep web site. According to the experimental results, this method obtains repeated pattern data more accurately and more completely than the traditional methods. It can also address the multi-pattern problem, which other methods have not yet solved.
Keywords
Internet; Web sites; data handling; Web information extraction methods; deep web repeated pattern discovering; largest block strategy; pattern data; pattern mining; query result pages; Accuracy; Clustering algorithms; Data mining; Feature extraction; HTML; Deep Web; Repeated Pattern; Web Information Extraction; the Largest Block
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer and Information Technology (CIT), 2012 IEEE 12th International Conference on
Conference_Location
Chengdu
Print_ISBN
978-1-4673-4873-7
Type
conf
DOI
10.1109/CIT.2012.220
Filename
6392057