Deep Web Repeated Pattern Discovering Based on the Largest Block Strategy

Author

Ye, Feiyue ; Tang, Haibo ; Luo, Xiangfeng

Author_Institution

Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China

fYear

2012

fDate

27-29 Oct. 2012

Firstpage

1082

Lastpage

1086

Abstract

Repeated patterns are a common phenomenon in the query result pages of deep web sites, and the back-end data of the deep web can be accessed by mining these patterns. So far, most algorithms for discovering repeated patterns rely on traditional web information extraction methods, but their recall and accuracy are not high. Obtaining the repeated pattern accurately and completely remains difficult. We propose a method based on the largest block strategy to discover such patterns. The core of the method is to use the largest block strategy to discover the repeated pattern layer: we quickly navigate to the region containing the entity data, analyze the subtree in this area, and finally obtain the simplified repeated pattern of the deep web site. According to the experimental results, this method obtains repeated pattern data more accurately and more completely than traditional methods. It can also address the multi-pattern problem, which other methods have not yet solved.
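
The abstract does not spell out the algorithm, so the following is only a rough sketch of how the three steps it names (navigate to the data region, analyze the subtree, report the repeated pattern) might look, not the authors' implementation. Every name here (Node, TreeBuilder, find_data_region, repeated_patterns, discover) is hypothetical, and so are the two key choices: block size is measured as the amount of text in a subtree, and the descent stops once no single child holds more than half of its parent's text.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node (hypothetical helper, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def size(node):
    """Assumed block-size measure: total text characters in the subtree."""
    return node.text_len + sum(size(child) for child in node.children)


def signature(node):
    """Structural signature: the tag plus the signatures of its children."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def find_data_region(root, dominance=0.5):
    """Follow the largest child while it dominates its parent's content."""
    node = root
    while node.children:
        largest = max(node.children, key=size)
        total = size(node)
        if total == 0 or size(largest) <= dominance * total:
            break  # content is spread over several children: assumed data region
        node = largest
    return node


def repeated_patterns(region, min_repeat=2):
    """Count child signatures in the region; keep those that repeat."""
    counts = Counter(signature(child) for child in region.children)
    return {sig: n for sig, n in counts.items() if n >= min_repeat}


def discover(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region = find_data_region(builder.root)
    return region.tag, repeated_patterns(region)


SAMPLE_PAGE = """<html><body>
<div id="nav"><a>Home</a></div>
<ul id="results">
<li><a>Title one</a><span>Price 10</span></li>
<li><a>Title two</a><span>Price 20</span></li>
<li><a>Title three</a><span>Price 30</span></li>
</ul>
</body></html>"""

region_tag, patterns = discover(SAMPLE_PAGE)
```

On the sample page the descent passes through html and body, settles on the ul element, and reports a single li pattern that occurs three times. Because repeated_patterns returns every signature that meets min_repeat, a region holding two different record layouts would yield two entries, which is one simple way to think about the multi-pattern case the abstract mentions.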

Keywords

Internet; Web sites; data handling; Web information extraction methods; deep web repeated pattern discovering; largest block strategy; pattern data; pattern mining; query result pages; Accuracy; Clustering algorithms; Data mining; Feature extraction; HTML; Deep Web; Repeated Pattern; Web Information Extraction; the Largest Block

fLanguage

English

Publisher

ieee

Conference_Titel

Computer and Information Technology (CIT), 2012 IEEE 12th International Conference on

Conference_Location

Chengdu

Print_ISBN

978-1-4673-4873-7

Type

conf

DOI

10.1109/CIT.2012.220

Filename

6392057