• DocumentCode
    583146
  • Title

    Deep Web Repeated Pattern Discovering Based on the Largest Block Strategy

  • Author

    Ye, Feiyue ; Tang, Haibo ; Luo, Xiangfeng

  • Author_Institution
    Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China
  • fYear
    2012
  • fDate
    27-29 Oct. 2012
  • Firstpage
    1082
  • Lastpage
    1086
  • Abstract
    Repeated patterns are common in the query result pages of deep web sites, and mining them gives access to the deep web's back-end data. Most existing algorithms for discovering repeated patterns rely on traditional web information extraction methods, whose recall and accuracy are not high, so obtaining repeated patterns accurately and completely remains difficult. We propose a method that uses the largest block strategy to discover the repeated pattern layer. The method quickly navigates to the region containing the entity data, analyzes the subtree in that region, and finally derives a simplified repeated pattern for the deep web site. Experimental results show that the method obtains repeated pattern data more accurately and more completely than traditional methods. It also addresses the multi-pattern problem, which other methods have not yet solved.
  • Keywords
    Internet; Web sites; data handling; Web information extraction methods; deep web repeated pattern discovering; largest block strategy; pattern data; pattern mining; query result pages; Accuracy; Clustering algorithms; Data mining; Feature extraction; HTML; Deep Web; Repeated Pattern; Web Information Extraction; the Largest Block
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Technology (CIT), 2012 IEEE 12th International Conference on
  • Conference_Location
    Chengdu
  • Print_ISBN
    978-1-4673-4873-7
  • Type

    conf

  • DOI
    10.1109/CIT.2012.220
  • Filename
    6392057