• DocumentCode
    525671
  • Title

    Web wrapper generation using tree alignment and transfer learning

  • Author

    Xia, YingJu ; Zhang, Shu ; Yu, Hao

  • Author_Institution
    Fujitsu R&D Center Co., Ltd., Beijing, China
  • fYear
    2010
  • fDate
    23-25 June 2010
  • Firstpage
    410
  • Lastpage
    415
  • Abstract
    This paper studies web wrapper generation for pages from forum, blog, and news web sites. More and more web pages are dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate wrappers from such pages. We present a new tree alignment algorithm to find the best matching structure of the input web pages. A form of linear regression is employed to obtain the weights of the different tag matchings. Based on the alignment, we merge the trees into one union tree whose nodes record statistical information gathered from multiple web pages. We use a transfer learning method to find the most likely content block and use the alignment algorithm to detect repeated patterns on the union tree. We then generate a wrapper to extract data from web pages. Experimental results show that the method achieves high extraction accuracy and steady performance.
  • Keywords
    Internet; information retrieval; learning (artificial intelligence); Web blog; Web pages; Web sites; Web wrapper generation; data extraction; databases; input Web page matching structure; transfer learning method; tree alignment algorithm; union tree; Clustering algorithms; Data mining; Information services; Learning systems; Linear regression; Writing; alignment (key words); tree; wrapper
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Software Engineering and Data Mining (SEDM), 2010 2nd International Conference on
  • Conference_Location
    Chengdu
  • Print_ISBN
    978-1-4244-7324-3
  • Electronic_ISBN
    978-89-88678-22-0
  • Type

    conf

  • Filename
    5542885