• DocumentCode
    2768341
  • Title
    Web News Extraction Based on Path Pattern Mining
  • Author
    Wu, Gong-Qing ; Wu, Xindong ; Hu, Xue-Gang ; Li, Hai-Guang ; Liu, Ying ; Xu, Ren-Gan
  • Author_Institution
    Sch. of Comput. Sci. & Inf. Eng., Hefei Univ. of Technol., Hefei, China
  • Volume
    7
  • fYear
    2009
  • fDate
    14-16 Aug. 2009
  • Firstpage
    612
  • Lastpage
    617
  • Abstract
    Many Web news sites have similar structures and layout styles. Our extensive case studies indicate a potential relationship between Web content layouts and path patterns. Compared with the delimiting features of Web content, path patterns have several advantages, such as high positioning accuracy, ease of use, and strong pervasive performance. Consequently, this paper proposes a Web information extraction model whose path patterns are constructed by a path pattern mining algorithm. Our experimental data set is obtained by randomly selecting news Web pages from the CNN website. With a reasonable tolerance threshold, the experimental results show that the average precision is above 99% and the average recall is 100% when Web information extraction is integrated with our path pattern mining algorithm. Path patterns produced by the mining algorithm perform much better than a priori extraction rules configured from domain knowledge.
  • Keywords
    Internet; Web sites; data mining; information retrieval; CNN website; Web content layouts; Web information extraction model; Web news extraction; Web news sites; news Web pages; path pattern mining; path patterns; pervasive performance; Cellular neural networks; Computer science; Data mining; Explosions; Fuzzy systems; HTML; Internet; Knowledge engineering; Navigation; Web pages; Web news; information extraction; path pattern; pattern mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery, 2009. FSKD '09. Sixth International Conference on
  • Conference_Location
    Tianjin
  • Print_ISBN
    978-0-7695-3735-1
  • Type
    conf
  • DOI
    10.1109/FSKD.2009.672
  • Filename
    5360082