• DocumentCode
    3113185
  • Title

    Basic semantic units based web page content extraction

  • Author

    Wang, Jingqi ; Chen, Qingcai ; Wang, Xiaolong ; Guo, Hongzhi

  • Author_Institution
    Shenzhen Grad. Sch., Intell. Comput. Res. Center, Harbin Inst. of Technol., Harbin
  • fYear
    2008
  • fDate
    12-15 Oct. 2008
  • Firstpage
    1489
  • Lastpage
    1494
  • Abstract
    Web page content extraction can be achieved by node-based or segmentation-based algorithms built on top of the document object model (DOM). However, node-based algorithms often remove content embedded as anchor text, while segmentation-based algorithms cannot distinguish irrelevant text from content text when both fall into the same segment. Neither kind of algorithm preserves the paragraph information of the original page. In this paper, a new basic semantic unit (BSU) is defined, with a granularity between that of nodes in the DOM tree and that of content blocks. Two different BSU-based methods, one using clustering and one using heuristic rules, are developed to extract page content. The clustering method achieves the best precision, 96.88%, while the heuristic rules obtain the best F1-value, 95.28%. Compared with the baseline method, which uses text blocks segmented by <table> and <div> as Web page content, the F1-values are improved by 8.92% and 9.42%, respectively.
  • Keywords
    content management; information retrieval; pattern clustering; semantic Web; text analysis; tree data structures; Web page content extraction; anchor text; clustering method; document object model tree; heuristic rule; node-based algorithm; segmentation-based algorithm; semantic unit; Clustering algorithms; Clustering methods; Data mining; Displays; Explosions; HTML; Size measurement; Sliding mode control; Testing; Web pages; basic semantic unit; content extraction; line break tag; page segmentation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on
  • Conference_Location
    Singapore
  • ISSN
    1062-922X
  • Print_ISBN
    978-1-4244-2383-5
  • Electronic_ISBN
    1062-922X
  • Type

    conf

  • DOI
    10.1109/ICSMC.2008.4811496
  • Filename
    4811496