• DocumentCode
    751063
  • Title

    WISDOM: Web intrapage informative structure mining based on document object model

  • Author

    Kao, Hung-Yu ; Ho, Jan-Ming ; Chen, Ming-Syan

  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
  • Volume
    17
  • Issue
    5
  • fYear
    2005
  • fDate
    5/1/2005 12:00:00 AM
  • Firstpage
    614
  • Lastpage
    627
  • Abstract
    To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining intrapage informative structure in news Web sites in order to find and eliminate redundant information. Note that intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained and informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web intrapage informative structure mining based on the document object model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is built by expanding the set using proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
  • Keywords
    Internet; content management; data mining; document handling; information retrieval; information theory; search engines; tree searching; DOM tree knowledge; WISDOM; Web intrapage informative structure mining; content Web site; document object model; informative block searching algorithm; news Web site; Data mining; Entropy; IEEE news; Information theory; Joining processes; Merging; Navigation; Scalability; Search engines; Web pages; DOM; Intrapage informative structure; entropy; information extraction
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2005.84
  • Filename
    1411741