DocumentCode :
751063
Title :
WISDOM: Web intrapage informative structure mining based on document object model
Author :
Kao, Hung-Yu ; Ho, Jan-Ming ; Chen, Ming-Syan
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
Volume :
17
Issue :
5
fYear :
2005
fDate :
5/1/2005
Firstpage :
614
Lastpage :
627
Abstract :
To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining intrapage informative structure in news Web sites in order to find and eliminate redundant information. Note that an intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained and informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web intrapage informative structure mining based on the document object model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is built by expanding the set using proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
Keywords :
Internet; content management; data mining; document handling; information retrieval; information theory; search engines; tree searching; DOM tree knowledge; WISDOM; Web intrapage informative structure mining; content Web site; document object model; informative block searching algorithm; news Web site; Data mining; Entropy; IEEE news; Information theory; Joining processes; Merging; Navigation; Scalability; Search engines; Web pages; DOM; Intrapage informative structure; entropy; information extraction
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2005.84
Filename :
1411741