Title :
An Approach to Extracting Central URLs on Catalog Page
Author :
Bai, He ; Wang, JinLin ; Li, Ye
Author_Institution :
Nat. Network New Media Eng. Res. Center, Chinese Acad. of Sci., Beijing
Abstract :
Catalog pages form the intermediate layer in the architecture of a standard Web site; research on information retrieval for this kind of page can therefore help improve a Web crawler's efficiency. A page is called "catalog-style" if its main body is displayed as a sequence of regular entries and the central link in each entry carries the page's major information. Here, we propose a central-URL extraction approach that automatically recognizes the effective information in the main segment of a catalog page. Our approach combines machine learning classification with DOM (document object model) tree based analysis. For each page, we represent every block node, mainly DIV and TABLE elements, by a set of content-based and structure-based features, which serve as the input to a support vector machine that determines whether the node belongs to the "main body". After the main semantic block is identified, a DOM tree based algorithm that applies heuristic rules for catalogs finds the central URLs in that segment. The evaluation shows encouraging results, with high recall and precision. The approach can be applied in topic-specific search engine development and other Web applications.
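The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the feature set nor the heuristic rules, so everything below is an assumption made for illustration. In particular, `pick_main_block` is a hypothetical stand-in for the trained support vector machine (it simply prefers the DIV/TABLE block with the most links), and `extract_central_urls` assumes that entries are the largest group of same-tag siblings containing links and that the central link of an entry is the anchor with the longest text. The names `Node`, `TreeBuilder`, `block_features`, and `SAMPLE_PAGE` are likewise invented here.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM-like node (illustrative, not a full DOM)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text that sits directly inside this node

    def iter(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.iter()

    def full_text(self):
        parts = (n.text.strip() for n in self.iter())
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def block_features(block):
    """Toy content- and structure-based features (assumed, not from the paper)."""
    links = [n for n in block.iter() if n.tag == "a"]
    total = len(block.full_text())
    link_text = sum(len(a.full_text()) for a in links)
    return {
        "links": len(links),
        "link_text_ratio": link_text / total if total else 0.0,
        "children": len(block.children),
    }


def pick_main_block(root):
    """Stand-in for the SVM stage: choose the DIV/TABLE block with most links."""
    blocks = [n for n in root.iter() if n.tag in ("div", "table")]
    return max(blocks, key=lambda b: block_features(b)["links"], default=None)


def extract_central_urls(block):
    """Assumed heuristic: longest-text anchor in each repeated entry."""
    entries = []
    for node in block.iter():
        groups = {}
        for child in node.children:
            has_link = any(n.tag == "a" for n in child.iter())
            if child.tag != "a" and has_link:
                groups.setdefault(child.tag, []).append(child)
        for group in groups.values():
            if len(group) > len(entries):
                entries = group
    urls = []
    for entry in entries:
        anchors = [n for n in entry.iter() if n.tag == "a" and "href" in n.attrs]
        if anchors:
            central = max(anchors, key=lambda a: len(a.full_text()))
            urls.append(central.attrs["href"])
    return urls


SAMPLE_PAGE = """
<div id="nav"><a href="/home">Home</a> <a href="/about">About</a></div>
<div id="main"><ul>
  <li><a href="/news/1">First headline of the day</a> <a href="/c/1">3 comments</a></li>
  <li><a href="/news/2">Second headline, somewhat longer</a> <a href="/c/2">no comments</a></li>
  <li><a href="/news/3">Third headline</a> <a href="/c/3">1 comment</a></li>
</ul></div>
"""

main_block = pick_main_block(parse(SAMPLE_PAGE))
central_urls = extract_central_urls(main_block)
```

In a fuller version, the dictionary returned by `block_features` would be turned into a numeric vector and passed to a trained classifier in place of the link-count rule, which is only a placeholder for the classification stage.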
Keywords :
Internet; cataloguing; document handling; information retrieval; learning (artificial intelligence); pattern classification; search engines; support vector machines; tree data structures; DOM tree based analysis; Web crawler; Web site; catalog page; central URL extraction; content-based feature; document object model; heuristic rule; information retrieval; machine learning classification; structure-based feature; support vector machine; topic-specific search engine development; Crawlers; Data mining; Information retrieval; Search engines; Service oriented architecture; Support vector machine classification; Support vector machines; Text categorization; Uniform resource locators; Web pages; Machine Learning; Web Information Retrieval; Web Segmentation; Web URL Extraction;
Conference_Title :
Knowledge Acquisition and Modeling, 2008. KAM '08. International Symposium on
Conference_Location :
Wuhan
Print_ISBN :
978-0-7695-3488-6
DOI :
10.1109/KAM.2008.71