An Approach to Extracting Central URLs on Catalog Page

Author

Bai, He ; Wang, JinLin ; Li, Ye

Author_Institution

Nat. Network New Media Eng. Res. Center, Chinese Acad. of Sci., Beijing

fYear

2008

fDate

21-22 Dec. 2008

Firstpage

388

Lastpage

392

Abstract

Catalog pages form the intermediate layer in the architecture of a standard Web site; research on information retrieval for this kind of page can therefore help improve a Web crawler's efficiency. A page is called "catalog-style" if its main body is displayed as a sequence of regular entries, and the central link in each entry apparently contains the page's major information. Here, we propose a central-URL extraction approach, which can automatically recognize effective information from the main segment of a catalog page. Our approach combines machine learning classification and DOM (document object model) tree based analysis. For each page, we represent each block node, mainly DIV and table, by a set of content-based and structure-based features, which are used as the input of a support vector machine to determine whether the node belongs to the "main body" or not. After identifying the main semantic block, a DOM tree based algorithm that utilizes catalog heuristic rules is applied to find the central URLs in that segment. The evaluation results show that our approach achieves encouraging results with high recall and precision. It can be applied in topic-specific search engine development and other Web applications.
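
The two stages described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the feature names in `block_features`, the rule "the central link is the anchor with the longest text in each entry", and the sample markup are all hypothetical. The support vector machine stage is not reproduced; the main block is selected by hand through its `id` as a stand-in for the classifier.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM-like node (hypothetical helper, not a real DOM API)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def full_text(node):
    return node.text + "".join(full_text(c) for c in node.children)


def anchors(node):
    found = [node] if node.tag == "a" and "href" in node.attrs else []
    for child in node.children:
        found.extend(anchors(child))
    return found


def find_by_id(node, wanted):
    if node.attrs.get("id") == wanted:
        return node
    for child in node.children:
        hit = find_by_id(child, wanted)
        if hit is not None:
            return hit
    return None


def block_features(node):
    """Assumed content/structure features that a classifier could consume."""
    text_len = len(full_text(node).strip())
    links = anchors(node)
    link_text_len = sum(len(full_text(a).strip()) for a in links)
    return {
        "text_len": text_len,
        "link_count": len(links),
        "link_text_ratio": link_text_len / text_len if text_len else 0.0,
        "child_count": len(node.children),
    }


def central_urls(main_block):
    """Assumed heuristic: per entry, keep the anchor with the longest text."""
    urls = []
    for entry in main_block.children:
        links = anchors(entry)
        if links:
            best = max(links, key=lambda a: len(full_text(a).strip()))
            urls.append(best.attrs["href"])
    return urls


PAGE = """
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="main">
  <div><a href="/news/1">Council approves new budget plan</a> <a href="/tag/city">city</a></div>
  <div><a href="/news/2">Local team wins regional final</a> <a href="/tag/sport">sport</a></div>
</div>
"""

builder = TreeBuilder()
builder.feed(PAGE)
nav = find_by_id(builder.root, "nav")
main = find_by_id(builder.root, "main")  # stand-in for the SVM decision
nav_features = block_features(nav)
urls = central_urls(main)
print(urls)
```

On this sample page the sketch keeps the two headline links and discards the tag links in each entry. In a fuller version, the dictionaries returned by `block_features` would be turned into vectors and passed to a trained classifier rather than choosing the block by `id`.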

Keywords

Internet; cataloguing; document handling; information retrieval; learning (artificial intelligence); pattern classification; search engines; support vector machines; tree data structures; DOM tree based analysis; Web crawler; Web site; catalog page; central URL extraction; content-based feature; document object model; heuristic rule; information retrieval; machine learning classification; structure-based feature; support vector machine; topic-specific search engine development; Crawlers; Data mining; Information retrieval; Search engines; Service oriented architecture; Support vector machine classification; Support vector machines; Text categorization; Uniform resource locators; Web pages; Machine Learning; Web Information Retrieval; Web Segmentation; Web URL Extraction

fLanguage

English

Publisher

IEEE

Conference_Titel

Knowledge Acquisition and Modeling, 2008. KAM '08. International Symposium on

Conference_Location

Wuhan

Print_ISBN

978-0-7695-3488-6

Type

conf

DOI

10.1109/KAM.2008.71

Filename

4732851