Regional Information Center for Science and Technology - Catch Crawler: Automatic Web Information Extractor Using Style Sheet

DocumentCode :

2273195

Title :

Catch Crawler: Automatic Web Information Extractor Using Style Sheet

Author :

Shin, Kwangcheol ; Jo, Geun Sik

Author_Institution :

Sch. of Comput. & Inf. Eng., Inha Univ., Incheon

fYear :

2008

fDate :

10-11 July 2008

Firstpage :

Lastpage :

102

Abstract :

Web mining tasks require datasets that are free from noise. Commercial Web pages typically contain a great deal of noise that is irrelevant to the main content, such as navigation panels, advertisements, copyright notices, and other service links. In this paper, we present a new automatic Web information extractor called 'Catch Crawler', which uses style sheets to extract the data of interest from a target site. Style sheets are generally used to give the pages of a commercial Web site a uniform presentation. To run Catch Crawler, a user indicates the data area of interest by clicking on the data in a Web page. Catch Crawler then automatically identifies the style sheet class applied to that data and generates a dataset from the whole Web site by following the same style sheet class. Experimental results show that our approach to extracting noise-free Web data achieves over 90% accuracy on average.
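
The core idea in the abstract, keeping only the elements that share the style sheet class of the data the user selected, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ClassTextExtractor` and `extract_by_class` are hypothetical, the class name is assumed to be already known (the interactive click step is omitted), and only a single page is processed rather than a whole site.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of every element carrying a given CSS class.

    Hypothetical helper for illustration; assumes reasonably well-formed HTML.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside a matching element
        self.buffer = []    # text fragments of the current match
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Already inside a match: just track nesting.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # The matching element just closed: normalize whitespace and store.
            self.results.append(" ".join(" ".join(self.buffer).split()))
            self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_by_class(html, target_class):
    """Return the text of each element whose class list contains target_class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


page = (
    '<div class="nav">Home | About</div>'
    '<div class="article body"><p>First <b>story</b></p><br></div>'
    '<div class="ad">Buy now</div>'
    '<div class="article">Second story</div>'
)
print(extract_by_class(page, "article"))
```

In this sketch the navigation and advertisement blocks are dropped simply because they carry a different class. A crawler along the lines the abstract describes would presumably apply the same filter to every page it fetches from the target site.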

Keywords :

Web services; Web sites; data mining; feature extraction; Web mining; Web page; automatic Web information extractor; catch crawler; commercial Web pages; dataset; service links; style sheet; Application software; Cleaning; Computer applications; Conferences; Crawlers; Data mining; Entropy; Web mining; Web pages; XML; Information Extraction; Web page cleaning;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Semantic Computing and Applications, 2008. IWSCA '08. IEEE International Workshop on

Conference_Location :

Incheon

Print_ISBN :

978-0-7695-3317-9

Electronic_ISBN :

978-0-7695-3317-9

Type :

conf

DOI :

10.1109/IWSCA.2008.23

Filename :

4573159

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2273195