Title :
Catch Crawler: Automatic Web Information Extractor Using Style Sheet
Author :
Shin, Kwangcheol ; Jo, Geun Sik
Author_Institution :
Sch. of Comput. & Inf. Eng., Inha Univ., Incheon
Abstract :
Datasets should be free from noise for Web mining tasks to be carried out well. Commercial Web pages generally contain a lot of noise that is not relevant to the main content, such as navigation panels, advertisements, copyright notices, and other service links. In this paper, we present a new automatic Web information extractor called 'Catch Crawler', which uses style sheets to extract data of interest from a target site. Style sheets are generally used to give the Web pages of a commercial Web site a uniform presentation. To run Catch Crawler, a user indicates the data area of interest by clicking the data on a Web page. Catch Crawler then automatically identifies the style sheet class of that data and generates a dataset from the whole Web site by following the same style sheet class. Experimental results show that our approach to extracting noise-free Web data achieves over 90% accuracy on average.
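The core idea in the abstract, keeping only the elements that share the style sheet class of the clicked data, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ClassTextExtractor` and `extract_by_class`, the sample page, and the class name `article-body` are all hypothetical, and the sketch assumes reasonably well-formed HTML in which non-void elements are explicitly closed. It uses only Python's standard `html.parser` module and handles a single page; crawling the whole site is left out.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class.

    Hypothetical helper for illustration; assumes non-void tags are closed.
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside a matching element
        self.buffer = []    # text pieces of the current matching element
        self.results = []   # one cleaned string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Nested element inside a match: track it so we know when the match ends.
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Matching element closed: normalize whitespace and store its text.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.results.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_by_class(html, target_class):
    """Return the text of all elements in `html` whose class list contains `target_class`."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page: navigation, advertisement, and footer are the "noise".
page = """
<div class="nav">Home | News | Contact</div>
<div class="article-body">Style sheets give pages a <b>uniform</b> look.</div>
<div class="ad">Buy now!</div>
<p class="article-body highlight">Second block.</p>
<div class="footer">Copyright notice</div>
"""

print(extract_by_class(page, "article-body"))
# ['Style sheets give pages a uniform look.', 'Second block.']
```

In this sketch the class name stands in for the user's click: a real tool would first have to map the clicked element to its style sheet class, then apply the same filter to every page it crawls on the site.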
Keywords :
Web services; Web sites; data mining; feature extraction; Web mining; Web page; automatic Web information extractor; catch crawler; commercial Web pages; dataset; service links; style sheet; Application software; Cleaning; Computer applications; Conferences; Crawlers; Entropy; Web pages; XML; Information Extraction; Web page cleaning
Conference_Title :
Semantic Computing and Applications, 2008. IWSCA '08. IEEE International Workshop on
Conference_Location :
Incheon
Print_ISBN :
978-0-7695-3317-9
Electronic_ISBN :
978-0-7695-3317-9
DOI :
10.1109/IWSCA.2008.23