DocumentCode :
3088492
Title :
A Machine Learning Based Language Specific Web Site Crawler
Author :
Tadapak, Punnawat ; Suebchua, Thanaphon ; Rungsawang, Arnon
Author_Institution :
Dept. of Comput. Eng., Kasetsart Univ., Bangkok, Thailand
fYear :
2010
fDate :
14-16 Sept. 2010
Firstpage :
155
Lastpage :
161
Abstract :
We propose an approach for gathering web pages written in a specific language. The approach consists of a language predictor and a web site crawler. The language predictor is a machine-learning-based component that learns the characteristics of relevant hosts from an example host graph and uses them to estimate the language degree of a web server, that is, how likely the server is to serve web pages written in the target language. The site crawler, in turn, downloads web pages from a prioritized list of relevant servers. We evaluate crawling performance in terms of coverage and harvest rate. Preliminary experiments on a Thai web data set show promising results compared with traditional language-specific crawling methods recently proposed in the literature.
Keywords :
Internet; Web sites; information retrieval; learning (artificial intelligence); natural language interfaces; Web page; language predictor; language specific Web site crawler; machine learning; Crawlers; Feature extraction; Testing; Web pages; Web server; Language-specific web crawler; Machine-Learning; Web site crawler;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Network-Based Information Systems (NBiS), 2010 13th International Conference on
Conference_Location :
Takayama
ISSN :
2157-0418
Print_ISBN :
978-1-4244-8053-1
Electronic_ISBN :
2157-0418
Type :
conf
DOI :
10.1109/NBiS.2010.25
Filename :
5635898