DocumentCode :
166480
Title :
Thai Related Foreign Language-Specific Website Segment Crawler
Author :
Rungsawang, Arnon ; Suebchua, Tanaphol ; Manaskasemsak, Bundit
Author_Institution :
Fac. of Eng., Dept. of Comput. Eng., Kasetsart Univ., Bangkok, Thailand
fYear :
2014
fDate :
13-16 May 2014
Firstpage :
293
Lastpage :
298
Abstract :
National web archives, which preserve national knowledge for generations to come, have for years been built successfully with domain-specific web crawlers. However, this kind of crawler still misses many foreign-language web pages that are also related to the nation. In this paper, we propose a new crawling approach for collecting nation-related web pages written in a foreign language, in particular English web pages that relate to Thailand. We introduce the notion of a website segment, which groups related web pages that share the same longest directory path. Rather than assessing individual target web pages, as many traditional focused crawling approaches do, we train an ensemble classifier with several features to predict the relevancy of website segments. The most relevant website segments in the crawling frontier are then enqueued for download. Preliminary experiments on the real web space show that this approach can provide more promising harvest results than the Breadth-First and Best-First baselines for the Thai-tourism and Thai-estate topics.
Keywords :
Web sites; information retrieval systems; natural language processing; pattern classification; English Web pages; Thai related foreign language; Thai-estate topics; Thai-tourism topics; Thailand; Web space; Website segment crawler; Website segments; best-first baselines; breadth-first baselines; crawling approach; directory paths; domain-specific Web crawler; ensemble classifier; foreign language Web pages; national Web archive; national knowledge; Crawlers; Encyclopedias; Feature extraction; Internet; Training; Vectors; Web pages; focused web crawler; language-specific web crawler; topic-specific web crawler; website segment;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Advanced Information Networking and Applications Workshops (WAINA), 2014 28th International Conference on
Conference_Location :
Victoria, BC
Print_ISBN :
978-1-4799-2652-7
Type :
conf
DOI :
10.1109/WAINA.2014.56
Filename :
6844653