DocumentCode :
3022300
Title :
Crawl Topical Vietnamese Web Pages Using Genetic Algorithm
Author :
Nhan, Nguyen Quoc ; Son, Vu Tuan ; Binh, Huynh Thi Thanh ; Khanh, Tran Duc
Author_Institution :
Sch. of Inf. & Commun. Technol., Hanoi Univ. of Technol., Hanoi, Vietnam
fYear :
2010
fDate :
7-9 Oct. 2010
Firstpage :
217
Lastpage :
223
Abstract :
A focused crawler traverses the web, selecting pages relevant to a predefined topic. While browsing the Internet, it is difficult to identify relevant pages and to predict which links lead to high-quality pages. In this paper, we propose a crawler system that uses a genetic algorithm to improve crawling performance. Apart from estimating the best path to follow, our system also expands its initial keyword set using the genetic algorithm during the crawling process. To crawl Vietnamese web pages, we apply a hybrid word segmentation approach that combines automata and part-of-speech tagging techniques for the Vietnamese text classifier. We evaluate our algorithm on Vietnamese websites and report experimental results that show the efficiency of our system.
Keywords :
Internet; Web sites; automata theory; genetic algorithms; information retrieval; text analysis; Vietnamese text classifier; Vietnamese websites; automata; crawl topical Vietnamese Web pages; genetic algorithm; hybrid word segmentation; internet browsing; speech tagging; Arrays; Automata; Biological cells; Business; Crawlers; Search engines; Web pages; Focused Crawler; Genetic Algorithm; Keyword; Vietnamese Word Segmentation;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Knowledge and Systems Engineering (KSE), 2010 Second International Conference on
Conference_Location :
Hanoi
Print_ISBN :
978-1-4244-8334-1
Type :
conf
DOI :
10.1109/KSE.2010.25
Filename :
5632006