DocumentCode :
1607826
Title :
Identification of malicious web pages for crawling based on network-related attributes of web server
Author :
Hattori, Gen ; Matsumoto, Kazunori ; Ono, Chihiro ; Takishima, Yasuhiro
Author_Institution :
Intell. Media Process. Lab., KDDI R&D Labs. Inc., Saitama, Japan
fYear :
2010
Firstpage :
355
Lastpage :
361
Abstract :
In this paper, we propose an algorithm that identifies malicious Web pages for crawlers, which collect Web pages for a later task that detects malicious pages based on their content. Some organizations have to crawl Web pages automatically so that humans can check them later. However, since manually checking Web pages is expensive, the total cost would be enormous if the crawlers collected Web pages indiscriminately. Automatic checking systems can make the human task more efficient, but they cannot be used to increase the number of malicious Web pages collected. To solve these problems, we propose an efficient algorithm that determines, when crawling Web pages, whether a site includes malicious or dangerous content. The distinguishing feature of the algorithm is that it estimates the probability of a site being malicious or harmless from network-related attributes of the Web server that are derived from the URL string. These attributes are the domain name, the directory name, and the IP (Internet Protocol) address of the router nearest to the Web server. To confirm the effectiveness of the proposed algorithm, we conducted an evaluation experiment in a simulated environment, comparing the number of malicious Web pages collected by the proposed algorithm with the number collected by a random sampling algorithm. The advantage of the proposed algorithm reached a maximum of +82.8% under stable conditions. We also show an example of crawling trajectories for the proposed algorithm and for conventional crawling algorithms; in this example, the proposed algorithm collects more malicious Web pages than the conventional algorithms.
Keywords :
IP networks; Internet; security of data; IP address; URL string; Web page crawling; Web server; automatically checking system; directory name; domain name; malicious Web page identification; network-related attributes; random sampling algorithm; Crawlers; HTML; Support vector machines; Training; Web pages; Web-crawling algorithm; information filtering; network-related attributes of Web server
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Universal Communication Symposium (IUCS), 2010 4th International
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-7821-7
Type :
conf
DOI :
10.1109/IUCS.2010.5666254
Filename :
5666254