Regional Information Center for Science and Technology (RICeST) - Identification of malicious web pages for crawling based on network-related attributes of web server

DocumentCode :

1607826

Title :

Identification of malicious web pages for crawling based on network-related attributes of web server

Author :

Hattori, Gen ; Matsumoto, Kazunori ; Ono, Chihiro ; Takishima, Yasuhiro

Author_Institution :

Intell. Media Process. Lab., KDDI R&D Labs. Inc., Saitama, Japan

Year :

2010

Firstpage :

355

Lastpage :

361

Abstract :

In this paper, we propose an algorithm that identifies malicious Web pages for crawlers, which collect Web pages so that malicious pages can later be detected from their content. Some organizations need to crawl Web pages automatically for later checking by humans. However, because manual checking is expensive, the total cost would be enormous if the crawlers collected Web pages indiscriminately. Automatic checking systems can make the human task more efficient, but they cannot increase the number of malicious Web pages that are collected. To solve these problems, we propose an efficient algorithm that determines whether sites to be crawled include malicious or dangerous content. The algorithm estimates the probability that a site is malicious or harmless from network-related attributes of the Web server derived from the URL string: the domain name, the directory name, and the IP (Internet Protocol) address of the router nearest to the Web server. To confirm the effectiveness of the proposed algorithm, we conducted an evaluation experiment in a simulated environment, comparing the number of malicious Web pages collected by the proposed algorithm with the number collected by a random sampling algorithm. Under stable conditions, the proposed algorithm showed an advantage of up to 82.8%. We also show an example of crawling trajectories for the proposed algorithm and for conventional crawling algorithms. In this example, the proposed algorithm collects more malicious Web pages than the conventional algorithms.
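
As a rough illustration of the idea in the abstract, the sketch below derives the domain name and directory name from a URL string and estimates how likely a page is to be malicious from previously labeled pages that share those attributes. It is not the authors' method: the class AttributeScorer, the function url_attributes, the Laplace-smoothed frequency estimate, and the plain averaging of per-attribute estimates are all assumptions made for this example. The IP address of the nearest router cannot be read from a URL alone, so it is treated here as an optional value supplied by the caller.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def url_attributes(url, router_ip=None):
    """Return (kind, value) attribute pairs derived from a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Directory part of the path, without the file name.
    directory = posixpath.dirname(parts.path) or "/"
    attrs = [("domain", host), ("directory", host + directory)]
    if router_ip is not None:
        # Assumption: the caller obtained this address by some other means.
        attrs.append(("router", router_ip))
    return attrs


class AttributeScorer:
    """Hypothetical scorer: smoothed malicious-page frequency per attribute."""

    def __init__(self):
        # attribute -> [malicious count, total count]
        self.counts = defaultdict(lambda: [0, 0])

    def observe(self, url, malicious, router_ip=None):
        """Record one page whose label is already known."""
        for attr in url_attributes(url, router_ip):
            self.counts[attr][0] += int(malicious)
            self.counts[attr][1] += 1

    def score(self, url, router_ip=None):
        """Average the Laplace-smoothed estimates over the URL's attributes."""
        estimates = []
        for attr in url_attributes(url, router_ip):
            bad, total = self.counts.get(attr, (0, 0))
            estimates.append((bad + 1) / (total + 2))
        return sum(estimates) / len(estimates)
```

A crawler could use such a score to order its queue so that URLs sharing attributes with known malicious pages are fetched first. A URL with no previously seen attributes scores 0.5 under this smoothing.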

Keywords :

IP networks; Internet; security of data; IP address; URL string; Web page crawling; Web server; automatically checking system; directory name; domain name; malicious Web page identification; network-related attributes; random sampling algorithm; Crawlers; HTML; Support vector machines; Training; Web pages; Web-crawling algorithm; component; information filtering; network-related attributes of Web server

Language :

English

Publisher :

IEEE

Conference_Title :

2010 4th International Universal Communication Symposium (IUCS)

Conference_Location :

Beijing

Print_ISBN :

978-1-4244-7821-7

Type :

conf

DOI :

10.1109/IUCS.2010.5666254

Filename :

5666254

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1607826