Title :
Web Crawlers on a Health Related Portal: Detection, Characterisation and Implications
Author :
Jawaheer, Gawesh ; Kostkova, Patty
Abstract :
Web crawlers are automated computer programs that visit websites in order to download their content. They are employed for both non-malicious purposes (such as search engine crawlers indexing websites) and malicious purposes (such as harvesting email addresses for unsolicited email promotion and spam databases, in breach of privacy). Whatever their usage, web crawlers need to be accurately identified in any analysis of the overall traffic to a website. Visits from web crawlers, as well as from genuine users, are recorded in the web server logs. In this paper, we analyse the web server logs of NRIC, a health-related portal. We present the techniques used to identify malicious and non-malicious web crawlers from these logs, using a blacklist database and an analysis of the characteristic online behaviour of malicious crawlers. We use visualisation to carry out sanity checks throughout the crawler removal process. We illustrate these techniques on 3 months of web server logs from NRIC. We use a combination of visualisation and baseline measures from Google Analytics to demonstrate the efficacy of our techniques. Finally, we discuss the implications of our work for the analysis of web traffic to a website using web server logs and for the interpretation of the results of such analysis.
Keywords :
Web sites; data privacy; data visualisation; medical information systems; search engines; unsolicited e-mail; Google Analytics; NRIC; Web crawlers; Web server logs; Web traffic; Websites; automated computer programs; baseline measures; blacklist database; databases spamming; email address harvesting; health related portal; malicious purposes; nonmalicious purposes; online malicious crawler behaviour; privacy breach; unsolicited email promotion; visualisation measures; Browsers; Crawlers; Databases; Google; IP networks; Web pages; Web servers; visualisation; web analytics
Conference_Title :
Developments in E-systems Engineering (DeSE), 2011
Conference_Location :
Dubai
Print_ISBN :
978-1-4577-2186-1
DOI :
10.1109/DeSE.2011.83