DocumentCode :
3343650
Title :
The website census
Author :
Qadeer, A. ; Mahmood, W. ; Waheed, A.
Author_Institution :
Al-Khawarizmi Inst. of Comput. Sci., Univ. of Eng. & Technol., Lahore, Pakistan
fYear :
2009
fDate :
9-12 Nov. 2009
Firstpage :
1
Lastpage :
6
Abstract :
The website census is an effort to enumerate all the websites on the World Wide Web (WWW) without crawling. Crawling is the traditional method of website discovery. It is conceptually simple, but the sheer size of the WWW makes its implementation complex and resource-demanding. An enormous amount of bandwidth, a huge persistent storage pool, a sufficiently large cluster of machines for data processing, and a complex set of software systems are just a few examples of the resources needed. In this work, we use exhaustive IP range probing to detect the presence of a web server on TCP port 80. Although this probing is exhaustive in nature, it is lightweight in terms of resource demands. This enumeration of websites has many applications. The most obvious is to use it as a seed for conventional crawling. It can also be refined into a top-level domain (TLD) specific seed for targeted crawling.
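The probing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the record does not describe. The function names `has_open_port` and `probe_range`, the CIDR-style input, and the one-second timeout are all assumptions chosen for illustration. An accepted connection on port 80 only suggests that a web server may be present; confirming it would need an actual HTTP request.

```python
import ipaddress
import socket


def has_open_port(host: str, port: int = 80, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def probe_range(cidr: str, port: int = 80, timeout: float = 1.0):
    """Yield each host address in the CIDR block that accepts a connection.

    Hypothetical helper: probes addresses one at a time, which is far
    slower than a real census would need to be.
    """
    for addr in ipaddress.ip_network(cidr, strict=False).hosts():
        if has_open_port(str(addr), port, timeout):
            yield str(addr)
```

A sequential loop like this keeps memory and storage needs small, which matches the lightweight character the abstract claims, but a scan of the full IPv4 space would presumably need concurrent probes and rate limiting that this sketch leaves out.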
Keywords :
IP networks; Internet; Web sites; search engines; transport protocols; IP range; TCP; WWW; crawling; data processing; software systems; storage pool; targeted crawling; top level domain specific seed; web server; website census; website discovery; world wide web; Bandwidth; Computer networks; Computer science; Crawlers; Internet; Search engines; Software systems; Uniform resource locators; Web sites; World Wide Web;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Internet Technology and Secured Transactions, 2009. ICITST 2009. International Conference for
Conference_Location :
London
Print_ISBN :
978-1-4244-5647-5
Type :
conf
DOI :
10.1109/ICITST.2009.5402623
Filename :
5402623