• DocumentCode
    1913085
  • Title
    Mining the Web and the Internet for Accurate IP Address Geolocations
  • Author
    Guo, Chuanxiong ; Liu, Yunxin ; Shen, Wenchao ; Wang, Helen J. ; Yu, Qing ; Zhang, Yongguang
  • Author_Institution
    Microsoft Res., Redmond, WA
  • fYear
    2009
  • fDate
    19-25 April 2009
  • Firstpage
    2841
  • Lastpage
    2845
  • Abstract
    In this paper, we present Structon, a novel approach that uses Web mining together with inference and IP traceroute to geolocate IP addresses with significantly better accuracy than existing automated approaches. Structon is composed of three ideas, which we realize in three corresponding steps. First, we extract geolocation information of Web server IP addresses from Web pages. Second, we devise heuristic algorithms to improve both the accuracy and the coverage of the IP geolocation database using these Web server IP addresses and their geolocations as input. Third, for those segments that are not covered in the first two steps, we use IP traceroute to identify the access routers of those segments. When the location of the access router is known, we can deduce the location of the associated segment, since it is co-located with the access router. By mining 500 million Web pages collected in China in 2006 (11 percent of the total Web pages in China at that time), we are able to identify the geolocations for 103 million IP addresses. This represents nearly 88 percent of the IP addresses allocated to China in March 2008. Structon is 87.4 percent accurate at city granularity and up to 93.5 percent accurate at province level. We also used a 10-day Windows Live client log to evaluate our client IP address coverage: Structon identified geolocations of 98.9 percent of client IP addresses.
  • Keywords
    Internet; data mining; transport protocols; IP geolocation database; IP traceroute; Internet protocol; Structon; Web page mining; Web server IP address; Windows Live client log; access router; associated segment; client IP address coverage; geolocation information; heuristic algorithm; Cities and towns; Databases; Inference algorithms; Pattern matching; Telephony; Web mining; Web pages; Web server
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    INFOCOM 2009, IEEE
  • Conference_Location
    Rio de Janeiro
  • ISSN
    0743-166X
  • Print_ISBN
    978-1-4244-3512-8
  • Electronic_ISBN
    0743-166X
  • Type
    conf
  • DOI
    10.1109/INFCOM.2009.5062243
  • Filename
    5062243