• DocumentCode
    3081319
  • Title
    HiCrawl: A Hidden Web Crawler for Medical Domain
  • Author
    Gupta, Swastik ; Bhatia, Komal Kumar
  • Author_Institution
    Dept. of Comput. Eng., YMCA Univ. of Sci. & Technol., Faridabad, India
  • fYear
    2013
  • fDate
    24-26 Aug. 2013
  • Firstpage
    152
  • Lastpage
    157
  • Abstract
    The Hidden Web is the large portion of the WWW that holds numerous freely accessible Web databases hidden behind search form interfaces. Their content can be reached only through dynamic Web pages generated in response to user queries issued at those interfaces. The core challenge in implementing a crawler for the Hidden Web is therefore to get past these search form interfaces by automatically generating and issuing queries that help discover the dynamic Web pages. The paper presents a novel approach that guides the crawler in choosing the query term to submit to any search form interface designed to accept keywords or terms as input. The system is based on classification hierarchies, which may be constructed either manually or automatically. For illustration, we consider search form interfaces in the 'Medical' domain, one of the domains most commonly used by researchers, together with a manually generated top-down classification hierarchy for that domain.
  • Keywords
    Internet; information retrieval; medical information systems; search engines; HiCrawl; Web database; World Wide Web; classification hierarchy; dynamic Web pages; hidden Web crawler; medical domain; Crawlers; Databases; Larynx; Lungs; Nose; Web pages; Content Retrieval; Hidden Web; Surface Web; WWW; automatic form filling; crawlers; form processing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational and Business Intelligence (ISCBI), 2013 International Symposium on
  • Conference_Location
    New Delhi
  • Print_ISBN
    978-0-7695-5066-4
  • Type
    conf
  • DOI
    10.1109/ISCBI.2013.39
  • Filename
    6724343
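
The abstract describes choosing query terms from a top-down classification hierarchy. The record does not give the paper's actual algorithm, so the following is only a minimal sketch under stated assumptions: the hierarchy contents, the `HIERARCHY` name, and the `query_terms` function are hypothetical, and breadth-first order is one plausible reading of "top-down", not necessarily the authors' method. The leaf terms are borrowed from the keyword list above purely for illustration.

```python
from collections import deque

# Hypothetical, hand-built hierarchy (parent -> children).
# The node names are illustrative only, not taken from the paper.
HIERARCHY = {
    "Medical": ["Respiratory system"],
    "Respiratory system": ["Nose", "Larynx", "Lungs"],
    "Nose": [],
    "Larynx": [],
    "Lungs": [],
}


def query_terms(hierarchy, root):
    """Yield candidate query terms by walking the hierarchy top-down.

    Assumption: broader terms are tried before narrower ones, so the
    walk is breadth-first starting from the root.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        yield term
        for child in hierarchy.get(term, []):
            if child not in seen:  # guard against repeated nodes
                seen.add(child)
                queue.append(child)


terms = list(query_terms(HIERARCHY, "Medical"))
```

A crawler could submit each term from `terms` in turn to a keyword search form; how the results are fetched and how the next term is prioritised is outside what this record states.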