• DocumentCode
    14752
  • Title
    Prequery Discovery of Domain-Specific Query Forms: A Survey
  • Author
    Moraes, Marcos C.; Heuser, C.A.; Moreira, V.P.; Barbosa, D.
  • Author_Institution
    Inst. of Inf., UFRGS, Porto Alegre, Brazil
  • Volume
    25
  • Issue
    8
  • fYear
    2013
  • fDate
    Aug. 2013
  • Firstpage
    1830
  • Lastpage
    1848
  • Abstract
    The discovery of HTML query forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typically involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of knowledge about the domain as well as semantic processing. Others do not employ form submission, to avoid such difficulties, although some techniques rely to some extent on semantics and domain knowledge. This survey gives an up-to-date review of methods for the discovery of domain-specific query forms that do not involve form submission. We detail these methods and discuss how form discovery has become increasingly more automated over time. We conclude with a forecast of what we believe are the immediate next steps in this trend.
  • Keywords
    Web sites; hypermedia markup languages; query processing; HTML query form discovery; Web crawling; data source; domain knowledge; domain-specific query form prequery discovery; retrieved data analysis; semantic processing; Crawlers; HTML; Humans; Knowledge based systems; Manuals; Search engines; Semantics; Deep web; domain-specific search; hidden web; query form discovery
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.111
  • Filename
    6205753