Title :
Prequery Discovery of Domain-Specific Query Forms: A Survey
Author :
Moraes, Marcos C. ; Heuser, C.A. ; Moreira, V.P. ; Barbosa, D.
Author_Institution :
Inst. of Inf., UFRGS, Porto Alegre, Brazil
Abstract :
The discovery of HTML query forms is one of the main challenges in Deep Web crawling. Automatic solutions to this problem perform two main tasks. The first is locating HTML forms on the Web, which is done with traditional or focused crawlers. The second is identifying which of these forms are indeed meant for querying, which typically also involves determining a domain for the underlying data source (and thus for the form as well). This problem has attracted a great deal of interest, resulting in a long list of algorithms and techniques. Some methods submit requests through the forms and then analyze the data retrieved in response, typically requiring a great deal of domain knowledge as well as semantic processing. Others avoid these difficulties by not submitting forms, although some of these techniques still rely to some extent on semantics and domain knowledge. This survey gives an up-to-date review of methods for the discovery of domain-specific query forms that do not involve form submission. We detail these methods and discuss how form discovery has become increasingly automated over time. We conclude with a forecast of what we believe are the immediate next steps in this trend.
Keywords :
Web sites; hypermedia markup languages; query processing; HTML query form discovery; Web crawling; data source; domain knowledge; domain-specific query form prequery discovery; retrieved data analysis; semantic processing; Crawlers; HTML; Humans; Knowledge based systems; Manuals; Search engines; Semantics; Deep web; domain-specific search; hidden web; query form discovery
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
DOI :
10.1109/TKDE.2012.111