Title :
An Automatic Label Extraction Technique for Domain-Specific Hidden Web Crawling (LEHW)
Author :
El-Desouky, Ali I. ; Ali, Hesham A. ; El-Ghamrawy, Sally M.
Author_Institution :
Dept. of Comput. & Syst., Mansoura Univ.
Abstract :
General-purpose search engines (e.g. Google and Yahoo) ignore valuable data that represent 80% of the content on the Web; this portion of the Web is called the hidden Web (HW). Pages in the hidden Web are dynamically generated in response to queries submitted via search forms. In this paper, a new algorithm for extracting labels from multi-attribute (M-A) search form fields is proposed. A technique for automatic query generation for single-attribute (S-A) search forms is also provided in order to enhance the performance of domain-specific hidden Web crawlers overall. The innovation of the LEHW algorithm is its capability to distinguish between S-A and M-A forms, so that it can deal with both of them, unlike most hidden Web crawlers, which handle only one of the two. The embedding of the proposed algorithm within the overall framework of the HW crawler is evaluated through experiments using real Web sites. The preliminary results demonstrate the accuracy and precision of the proposed approach for most of the sites considered.
Keywords :
Web sites; query formulation; search engines; Web information extraction; Web site; automatic label extraction; automatic query generation; domain-specific hidden Web crawling; multiattribute search; search engine; single-attribute search form; Automatic control; Crawlers; Data mining; Filling; HTML; Radio control; Technological innovation; Uniform resource locators; Web pages; Crawling; HTML search Forms; Hidden Web; Query generation
Conference_Title :
Computer Engineering and Systems, The 2006 International Conference on
Conference_Location :
Cairo
Print_ISBN :
1-4244-0271-9
Electronic_ISBN :
1-4244-0272-7
DOI :
10.1109/ICCES.2006.320490