DocumentCode :
2026918
Title :
Combining topic models and string kernel for deep web categorization
Author :
Xu, Guangyue ; Zheng, Weimin ; Wu, Haiping ; Yang, Yujiu
Author_Institution :
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Volume :
6
fYear :
2010
fDate :
10-12 Aug. 2010
Firstpage :
2791
Lastpage :
2795
Abstract :
Online databases maintain collections of structured, domain-specific documents that are generated dynamically in response to users' queries rather than accessed through static URLs. Categorizing deep web sources according to their object domains is a critical step in integrating such sources. While existing methods focus on supervised or post-query methodologies, we propose a more practical pre-query algorithm that operates in an unsupervised manner. Given the number of domains, our two-phase approach first uses topic models to infer the hidden domain distribution of each query form, so that each query form's object domain can be identified preliminarily. In this phase, we construct a training set from the query forms deemed to have been categorized correctly, and we also select the deep web sources that need to be reclassified. In the second phase, we train a classifier with string kernel methods to reclassify the uncertain deep web sources and improve overall performance. The advantage of our algorithm over previous ones is that it captures the semantic structure of each query form. Based on the two-phase architecture, our framework works in an unsupervised manner and achieves satisfactory results. Experiments on the TEL-8 dataset from the UIUC Web integration repository show the effectiveness and efficiency of our algorithm.
Keywords :
Internet; document handling; query processing; TEL-8 dataset; UIUC Web integration repository; deep Web categorization; online databases; post query methodologies; pre query algorithm; static URL; string kernel; structured domain specific documents; topic models; training set; Atmospheric modeling; Books; Databases; Frequency modulation; Kernel; Semantics; Training;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on
Conference_Location :
Yantai, Shandong
Print_ISBN :
978-1-4244-5931-5
Type :
conf
DOI :
10.1109/FSKD.2010.5569236
Filename :
5569236