DocumentCode :
3144877
Title :
Active learning based frequent itemset mining over the deep web
Author :
Liu, Tantan ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
fYear :
2011
fDate :
11-16 April 2011
Firstpage :
219
Lastpage :
230
Abstract :
In recent years, the deep web has become an extremely popular mode of data dissemination. A key characteristic of deep web data sources is that their data can only be accessed through the limited query interfaces they support. This paper develops a methodology for mining the deep web. Because these data sources cannot be accessed directly, data mining must be performed on samples of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs. Unlike in existing sampling-based methods, which are typically applied to relational databases or streaming data, sampling costs, rather than computation or memory costs, are the dominant consideration in designing the algorithm. In this paper, we specifically target the frequent itemset mining problem and develop a method based on the theory of active learning. We focus on effectively obtaining a sample that can achieve a good estimate of the support values of 1-itemsets comprising an output attribute. In our method, a Bayesian network is used to describe the relationship between the input and output attributes. We have evaluated our method using one synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both novel aspects of our work, i.e., the use of active learning and the modeling of a deep web source with a Bayesian network. On all three datasets, by sampling less than 10% of all data records, we could achieve more than 95% accuracy in estimating the support of frequent itemsets.
Keywords :
Bayes methods; data mining; information dissemination; learning (artificial intelligence); query processing; relational databases; sampling methods; Bayesian network; active learning; data access; data dissemination; data mining; data streaming; dataset sampling; deep web databases; frequent itemset mining; query interface; relational databases; synthetic datasets; Accuracy; Bayesian methods; Data mining; Estimation; Itemsets; Sampling methods;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Conference_Location :
Hannover
ISSN :
1063-6382
Print_ISBN :
978-1-4244-8959-6
Electronic_ISBN :
1063-6382
Type :
conf
DOI :
10.1109/ICDE.2011.5767919
Filename :
5767919