Direct Discovery of High Utility Itemsets without Candidate Generation

Author

Junqiang Liu; Ke Wang; Benjamin C. M. Fung

Author_Institution

Inf. & Electron. Eng., Zhejiang Gongshang Univ., Hangzhou, China

fYear

2012

fDate

10-13 Dec. 2012

Firstpage

984

Lastpage

989

Abstract

Utility mining emerged recently to address the limitation of frequent itemset mining by introducing interestingness measures that reflect both the statistical significance and the user's expectation. Among utility mining problems, utility mining with the itemset share framework is a hard one, as no anti-monotone property holds for the interestingness measure. The state-of-the-art works on this problem all employ a two-phase, candidate generation approach, which scales poorly because of the huge number of candidates. This paper proposes a high utility itemset growth approach that works in a single phase without generating candidates. Our basic approach is to enumerate itemsets by prefix extensions, to prune the search space by utility upper bounding, and to maintain original utility information in the mining process by a novel data structure. Such a data structure enables us to compute a tight bound for powerful pruning and to directly identify high utility itemsets in an efficient and scalable way. We further enhance the efficiency significantly by introducing recursive irrelevant item filtering with sparse data, and a lookahead strategy with dense data. Extensive experiments on sparse and dense, synthetic and real data suggest that our algorithm outperforms the state-of-the-art algorithms by more than one order of magnitude.
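The general idea of enumerating itemsets by prefix extension and pruning with a utility upper bound can be illustrated with a minimal sketch. This is not the paper's algorithm or its data structure: the transaction layout (a dict from item to that item's utility in the transaction), the function name mine_high_utility, and the simple bound used here (exact utility of the prefix plus the utility of all items that could still be appended) are assumptions made for illustration, and utilities are assumed to be non-negative.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each transaction maps an item to its (non-negative) utility.
Transaction = Dict[str, int]


def mine_high_utility(db: List[Transaction], min_util: int) -> Dict[Tuple[str, ...], int]:
    """Depth-first prefix extension with a simple utility upper bound (illustrative only)."""
    items = sorted({i for t in db for i in t})  # fixed total order on items
    result: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], start: int,
             rows: List[Tuple[Transaction, int]]) -> None:
        # rows holds (transaction, utility of prefix in it) for transactions containing prefix.
        for k in range(start, len(items)):
            item = items[k]
            ext_rows = [(t, u + t[item]) for t, u in rows if item in t]
            if not ext_rows:
                continue
            utility = sum(u for _, u in ext_rows)  # exact utility, no candidate phase
            if utility >= min_util:
                result[prefix + (item,)] = utility
            # Upper bound on any further extension: exact utility plus the
            # utility of the items that may still be appended in each transaction.
            remaining = sum(t[j] for t, _ in ext_rows for j in items[k + 1:] if j in t)
            if utility + remaining >= min_util:
                grow(prefix + (item,), k + 1, ext_rows)

    grow((), 0, [(t, 0) for t in db])
    return result


db = [
    {"a": 5, "b": 2, "c": 1},
    {"a": 10, "c": 6},
    {"b": 4, "c": 3},
]
print(mine_high_utility(db, 15))
```

In this toy database the itemset {c} has utility 10 and falls below the threshold of 15, while its superset {a, c} has utility 22 and passes, which shows why a plain anti-monotone check cannot be used and an upper bound is needed for pruning.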

Keywords

data mining; information filtering; candidate generation approach; dense data; frequent itemset mining; high utility itemset discovery; high utility itemset growth approach; interestingness measure; lookahead strategy; prefix extension; recursive irrelevant item filtering; sparse data; statistical significance; user expectation; utility mining; utility upper bounding; Data mining; Educational institutions; Electronic mail; Itemsets; Scalability; Upper bound; Utility mining; frequent itemsets; high utility itemsets; pattern mining;

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining (ICDM), 2012 IEEE 12th International Conference on

Conference_Location

Brussels

ISSN

1550-4786

Print_ISBN

978-1-4673-4649-8

Type

conf

DOI

10.1109/ICDM.2012.20

Filename

6413821