DocumentCode :
2191002
Title :
On Finding Similar Items in a Stream of Transactions
Author :
Campagna, Andrea ; Pagh, Rasmus
Author_Institution :
IT Univ. of Copenhagen, Copenhagen, Denmark
fYear :
2010
fDate :
13 Dec. 2010
Firstpage :
121
Lastpage :
128
Abstract :
While there has been a lot of work on finding frequent item sets in transaction data streams, none of it solves the problem of finding similar pairs according to standard similarity measures. This paper is a first attempt at dealing with this, arguably more important, problem. We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent item sets: any algorithm that (even only approximately and with a chance of error) finds the most frequent k-item set must use space $\Omega(\min\{m_b, n^k, (m_b/\varphi)^k\})$ bits, where $m_b$ is the number of items in the stream so far, $n$ is the number of distinct items, and $\varphi$ is a support threshold. To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream. We work under the model that the transactions come in random order, and show that, surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy improves with the length of the stream for any fixed support threshold.
Keywords :
data mining; sampling methods; set theory; transaction processing; association rule; transaction data streaming; algorithms; association rules; sampling; streaming
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining Workshops (ICDMW), 2010 IEEE International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-9244-2
Electronic_ISBN :
978-0-7695-4257-7
Type :
conf
DOI :
10.1109/ICDMW.2010.152
Filename :
5693291