Title :
A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation
Author :
Malbasa, Vuk ; Vucetic, Slobodan
Author_Institution :
Temple Univ., Philadelphia
Abstract :
Learning from large datasets under limited resources imposes many constraints. It is often not practical or possible to keep the entire data set in main memory, and often the data can be observed only in a single pass, in the order in which they are presented. Traditional reservoir-based approaches perform well in this situation. One drawback of these approaches is that the examples not included in the final reservoir are often ignored. To remedy this, we propose a modification to the baseline reservoir algorithm. Instead of keeping the actual target values of reservoir examples, an estimate of their conditional expectation is kept and updated online as new data are observed from the stream. The estimate is obtained by averaging the target values of similar examples. The proposed algorithm uses a paired t-test to determine the similarity threshold. Thorough evaluation on generated two-dimensional data shows that the proposed algorithm produces reservoirs with considerably reduced target noise. This property allows training of significantly improved prediction models compared with the baseline reservoir-based approach.
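The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Euclidean distance as the similarity measure and a fixed, user-supplied threshold (the hypothetical `radius` parameter) in place of the paired t-test that the paper uses to determine the threshold. The class name `AveragingReservoir` and its fields are likewise illustrative.

```python
import math
import random


class AveragingReservoir:
    """Reservoir sampling that stores a running mean of the targets of
    similar stream examples instead of each example's raw target."""

    def __init__(self, capacity, radius, seed=0):
        self.capacity = capacity
        # Assumption: fixed similarity threshold; the paper selects the
        # threshold with a paired t-test, which is not reproduced here.
        self.radius = radius
        self.rng = random.Random(seed)
        self.items = []  # each entry: [x, mean_target, count]
        self.seen = 0    # number of stream examples observed so far

    def observe(self, x, y):
        """Process one stream example with features x and target y."""
        self.seen += 1
        # Update the estimate of every reservoir example similar to x.
        for item in self.items:
            if math.dist(item[0], x) <= self.radius:
                item[2] += 1
                item[1] += (y - item[1]) / item[2]  # incremental mean
        # Baseline reservoir step: fill first, then replace a random
        # slot with probability capacity / seen.
        if len(self.items) < self.capacity:
            self.items.append([x, y, 1])
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = [x, y, 1]
```

Calling `observe(x, y)` once per stream example keeps memory bounded by `capacity` while still letting examples that never enter the reservoir contribute to the stored target estimates of their neighbors.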
Keywords :
adaptive estimation; data mining; learning (artificial intelligence); sampling methods; statistical testing; baseline reservoir-based approach; conditional expectation adaptive estimation; learning algorithm; paired t-test; prediction model; reservoir sampling algorithm; resource-constrained data mining; similarity threshold; Adaptive estimation; Capacity planning; Data mining; Neural networks; Noise generators; Noise reduction; Predictive models; Reservoirs; Sampling methods;
Conference_Title :
Neural Networks, 2007. IJCNN 2007. International Joint Conference on
Conference_Location :
Orlando, FL
Print_ISBN :
978-1-4244-1379-9
Electronic_ISBN :
1098-7576
DOI :
10.1109/IJCNN.2007.4371299