Author :
Kontonasios, Kleanthis-Nikolaos ; Vreeken, Jilles ; De Bie, Tijl
Abstract :
Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL-based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation.
Keywords :
data mining; maximum entropy methods; optimisation; probability; statistical analysis; MDL-based histogram estimator; convex optimization technique; knowledge discovery process; marginal distribution; maximum entropy modelling; real-valued rectangular database; statistical assessment; Data models; Databases; Entropy; Histograms; Probabilistic logic; Testing; background knowledge; hypothesis testing; swap randomizations