Author :
Bellavia, Stefania ; Lodi, Stefano ; Morini, Benedetta
Abstract :
Kernel density estimators are a popular family of non-parametric estimators with applications to exploratory statistics and data mining. Since kernel estimators must be constructed from the data, if the data are sensitive, only indirect representations of the estimate, such as graphs or tabulations, can be stored or transmitted. However, even such representations might contain enough information to allow for data reconstruction, yielding an inference problem for kernel estimates. The inference problem for kernel estimators can be described by a system of nonlinear equations that arises naturally from the kernel estimate of a multivariate dataset. The solution to the system is the set of data from which the kernel estimate was computed and, in practice, a good approximation to the solution is not available. A serious threat to data privacy is posed by publicly available solvers for nonlinear systems. This paper investigates the numerical solution of the nonlinear systems arising from the kernel estimate of a multivariate dataset and shows that this task is challenging. In fact, the Jacobian matrix of the system is numerically singular, and a large number of solvers for nonlinear equations fail because they have to solve linear systems whose coefficient matrix is the Jacobian. Further, up-to-date solvers for optimization problems that do not suffer from this drawback may also fail to solve the nonlinear system. To show this, we tested a subspace trust-region method, a BFGS method and a gradient projection method on both a synthetic and a real dataset. These methods are able to find a solution to the optimization problem even when started far from it. However, the experimental results on both datasets show that, if the initial guess is not very close to the solution, all three methods fail to converge to a solution of the system of equations.
Hence, unless a very good approximation of the solution is known, the dataset cannot be reconstructed by using publicly available solvers.
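The nonlinear system described in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: it assumes a univariate dataset, a Gaussian kernel, a fixed bandwidth `h`, and a published tabulation of the estimate on as many grid points as there are data values. All names (`kde`, `residual`, `jacobian`, `points`, `published`) are hypothetical. The unknowns are the data values, the residual is the difference between the estimate they produce and the published one, and the Jacobian is the matrix a Newton-type solver would have to factorize.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)


def kde(points, data, h):
    """Kernel density estimate of `data`, evaluated at `points`."""
    u = (points[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (data.size * h)


def residual(x, points, published, h):
    """F(x): estimate built from candidate data x minus the published values."""
    return kde(points, x, h) - published


def jacobian(x, points, h):
    """dF_j/dx_i = u_ji * K(u_ji) / (n * h^2), with u_ji = (t_j - x_i) / h."""
    u = (points[:, None] - x[None, :]) / h
    return u * gaussian_kernel(u) / (x.size * h**2)


# Assumed toy setup: 20 sensitive values and a tabulation on 20 grid points.
rng = np.random.default_rng(0)
data = rng.normal(size=20)
h = 0.5
points = np.linspace(-3.0, 3.0, data.size)
published = kde(points, data, h)

# The true data solve the system; the Jacobian there shows its conditioning.
J = jacobian(data, points, h)
cond = np.linalg.cond(J)
print("residual norm at true data:", np.linalg.norm(residual(data, points, published, h)))
print("Jacobian condition number:", cond)
```

A large printed condition number would be consistent with the numerically singular Jacobian reported in the abstract, but this toy setting says nothing about the datasets used in the paper. Note also that any permutation of the data solves the same system, since the estimate is a sum over the data values.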
Keywords :
Jacobian matrices; data mining; data privacy; estimation theory; gradient methods; inference mechanisms; nonlinear equations; optimisation; pattern clustering; probability; statistical databases; Jacobian matrix; data reconstruction; exploratory statistics; gradient projection method; inference problem; kernel density estimates; multivariate dataset; nonlinear equation; nonlinear system solving; optimization problem; subspace trust-region method; Kernel; Linear systems; Nonlinear systems; Statistics; Testing; Yield estimation