Abstract:
The problem of adaptive noisy clustering is investigated. Given a set of noisy observations Zi = Xi + εi, i = 1, …, n, the goal is to design clusters associated with the law of the Xi's, which have an unknown density f with respect to the Lebesgue measure. Since only a corrupted sample is observed, a direct approach such as the popular k-means is not suitable in this case. In this paper, we propose a noisy k-means minimization, which is based on the k-means loss function and a deconvolution estimator of the density f. This approach, however, depends on a bandwidth involved in the deconvolution kernel. Fast rates of convergence for the excess risk are established for a particular choice of the bandwidth, which depends on the smoothness of the density f. We then turn to the main issue of this paper: the data-driven choice of the bandwidth. We state an adaptive upper bound using a modified version of Lepski's method, called Empirical Risk Comparison, in which empirical risks associated with different bandwidths are compared. Finally, we illustrate that the selection rule can be used in many M-estimation problems where the empirical risk depends on a nuisance parameter.
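As a toy illustration of the noisy k-means idea described in the abstract, the sketch below builds a 1-D deconvolution kernel density estimate from corrupted observations Z_i = X_i + ε_i and minimizes the k-means loss integrated against that estimate. All concrete choices here (Gaussian noise, the sinc deconvolution kernel, the simulated mixture, the fixed bandwidth, and the grid search over centers) are our own illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

try:
    from numpy import trapezoid as trapz  # NumPy >= 2.0
except ImportError:
    from numpy import trapz               # older NumPy

rng = np.random.default_rng(0)

# Simulated errors-in-variables data (all parameters are illustrative).
n, sigma = 500, 0.3                                  # sample size, known Gaussian noise level
X = np.concatenate([rng.normal(-2.0, 0.4, n // 2),
                    rng.normal(2.0, 0.4, n - n // 2)])
Z = X + rng.normal(0.0, sigma, n)                    # observed Z_i = X_i + eps_i

def deconv_density(x_grid, Z, h, sigma, n_t=201):
    """Deconvolution kernel density estimate of f with bandwidth h.

    Uses the sinc kernel (Fourier transform = indicator of [-1, 1]), so the
    estimator is the inverse Fourier transform of
        ecf_Z(t/h) / phi_eps(t/h)  over t in [-1, 1],
    with phi_eps(s) = exp(-sigma^2 s^2 / 2) for Gaussian noise.
    """
    t = np.linspace(-1.0, 1.0, n_t)
    s = t / h
    ecf = np.exp(1j * s[None, :] * Z[:, None]).mean(axis=0)  # empirical char. fn of Z
    damp = np.exp(sigma**2 * s**2 / 2.0)                     # 1 / phi_eps(t/h)
    integrand = np.exp(-1j * np.outer(x_grid, s)) * ecf * damp
    return np.real(trapz(integrand, t, axis=1)) / (2.0 * np.pi * h)

def noisy_kmeans_risk(centers, x_grid, f_hat):
    """k-means loss integrated against the deconvolution density estimate."""
    d2 = (x_grid[:, None] - np.asarray(centers)[None, :]) ** 2
    return trapz(d2.min(axis=1) * f_hat, x_grid)

h = 0.4                                  # fixed bandwidth; the paper selects it adaptively
x_grid = np.linspace(-4.0, 4.0, 161)
f_hat = deconv_density(x_grid, Z, h, sigma)

# Crude minimization of the noisy k-means criterion over pairs of candidate centers.
cand = np.linspace(-3.0, 3.0, 25)
pairs = [(a, b) for a in cand for b in cand if a < b]
best = min(pairs, key=lambda c: noisy_kmeans_risk(c, x_grid, f_hat))
print("estimated centers:", best)
```

In the paper, the bandwidth h is chosen in a data-driven way by comparing empirical risks across bandwidths (the Empirical Risk Comparison rule); the fixed h above sidesteps that step for brevity.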
Keywords:
adaptive estimation; bandwidth selection; deconvolution; empirical risk comparison; errors-in-variables; excess risk; fast rates; k-means; Lepski's method; M-estimation; nonparametric statistics; pattern clustering; statistical learning