DocumentCode
549188
Title
A local dependence measure and its application to screening for high correlations in large data sets
Author
Sricharan, Kumar ; Hero, Alfred O., III ; Rajaratnam, Bala
Author_Institution
Dept. of EECS, Univ. of Michigan, Ann Arbor, MI, USA
fYear
2011
fDate
5-8 July 2011
Firstpage
1
Lastpage
8
Abstract
Correlation screening is frequently the only practical way to discover dependencies in very high-dimensional data. In correlation screening, a high threshold is applied to the matrix of sample correlation coefficients of the multivariate data. Variables whose coefficients exceed the threshold are called discoveries and are classified as dependent. The mean number of discoveries and the number of false discoveries in correlation screening problems depend on an information-theoretic measure J, a novel type of information divergence that is a function of the joint density of pairs of variables. It is therefore important to estimate J in order to determine screening thresholds for desired false alarm rates. In this paper, we propose a kernel estimator for J, establish its asymptotic consistency, and determine its asymptotic distribution. These results are used to minimize the MSE of the estimator and to determine confidence intervals on J. We use these results to test for dependence between variables in simulated data sets and between email spam harvesters. Finally, we use the estimate of J to determine screening thresholds in correlation screening problems involving gene expression data.
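The screening step described in the abstract (thresholding the sample correlation matrix and reporting the pairs that exceed it) can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the function names `sample_correlation` and `screen_correlations`, the synthetic data, and the threshold of 0.9 are all assumptions made for the example, and thresholding the magnitude of the coefficient is likewise an assumption. The sketch does not estimate the measure J.

```python
import math
import random


def sample_correlation(x, y):
    """Pearson sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_correlations(variables, threshold):
    """Return (i, j, r) for every pair of variables with |r| above the threshold.

    `variables` is a list of equal-length lists, one list per variable.
    Using the magnitude |r| is an assumption of this sketch.
    """
    discoveries = []
    p = len(variables)
    for i in range(p):
        for j in range(i + 1, p):
            r = sample_correlation(variables[i], variables[j])
            if abs(r) > threshold:
                discoveries.append((i, j, r))
    return discoveries


# Hypothetical data: variable 1 is a noisy copy of variable 0,
# variables 2 and 3 are independent of everything else.
rng = random.Random(0)
n = 200
base = [rng.gauss(0, 1) for _ in range(n)]
variables = [
    base,
    [b + 0.1 * rng.gauss(0, 1) for b in base],
    [rng.gauss(0, 1) for _ in range(n)],
    [rng.gauss(0, 1) for _ in range(n)],
]

discoveries = screen_correlations(variables, threshold=0.9)
discovered_pairs = [(i, j) for i, j, _ in discoveries]
print(discovered_pairs)
```

In this toy setting the threshold is picked by hand; the abstract's point is that a principled threshold for a desired false alarm rate depends on J, which this sketch leaves out.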
Keywords
correlation methods; data analysis; estimation theory; genetic algorithms; information theory; mean square error methods; MSE; asymptotic consistency; asymptotic distribution; confidence intervals; correlation screening problems; desired false alarm rates; dimensional data; email spam harvesters; false discovery; gene expression data; information divergence; information-theoretic measure; joint density; kernel estimator; large data sets; local dependence measure; multivariate data; sample correlation coefficients; screening thresholds; simulated data sets; Correlation; Covariance matrix; Electronic mail; Estimation; Gaussian distribution; Joints; Random variables; CLT; Dependence measure; Information theory; correlation screening; estimation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on
Conference_Location
Chicago, IL
Print_ISBN
978-1-4577-0267-9
Type
conf
Filename
5977629