Regional Information Center for Science and Technology - Multi-purpose SNP Selection by the principal variables for a genetic study

Abstract:

In genome-wide association studies, the number of single nucleotide polymorphisms (SNPs) has increased drastically. The data may contain many near-duplicate SNPs in high linkage disequilibrium, which can complicate the analysis and cause statistical problems in downstream analyses. Principal component analysis (PCA) is a popular dimension reduction technique and is well known to be effective in many genetic association analyses. However, each principal component (PC) is a linear combination of all the original variables and therefore offers no direct interpretation in terms of the original variables. The purpose of our study is to eliminate redundant SNPs and select a smaller subset consisting only of informative SNPs. We propose an unsupervised SNP selection algorithm based on the principal variable (PV) method, which achieves dimensionality reduction by selecting a subset of the original variables, called PVs, that preserves as much information as possible. To find an optimal subset of SNPs, we focus on the criterion that minimizes the squared norm of the partial covariance matrix. We define a PC cluster by PCA and choose as its representative the SNP with high loadings, on average, on the important PCs. After discarding the other SNPs in the PC cluster, we calculate the partial covariance matrix of the remaining variables given the selected PV. To obtain the next representative SNP, the same procedure is applied to the partial covariance matrix. The process repeats until no variables remain to be selected or a stopping criterion is met, namely the percentage of variance explained in terms of the trace or the squared norm of the covariance matrix. The resulting subset of SNPs can be used for further analyses with multiple purposes, such as the study of gene-gene interactions. We illustrate the proposed method on real genotype data and compare its performance with five existing selection methods for principal variables.
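To make the selection criterion concrete, the following is a minimal Python sketch of greedy principal variable selection that, at each step, picks the variable minimizing the squared (Frobenius) norm of the partial covariance matrix and stops once a chosen fraction of the squared norm is explained. It is a simplified illustration, not the authors' algorithm: it omits the PC cluster step described above, and the function name `select_principal_variables`, the `threshold` and `tol` parameters, and the use of NumPy are all assumptions introduced here.

```python
import numpy as np


def select_principal_variables(X, threshold=0.9, tol=1e-10):
    """Greedy principal variable selection (illustrative sketch only).

    X         : (n_samples, n_variables) array, e.g. genotypes coded 0/1/2.
    threshold : hypothetical stopping rule; stop once this fraction of the
                squared norm of the covariance matrix is explained.
    tol       : variables with residual variance below tol are skipped.
    Returns the column indices of the selected variables, in selection order.
    """
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)      # sample covariance of the variables
    total = np.sum(S ** 2)           # squared Frobenius norm of S
    residual = S.copy()              # current partial covariance matrix
    remaining = list(range(S.shape[0]))
    selected = []

    while remaining:
        best_j, best_norm, best_resid = None, np.inf, None
        for j in remaining:
            var_j = residual[j, j]
            if var_j <= tol:
                # Already (almost) fully explained by the selected variables.
                continue
            col = residual[:, j]
            # Partial covariance of all variables given candidate j.
            cand = residual - np.outer(col, col) / var_j
            norm = np.sum(cand ** 2)
            if norm < best_norm:
                best_j, best_norm, best_resid = j, norm, cand
        if best_j is None:
            break                    # nothing informative left to select
        selected.append(best_j)
        remaining.remove(best_j)
        residual = best_resid
        if 1.0 - best_norm / total >= threshold:
            break                    # enough of the squared norm explained
    return selected
```

Calling `select_principal_variables(genotypes, threshold=0.95)` on a samples-by-SNPs matrix returns the indices of the retained SNPs. Exact duplicates of an already selected SNP have zero residual variance and are never chosen, which mirrors the goal of discarding redundant SNPs. Replacing the squared norm with the trace of the partial covariance matrix would give the other stopping criterion mentioned above.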