DocumentCode :
3714569
Title :
Multi-purpose SNP Selection by the principal variables for a genetic study
Author :
Seunghyun Lee; Mira Park
Author_Institution :
Data analysis team, NEXON Korea Corporation, Seoul, Republic of Korea
fYear :
2015
Firstpage :
1341
Lastpage :
1344
Abstract :
In genome-wide association studies, the number of single nucleotide polymorphisms (SNPs) has increased drastically. The data may contain many near-duplicate SNPs in linkage disequilibrium, which can complicate the analysis and cause statistical problems in downstream work. Principal component analysis is a popular dimension reduction technique and is well known to be effective for many genetic association analyses. However, each principal component is a linear combination of all the original variables, so it does not lend itself to direct interpretation in terms of those variables. The purpose of our study is to eliminate the redundant SNPs and select a smaller subset consisting only of the informative SNPs. We propose an unsupervised SNP selection algorithm based on the principal variable (PV) method. It achieves dimensionality reduction by selecting a subset of the original variables, called PVs, that preserves as much information as possible. To find an optimal subset of SNPs, we focus on the criterion that minimizes the squared norm of the partial covariance matrix. We define a principal component (PC) cluster by principal component analysis and choose as its representative the SNP with high loadings, on average, on the important principal components. After discarding the other SNPs in the PC cluster, we calculate the partial covariance matrix of the remaining variables given the principal variable. To obtain the next representative SNP, the same procedure is applied to the partial covariance matrix. The process repeats until no variables remain to select or until a stopping criterion is met, namely the percentage of variance explained, measured by the trace or the squared norm of the covariance matrix. The resulting subset of SNPs can be used in further analyses with multiple purposes, such as the study of gene-gene interactions. We illustrate the proposed method on real genotype data and compare its performance with five current selection methods for principal variables.
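The partial covariance step and the trace-based stopping rule mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it omits the PC-cluster step entirely and instead uses a plain greedy search that, at each iteration, picks the variable whose removal leaves the smallest squared norm of the partial covariance matrix. The function names `partial_covariance` and `select_principal_variables`, and the parameters `explained` and `tol`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def partial_covariance(S, j):
    """Covariance of all variables after conditioning on variable j.

    Row and column j of the result are zero (up to rounding).
    """
    return S - np.outer(S[:, j], S[j, :]) / S[j, j]


def select_principal_variables(X, explained=0.9, tol=1e-10):
    """Greedy principal-variable selection (illustrative assumption, not the paper's method).

    X         : (n_samples, n_variables) array, e.g. genotypes coded 0/1/2
    explained : stop once this fraction of the total variance (trace) is accounted for
    tol       : variables with residual variance below tol are treated as already explained
    """
    S = np.cov(X, rowvar=False)
    total = np.trace(S)
    R = S.copy()  # current partial covariance given the selected variables
    selected = []
    while len(selected) < S.shape[0] and np.trace(R) > (1.0 - explained) * total:
        best, best_norm = None, np.inf
        for j in range(R.shape[0]):
            if R[j, j] <= tol:
                continue  # nothing left to explain for this variable
            norm = np.sum(partial_covariance(R, j) ** 2)  # squared Frobenius norm
            if norm < best_norm:
                best, best_norm = j, norm
        if best is None:
            break
        selected.append(best)
        R = partial_covariance(R, best)
    return selected
```

Conditioning on one variable at a time and reapplying the same update to the residual matrix is equivalent to conditioning on all selected variables jointly, which is why the loop only ever needs the current matrix `R`. A duplicated SNP column has zero residual variance once its twin is selected, so it is never chosen.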
Keywords :
"Genomics","Bioinformatics","Computational modeling","Load modeling","Sun"
Publisher :
IEEE
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/BIBM.2015.7359873
Filename :
7359873