TY - JOUR
T1 - SNP variable selection by generalized graph domination
AU - Sun, Shuzhen
AU - Miao, Zhuqi
AU - Ratcliffe, Blaise
AU - Campbell, Polly
AU - Pasch, Bret
AU - El-Kassaby, Yousry A.
AU - Balasundaram, Balabhaskar
AU - Chen, Charles
N1 - Funding Information:
This research is funded by the Oklahoma Wheat Research Foundation, OCAST (PS15-011) and NSF-MRI 1626257 (CC), NSF-IOS 1558109 (CC and PC), NSF-CMMI 1404971 (BB), and a fellowship from the Cornell Lab of Ornithology (BP). The work presented in this report also reflects the support from the USDA HATCH project OKL03011 (CC). SS, BR, YAE and CC acknowledge cash funding for this research from Genome Canada, Genome Alberta through Alberta Economic Trade and Development, Genome British Columbia, the University of Alberta and University of Calgary and others, including the Alberta forest industry in support of the Resilient Forests (RES-FOR): Climate, Pests & Policy - Genomic Applications project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Publisher Copyright:
Copyright: © 2019 Sun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2019/1
Y1 - 2019/1
N2 - Background: High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p ≫ n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. Methods and findings: The k-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph; (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; and two vertices are adjacent, i.e. joined by an edge, if their pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then defined as the smallest subset of SNPs such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.
C++ source code that implements SNP-SELECT, using the Gurobi optimization solver for k-dominating set variable selection, is available (https://github.com/transgenomicsosu/SNP-SELECT).
AB - Background: High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p ≫ n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. Methods and findings: The k-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph; (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; and two vertices are adjacent, i.e. joined by an edge, if their pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then defined as the smallest subset of SNPs such that every SNP excluded from the subset has at least k neighbors among the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples.
C++ source code that implements SNP-SELECT, using the Gurobi optimization solver for k-dominating set variable selection, is available (https://github.com/transgenomicsosu/SNP-SELECT).
UR - http://www.scopus.com/inward/record.url?scp=85060495599&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060495599&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0203242
DO - 10.1371/journal.pone.0203242
M3 - Article
C2 - 30677030
AN - SCOPUS:85060495599
SN - 1932-6203
VL - 14
JO - PLoS ONE
JF - PLoS ONE
IS - 1
M1 - e0203242
ER -