C.R. Weinberg, M. Shi, D.M. Umbach

Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC

Background: To evaluate statistical methods for identifying sets of genetic variants related to risk of disease, one first needs to be able to simulate realistic genotypes. We developed a method, applicable to a broad range of association study designs, that can simulate autosome-wide single-nucleotide polymorphism (SNP) data with realistic linkage disequilibrium and with spiked-in, user-specified single-SNP or multi-SNP causal effects. We are using the simulated data to develop and improve a method, based on a stochastic search algorithm, for identifying multi-SNP effects on risk.

Results: Our construction uses existing genome-wide association data from unrelated case-parent triads, augmented by including a hypothetical complement triad for each triad (same parents but with a hypothetical offspring who carries the non-transmitted parental alleles). We assign offspring qualitative or quantitative traits probabilistically through a specified risk model and show that our approach destroys all risk signals in the original data. Our method can simulate genetically homogeneous or stratified populations and can simulate case-parents studies, case-control studies, case-only studies, or studies of quantitative traits. We show that allele frequencies and linkage disequilibrium structure in the original genome-wide association sample are preserved in the simulated data. We have implemented our method in an R package, TriadSim, which is freely available from the Comprehensive R Archive Network (CRAN). The stochastic search genetic algorithm we apply to case-parents data can evidently find a very small needle in the very large haystack comprising all possible epistatic sets, provided there is little linkage disequilibrium between the SNPs.
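The two construction steps named above, forming a complement offspring and assigning a trait through a risk model, can be sketched as follows. This is a minimal illustration only and is not the TriadSim interface: the function names, the coding of genotypes as minor-allele counts (0, 1, or 2), the logistic form of the risk model, and the "at least one minor allele at every causal SNP" rule for a multi-SNP effect are all assumptions made for the example.

```python
import math
import random


def complement_genotypes(mother, father, child):
    """Return the complement offspring's genotypes.

    Each argument is a list of minor-allele counts (0, 1, or 2), one per SNP.
    The complement carries the parental alleles that were NOT transmitted to
    the observed child, so its count at each SNP is mother + father - child.
    """
    return [m + f - c for m, f, c in zip(mother, father, child)]


def disease_risk(genotypes, causal_snps, baseline_logit, log_odds_ratio):
    """Hypothetical logistic risk model with one multi-SNP (epistatic) effect.

    Risk is raised only when the individual carries at least one minor
    allele at every SNP index listed in causal_snps.
    """
    carries_all = all(genotypes[s] >= 1 for s in causal_snps)
    logit = baseline_logit + (log_odds_ratio if carries_all else 0.0)
    return 1.0 / (1.0 + math.exp(-logit))


def assign_case_status(genotypes, causal_snps, baseline_logit, log_odds_ratio, rng):
    """Draw a binary trait (True = case) with probability given by the risk model."""
    return rng.random() < disease_risk(
        genotypes, causal_snps, baseline_logit, log_odds_ratio
    )


if __name__ == "__main__":
    # Toy triad with three SNPs (values are made up for illustration).
    mother = [1, 2, 0]
    father = [1, 1, 1]
    child = [2, 2, 0]

    complement = complement_genotypes(mother, father, child)
    rng = random.Random(2024)  # fixed seed so the draw is repeatable
    is_case = assign_case_status(
        complement, causal_snps=[0, 1], baseline_logit=-2.0,
        log_odds_ratio=math.log(3.0), rng=rng,
    )
    print(complement, is_case)
```

Because the trait is drawn from the specified model rather than taken from the original data, any offspring, observed or complement, can end up as a case or a non-case in the simulated study.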

Conclusion: We have proposed a method for simulating genome-wide SNP data with realistic linkage disequilibrium. Our method will be useful for developing statistical methods for studying genetic associations, including higher-order effects such as epistasis and gene-by-environment interactions. While the simulation tool is an important step, the search for epistasis in the presence of dependencies among SNPs will remain a challenge.