Statistical Haplotypes Based on Functional Sequence Data Analysis for Genome-Wide Association Studies

Question

1. What is FDA's role in GWAS?

2. How are SNPs coded in GWAS?

3. How to determine statistically significant SNPs in functional regression models?

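4. How are genetic-based haplotypes constructed?

5. How is the FDA-based haplotype constructed?

6. What SNPs and haplotypes were identified in the 9p21.3 region?

7. How does the novel method improve haplotype identification?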

Accepted Answer

Functional data analysis (FDA) lets association tests in genome-wide association studies (GWAS) use the correlation, linkage, and linkage disequilibrium (LD) information shared among genetic variants. It treats the variants in a region as a collection of random variables forming a stochastic process, which captures complex dependency structures and higher-order LD that other methods often miss. It also offers a computationally efficient way to test associations between multiple variants and a phenotype, which suits large-scale genomic data. Fan et al. and subsequent studies have demonstrated its effectiveness in GWAS. This paper proposes a robust FDA representation of genetic variants that addresses the violation of the normality assumption in least-squares smoothing. It introduces a two-step method that integrates curve smoothing with functional generalized linear models to identify haplotypes associated with the phenotype. Section 2 details the approach, Section 3 compares its performance with single-SNP tests and with haplotypes constructed from genetic information, Section 4 discusses strengths and limitations, and Section 5 highlights future directions.

Accepted Answer

In GWAS, SNPs are often coded as the number of minor alleles observed at a given locus. Each locus carries two alleles, one on each chromosome of a homologous pair, so a SNP is coded as 0, 1, or 2. Suppose the nucleotide A is the more common allele (the major or reference allele) and T is the less common allele (the minor or non-reference allele). Counting copies of the minor allele, a genotype of AA is coded as 0, AT or TA as 1, and TT as 2. This numeric coding lets genetic variation enter association models for disease in GWAS.
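The allele-count coding can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `code_genotype` and the two-letter genotype strings are assumptions made for the example, and the counted allele is passed in explicitly so that the same function covers either counting convention.

```python
def code_genotype(genotype: str, counted_allele: str) -> int:
    """Return how many copies of `counted_allele` a two-letter genotype carries."""
    if len(genotype) != 2:
        raise ValueError("expected a two-allele genotype such as 'AT'")
    return sum(1 for allele in genotype if allele == counted_allele)


# Hypothetical locus where T is the minor allele and A is the major allele.
genotypes = ["AA", "AT", "TA", "TT"]
codes = [code_genotype(g, counted_allele="T") for g in genotypes]
print(codes)  # [0, 1, 1, 2]
```

Passing `counted_allele="A"` instead would count copies of the major allele, which reverses the 0 and 2 codes.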

Accepted Answer

In functional regression models, statistically significant SNPs can be found by examining the confidence band of the coefficient curve $\beta(t)$, which connects the point-wise confidence intervals of $\beta(t)$ at each position $t$. The confidence band is calculated as

$$\hat{\beta}^{T}\phi(t) \pm z_{\alpha/2}\sqrt{\sum_{k=1}^{q^{(\beta)}}\sum_{l=1}^{q^{(\beta)}}\phi_k(t)\,\phi_l(t)\,\mathrm{Cov}(\hat{\beta}_k,\hat{\beta}_l)}.$$

Here $\phi(t)$ is the vector of $q^{(\beta)}$ basis functions for $\beta(t)$, and $\hat{\beta}$ is the vector of estimated coefficients for those basis functions. A SNP at position $t$ is statistically significant at level $\alpha$ when the band at $t$ excludes zero, which indicates an association with the phenotype or disease being studied.
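The point-wise band can be computed directly from the basis functions evaluated at the SNP positions, the estimated coefficients, and their covariance matrix. The sketch below assumes these three inputs are already available from a fitted model; the function name `confidence_band` and the toy two-function basis are hypothetical and chosen only to keep the example small.

```python
from statistics import NormalDist

import numpy as np


def confidence_band(phi, beta_hat, cov_beta, alpha=0.05):
    """Point-wise (1 - alpha) band for beta(t) = phi(t)^T beta.

    phi      : (n_positions, q) basis functions evaluated at each position t
    beta_hat : (q,) estimated basis coefficients
    cov_beta : (q, q) covariance matrix of the estimated coefficients
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    estimate = phi @ beta_hat
    # Variance at each t: sum over k, l of phi_k(t) phi_l(t) Cov(beta_k, beta_l).
    variance = np.einsum("tk,kl,tl->t", phi, cov_beta, phi)
    half_width = z * np.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Toy inputs (assumed for illustration): basis {1, t} on five positions.
t = np.linspace(0.0, 1.0, 5)
phi = np.column_stack([np.ones_like(t), t])
beta_hat = np.array([0.0, 1.0])
cov_beta = np.diag([0.01, 0.01])

lower, upper = confidence_band(phi, beta_hat, cov_beta)
# A position is flagged when its interval does not contain zero.
significant = (lower > 0) | (upper < 0)
print(significant)
```

In this toy case the band contains zero only at the first position, so the remaining four positions are flagged.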

Accepted Answer

Genetic-based haplotypes are constructed using two common approaches: (1) blocks containing a fixed number of SNPs and (2) genomic windows of a fixed size. Approach (1) uses blocks of 5, 10, 20, and 50 SNPs, and Approach (2) uses windows of 10 kb, 20 kb, 50 kb, and 100 kb, giving 1392 blocks for Approach (1) and 2394 blocks for Approach (2). Missing genotype values are handled in two ways: excluding samples with a missing value for any SNP in a haplotype, and imputing missing values with the mode. Significant blocks are identified with a GLM at a p-value threshold of 10^-3. Approach (1) yields 9 significant blocks and Approach (2) yields 11, located primarily around 22 Mb with a few at 24 Mb. These findings agree with the single-SNP tests, which identified significant SNPs near 22 Mb. The haplotypes at 24 Mb have larger p-values (>10^-4) and are considered noise rather than true associations.
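The two blocking schemes can be illustrated with a short sketch that groups SNP indices either by count or by base-pair window. The function names are hypothetical, and two details are assumptions not stated in the answer: a shorter final block is kept in the fixed-count scheme, and windows are anchored at the first SNP position with empty windows skipped.

```python
import numpy as np


def fixed_count_blocks(n_snps, block_size):
    """Consecutive blocks of `block_size` SNP indices (the last block may be shorter)."""
    return [
        np.arange(start, min(start + block_size, n_snps))
        for start in range(0, n_snps, block_size)
    ]


def fixed_window_blocks(positions_bp, window_bp):
    """Group SNP indices into windows of `window_bp` base pairs.

    Windows start at the first SNP position; windows without SNPs are skipped.
    """
    positions = np.asarray(positions_bp)
    window_id = (positions - positions.min()) // window_bp
    return [np.flatnonzero(window_id == w) for w in np.unique(window_id)]


# Toy positions in base pairs (made up for illustration).
positions = [100, 4000, 9000, 12000, 15000, 31000, 33000]

count_blocks = fixed_count_blocks(len(positions), block_size=3)
window_blocks = fixed_window_blocks(positions, window_bp=10_000)

print([b.tolist() for b in count_blocks])   # [[0, 1, 2], [3, 4, 5], [6]]
print([b.tolist() for b in window_blocks])  # [[0, 1, 2], [3, 4], [5, 6]]
```

Each returned index array selects the genotype columns that form one candidate haplotype block, which could then be passed to a GLM for testing.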

Accepted Answer

In the two-step approach, FDA-based haplotypes are constructed from the coefficients of a functional GLM. First, SNP positions are treated as a continuum $t$, and the log-odds curves $g(t)$ are smoothed with 64 cubic spline basis functions to approximate the genotype. The smoothed curves are then transformed back to probability curves and fitted with the functional logistic model of Equation (6). AIC is used to select the number of basis functions for the coefficient curve $\beta(t)$, giving $q^{(\beta)} = 4$ cubic spline basis functions with AIC = 1440.4. Figure 1 shows the confidence band of the estimated coefficient curve. Because a single fit over the whole region may overlook small signals, the SNPs are also divided into four chunks of equal window size, approximately 3.3 Mb each. Within each chunk, the number of basis functions for the SNPs ($q^{(\mathrm{SNP})}$) and for $\beta(t)$ is chosen by AIC. Figure 2 presents the final estimated coefficient curves and their confidence bands for all four chunks, showing strong associations for 68 SNPs between positions 21986218 and 22219365 at the 5% significance level. Some weak associations appear at the 5% level but not at the 1% level; these are treated as noise rather than signal.
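The AIC-based choice of the number of basis functions for the coefficient curve can be sketched as follows. This is an illustration under several assumptions, not the paper's implementation: the probability curves and case/control labels are simulated, a Legendre polynomial basis stands in for the cubic splines, the integral of curve times basis function is approximated by a grid average, and the logistic model is fitted with a plain Newton-Raphson loop. All function and variable names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Fit a logistic regression by Newton-Raphson; return (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        weights = p * (1.0 - p)
        hessian = X.T @ (X * weights[:, None])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = np.clip(X @ beta, -30.0, 30.0)
    loglik = float(np.sum(y * eta - np.log1p(np.exp(eta))))
    return beta, loglik


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2.0 * n_params - 2.0 * loglik


def design_matrix(curves, t, q):
    """Intercept plus grid-average approximations of the integrals of x_i(t) phi_k(t)."""
    phi = legendre.legvander(2.0 * t - 1.0, q - 1)  # (n_positions, q) basis values
    scores = curves @ phi / len(t)
    return np.column_stack([np.ones(len(curves)), scores])


# Simulated stand-in data: 300 subjects, curves observed at 50 positions.
rng = np.random.default_rng(0)
n_subjects, n_positions = 300, 50
t = np.linspace(0.0, 1.0, n_positions)
curves = rng.uniform(size=(n_subjects, n_positions))
true_eta = curves @ (60.0 * (t - 0.5)) / n_positions  # assumed linear beta(t)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_eta))).astype(float)

candidates = [2, 3, 4, 5, 6]
aic_by_q = {}
for q in candidates:
    _, loglik = fit_logistic(design_matrix(curves, t, q), y)
    aic_by_q[q] = aic(loglik, n_params=q + 1)  # q basis coefficients + intercept

best_q = min(aic_by_q, key=aic_by_q.get)
print(best_q, aic_by_q)
```

The candidate with the smallest AIC is kept, trading goodness of fit against the number of coefficients in the same way the answer describes for the full region and for each chunk.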

Accepted Answer

In the 9p21.3 region, three methods identified significant SNPs and haplotypes: single-SNP tests, genetic-based haplotypes, and FDA-based haplotypes. The SNPs with the smallest p-values lie between 22.06 Mb and 22.13 Mb. The strongly associated genetic-based haplotypes span 22.06 Mb to 22.20 Mb, and the most significant FDA-based haplotype spans roughly 21.98 Mb to 22.22 Mb. These findings are consistent with an association between SNPs in this region and coronary artery disease (CAD): two genes in the region, CDKN2A and CDKN2B, have been implicated in conferring CAD risk in multiple studies.

Accepted Answer

The novel method improves haplotype identification in three ways. It identifies important SNPs efficiently, without an exhaustive search, by incorporating spatial information and correlation among SNPs into the regression model. It imputes missing SNP values during smoothing, which helps avoid biased results. It also treats the data as binomial probability curves, which improves interpretability and resolves the violation of the normality assumption. Applied to the PennCATH dataset, the approach retrieved the significant SNP regions with a low false-positive rate. Both the functional and the non-functional methods were effective, but the functional method's use of spatial information and correlation among SNPs gives it an advantage.


