Explainable multi-task learning improves the parallel estimation of polygenic risk scores for many diseases through shared genetic basis

Question

1. What is a polygenic risk score?

2. How many diseases were included in the pan-disease MTL model?

3. How do pan-cancer MTL and pan-disease MTL models compare to STL in predicting malignant melanoma PRS?

4. What algorithm identified important SNPs for MTL?

Accepted Answer

A polygenic risk score (PRS) quantifies an individual's genetic risk for a complex disease based on multiple genetic variants across their genome. PRS can be estimated with statistical methods such as BLUP and LDPred, or with machine learning approaches. Conventional PRS estimation assumes that risk variants have linear and independent effects on a phenotype. Cross-trait analyses reveal shared genetic determinants among diseases, and multi-task learning (MTL) approaches can exploit this shared basis to improve PRS estimation.
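The linear, independent-effects assumption can be illustrated with a minimal sketch: the score is a weighted sum of allele counts. The function name `additive_prs` and the effect sizes below are hypothetical and not taken from the study.

```python
def additive_prs(dosages, effect_sizes):
    """Weighted sum of allele dosages: the linear, independent-effects form of a PRS."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(beta * g for g, beta in zip(dosages, effect_sizes))

# Hypothetical per-SNP effect sizes (for example, log odds ratios).
effect_sizes = [0.5, -0.25, 1.0]
# One individual's genotypes, coded as 0, 1, or 2 copies of the counted allele.
person = [2, 1, 0]
score = additive_prs(person, effect_sizes)
print(score)
```

Because each SNP contributes independently, a model of this form cannot capture interactions between variants, which is one motivation for neural network approaches.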

Accepted Answer

The pan-disease MTL model was constructed to predict all 362 diseases in the United Kingdom Biobank (UKB) dataset. Of these, 77 had a prevalence higher than 0.5% in the UKB cohort: 17 types of cancer and 60 non-cancer diseases. The dataset contained 805,426 SNPs genotyped in the cohort and was divided into training, validation, and test sets for model training, hyperparameter optimization, and performance benchmarking, respectively.

Accepted Answer

The pan-cancer MTL and pan-disease MTL models showed better predictive performance than the STL model for malignant melanoma PRS. The pan-cancer MTL model achieved a 9.2% higher AUC and the pan-disease MTL model achieved an 8.1% higher AUC compared to the STL model. Both MTL models also achieved higher precision at the same recall level. The pan-cancer MTL model had a 141% relative increase in ROC AUC and a 96% relative increase in PR AUC over the STL model. The pan-disease MTL model had a 153% relative increase in ROC AUC and an 83% relative increase in PR AUC over the STL model. These results suggest that MTL models offer significant improvements in predicting malignant melanoma PRS compared to the STL model.

Accepted Answer

The first-order model-wise LINA interpretation algorithm [26] was used to identify the important SNPs used by MTL to predict each disease. A pan-cancer MTL model was trained and interpreted using a whole-genome vector containing both real SNPs and decoy SNPs. The algorithm identified 48 real SNPs as important for predicting malignant melanoma at an estimated FDR level of 0.1%. Many of these SNPs have been identified as melanoma risk variants in previous GWAS. The algorithm also identified important SNPs for the 17 prevalent cancers at FDR levels of 0.1% and 5%.

Accepted Answer

Positive knowledge transfer in multi-task learning (MTL) refers to the phenomenon where training multiple tasks together in one neural network model improves performance on those tasks. In polygenic risk score (PRS) estimation, positive knowledge transfer occurs when training a model on a diverse set of diseases improves predictive performance for individual diseases. The effect arises when the model benefits from the shared genetic basis and latent representations across diseases. Its extent is quantified by the gain in predictive performance of MTL over single-task learning (STL). For example, in this study, many prevalent cancers and non-cancer diseases had higher PRS accuracy when trained using MTL with a larger set of diseases. This suggests that the joint selection of important SNPs and the amplification of data through pooling multiple diseases both contribute to positive knowledge transfer in PRS estimation.

Accepted Answer

The protocol and software used for phenotypic trait determination were described in a previous study [42]. This study used the UK Biobank dataset release version 2, which included 488,175 subjects, and determined the phenotypic traits of these subjects with that established protocol and software. The dataset covered a total of 805,426 SNPs, and the genotype of each SNP was encoded as 0 for homozygous with the minor allele, 1 for heterozygous alleles, or 2 for homozygous with the dominant allele. The code for data processing, model training, performance benchmarking, and model interpretation is publicly available at https://github.com/thepanlab/GattacaNet2.
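The 0/1/2 genotype encoding described above can be sketched as follows. This is an illustration only: the helper `encode_genotype` is hypothetical, and it assumes that "dominant allele" means the non-minor allele, so the code counts non-minor alleles in a two-letter genotype.

```python
def encode_genotype(allele_pair, minor_allele):
    """Return 0, 1, or 2: the number of non-minor alleles in a diploid genotype."""
    if len(allele_pair) != 2:
        raise ValueError("a diploid genotype has exactly two alleles")
    return sum(1 for allele in allele_pair if allele != minor_allele)

# Hypothetical SNP whose minor allele is "T" and other allele is "C".
codes = [encode_genotype(g, minor_allele="T") for g in ("TT", "CT", "CC")]
print(codes)  # one code per genotype: minor homozygote, heterozygote, other homozygote
```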

Accepted Answer

The input of MTL LINA is an m x 1 vector, X, containing the genotypes of m SNPs. The output is a d x 1 vector, Y, containing the predicted states of d traits. In the pan-cancer MTL model, d = 69, and in the pan-disease MTL model, d = 362. The model uses a sigmoid activation function, coefficient matrix K, attention vector A, and bias vector B to compute the output. A feedforward neural network, F(), with three hidden layers and specific activation functions, is used to compute the attention vector A. The loss function is defined as a combination of weighted cross-entropy losses and L2 norm of the coefficient matrix.
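A minimal NumPy sketch of this architecture is given below, under stated assumptions. The answer names the components (sigmoid, K, A, B, F) but not how they combine, so the form Y = sigmoid(Kᵀ(A ⊙ X) + B) is one plausible reading rather than the paper's confirmed equation. The tanh activations, layer sizes, task weights, and the squared-L2 penalty coefficient are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_network(x, weights, biases):
    """F(): three hidden layers (tanh is an assumption) followed by a linear output of size m."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)
    return weights[-1] @ h + biases[-1]

def mtl_lina_forward(x, K, B, weights, biases):
    """Assumed form: Y = sigmoid(K^T (A * X) + B), with A = F(X)."""
    a = attention_network(x, weights, biases)   # attention vector A, shape (m,)
    y = sigmoid(K.T @ (a * x) + B)              # predicted trait states Y, shape (d,)
    return y, a

def mtl_loss(y_pred, y_true, task_weights, K, l2_coeff):
    """Weighted cross-entropy summed over tasks plus a squared-L2 penalty on K."""
    eps = 1e-12
    ce = -(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))
    return float(np.sum(task_weights * ce) + l2_coeff * np.sum(K ** 2))

# Toy dimensions; the study uses m = 805,426 SNPs and d = 69 or 362 traits.
rng = np.random.default_rng(0)
m, d, hidden = 8, 3, 5
x = rng.integers(0, 3, size=m).astype(float)    # genotypes coded 0/1/2
sizes = [m, hidden, hidden, hidden, m]
weights = [rng.normal(scale=0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
K = rng.normal(scale=0.1, size=(m, d))
B = np.zeros(d)

y_pred, attention = mtl_lina_forward(x, K, B, weights, biases)
loss = mtl_loss(y_pred, np.array([1.0, 0.0, 0.0]), np.ones(d), K, l2_coeff=1e-3)
```

In this reading, the attention vector is shared across all d traits while each trait has its own column of K, which is one way a model could select important SNPs jointly across diseases.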

Accepted Answer

MTL and STL models were trained using stochastic gradient descent on a training set comprising 70% of UKB subjects. The training utilized mini-batches with a batch size of 512 and the Adam optimizer, starting with an initial learning rate of 10^-4. The models were trained for 100 epochs, with checkpointing after every epoch. The best-performing checkpoints on the validation set were retained for both pan-cancer and pan-disease MTL models. The training was conducted on a computer node equipped with dual A100 40GB GPUs and 256 GB system memory. Lazy-loading of training data was implemented using the pandas_plink library to minimize memory usage. Post-training, the predictive performance of all MTL and STL models was benchmarked using a test set.
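The mini-batch and checkpoint-selection logic described above can be sketched without any deep learning framework. The sketch deliberately omits the model and the Adam update: `train_step` and `validate` are hypothetical callbacks standing in for them, and the validation scores are made up for illustration.

```python
import random

def minibatch_indices(n_samples, batch_size, rng):
    """Yield shuffled index batches that cover every sample once per epoch."""
    order = list(range(n_samples))
    rng.shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

def train_with_checkpoints(n_samples, n_epochs, batch_size, train_step, validate, seed=0):
    """Run all epochs and remember which epoch scored best on the validation set."""
    rng = random.Random(seed)
    best_epoch, best_score = None, float("-inf")
    for epoch in range(n_epochs):
        for batch in minibatch_indices(n_samples, batch_size, rng):
            train_step(batch)            # one optimizer update per mini-batch
        score = validate(epoch)          # checkpoint is evaluated after every epoch
        if score > best_score:
            best_epoch, best_score = epoch, score
    return best_epoch, best_score

# Stub callbacks: record batch sizes and return made-up validation scores.
batch_sizes = []
fake_validation_scores = [0.60, 0.65, 0.63]
best_epoch, best_score = train_with_checkpoints(
    n_samples=1300,
    n_epochs=3,
    batch_size=512,
    train_step=lambda batch: batch_sizes.append(len(batch)),
    validate=lambda epoch: fake_validation_scores[epoch],
)
```

With 1,300 samples and a batch size of 512, each epoch has three batches, the last one smaller. The retained checkpoint is the epoch with the highest validation score, not necessarily the final one.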

Accepted Answer

The false discovery rate (FDR) is estimated by comparing the importance scores of real and decoy SNPs. Decoy SNPs are randomly generated to have the same allele frequencies as real SNPs but should have no association with phenotypes. The FDR for an importance score threshold is calculated as the ratio of the number of decoy SNPs above the threshold to the number of real SNPs above it. This estimate indicates how many of the SNPs reported as important by the interpretation algorithm are likely to be false positives. Important SNPs are identified at estimated FDR levels of 0.1% and 5% for cancers with >0.5% prevalence in the pan-cancer MTL model. The intersection and union of important SNPs between pairs of prevalent cancers are counted, and the genetic correlation between two cancers is computed as the Spearman correlation coefficient between the importance scores of SNPs in the union of their SNP sets at a 5% FDR.
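The decoy-based estimate can be sketched in a few lines. This is a simplified illustration: it assumes equal numbers of real and decoy SNPs and treats a score at or above the threshold as a hit. The importance scores are made up, and the helper `select_threshold` is a hypothetical addition.

```python
def estimate_fdr(real_scores, decoy_scores, threshold):
    """Estimated FDR: decoy SNPs passing the threshold divided by real SNPs passing it."""
    n_real = sum(s >= threshold for s in real_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_real == 0:
        return 0.0                      # no discoveries, so no false discoveries
    return n_decoy / n_real

def select_threshold(real_scores, decoy_scores, target_fdr):
    """Lowest observed score whose estimated FDR is at or below the target."""
    for t in sorted(set(real_scores) | set(decoy_scores)):
        if estimate_fdr(real_scores, decoy_scores, t) <= target_fdr:
            return t
    return None

# Made-up importance scores for six real and six decoy SNPs.
real = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
decoy = [0.65, 0.15, 0.1, 0.05, 0.02, 0.01]

fdr_at_06 = estimate_fdr(real, decoy, 0.6)          # 1 decoy over 4 real SNPs
threshold = select_threshold(real, decoy, 0.05)     # target FDR of 5%
important = [s for s in real if s >= threshold]
```

Because decoys have no true association with any phenotype, every decoy that passes the threshold represents a false positive, and the decoy count approximates how many real SNPs passed by chance.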

Most frequently asked questions

1. What is a polygenic risk score?

2. How many diseases were included in the pan-disease MTL model?

3. How do pan-cancer MTL and pan-disease MTL models compare to STL in predicting malignant melanoma PRS?

4. What algorithm identified important SNPs for MTL?

5. What is positive knowledge transfer in MTL?

6. What protocol and software were used for phenotypic trait determination?

7. What is the input and output of MTL LINA?

8. How were MTL and STL models trained?

9. How is the false discovery rate estimated in model interpretation?

Citations

Enhancing Epidural Needle Guidance using a Polarization-Sensitive Optical Coherence Tomography Probe with Convolutional Neural Networks.

Automatic renal carcinoma biopsy guidance using forward-viewing endoscopic optical coherence tomography and deep learning

A comprehensive multi-task deep learning approach for predicting metabolic syndrome with genetic, nutritional, and clinical data

Interpreting artificial neural networks to detect genome-wide association signals for complex traits

References

Drosophila Double parked: a conserved, essential replication protein that colocalizes with the origin recognition complex and links DNA replication with mitosis and the down-regulation of S phase transcripts

Identifying beneficial task relations for multi-task learning in deep neural networks

Incidence of cancer after asthma development: 2 independent population-based cohort studies.
