Accelerated computations for iterative-solver techniques in single-step BLUP models

Question

1. How can computational strategies be used to solve mixed model equations efficiently in large animal breeding programs?

2. What is the ssSNPBLUP approach and how is it represented in the equation system?

3. What computational bottleneck exists in equation systems?

4. How does the 5codes algorithm reduce data streams?

Accepted Answer

Mixed model equations in large animal breeding programs can be solved efficiently by combining reformulated models with high-performance computing (HPC) techniques. One proposed approach is the algorithm for Proven and Young Animals (APY), which approximates the inverse of the genomic relationship matrix (GRM) through genomic recursion on a subset of core animals. Alternatively, the original ssGBLUP model has been reformulated so that established numerical software can be used and the GRM never has to be constructed or inverted explicitly. Single-step GT(A)BLUP models and single-step SNP BLUP models follow this route: they estimate SNP effects directly and so avoid the GRM and its inverse. Sparse matrix operations combined with iterative solvers, such as the preconditioned conjugate gradient (PCG) method, can improve convergence speed and reduce computational load. For the multiplication of SNP matrices by real-valued matrices, tailored algorithms have been developed for CPUs, optimized for various instruction set architectures, and for Nvidia® GPUs, building on the Nvidia® CUTLASS library. Together, these advances can substantially reduce computation times and memory requirements, which allows larger populations to be included in genomic evaluations.
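To make the role of the PCG solver concrete, the following is a minimal sketch of a preconditioned conjugate gradient iteration with a diagonal (Jacobi) preconditioner. It is not the solver used in the paper; the function name `pcg`, the callback `A_mul`, and the choice of preconditioner are assumptions made for illustration. The coefficient matrix is only accessed through matrix-vector products, which is why fast genotype matrix multiplications matter in every iteration.

```python
import numpy as np

def pcg(A_mul, b, M_inv_diag, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    A_mul      : callable returning A @ v (A itself is never formed here)
    M_inv_diag : inverse of the diagonal of A (Jacobi preconditioner, an assumption)
    """
    x = np.zeros_like(b)
    r = b - A_mul(x)          # residual
    z = M_inv_diag * r        # preconditioned residual
    d = z.copy()              # search direction
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            break
        Ad = A_mul(d)         # the only place the coefficient matrix is touched
        alpha = rz / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        z = M_inv_diag * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d
        rz = rz_new
    return x

# Small usage example on a random symmetric positive definite system.
rng = np.random.default_rng(0)
B = rng.standard_normal((30, 20))
A = B.T @ B + np.eye(20)
b = rng.standard_normal(20)
x = pcg(lambda v: A @ v, b, 1.0 / np.diag(A))
```

Passing the product as a callback mirrors the setting described above: the system matrix of a single-step model is applied block by block rather than stored as one dense array.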

Accepted Answer

The ssSNPBLUP approach is a single-step model introduced by Liu et al. (2014) for estimating breeding values. Its system of equations relates records, genotypes, and effects. The block structure of the coefficient matrix, as far as it can be read from the source, is

$$
\begin{bmatrix}
X'R^{-1}X & X_n'R_n^{-1}W_n & X_g'R_g^{-1}W_g & \cdot \\
W_n'R_n^{-1}X_n & W_n'R_n^{-1}W_n + S^{11} & S^{12} & \cdot \\
W_g'R_g^{-1}X_g & S^{21} & W_g'R_g^{-1}W_g + S^{22} & S^{23} \\
\cdot & \cdot & S^{32} & S^{33}
\end{bmatrix}
$$

where the subscripts n and g refer to non-genotyped and genotyped animals, and a dot marks a block that is not legible in the source. The system is used to estimate the breeding values of animals from their genotypes and phenotypes, accounting for additive genetic effects and residual polygenic effects. Because all information is combined in a single step and SNP effects are estimated directly, the model remains computationally tractable for large datasets.

Accepted Answer

The multiplication of the centered genotype matrix Z by a matrix L of low width has been a computational bottleneck. The operation can be reformulated as

$$ZL = ML - 1_{n_g}\,(p'L),$$

where 1_{n_g} is a vector of ones of length n_g. The correction term consists of a vector-matrix multiplication p'L followed by a rank-one update and is computationally cheap, so the cost is dominated by the product ML. Because genotyping is inexpensive, M can have extremely large dimensions, capturing the genomic information of millions of animals. In addition, M is usually stored in a compressed format, which rules out naive calls to BLAS routines; decompressing the whole matrix first would be slow and memory-intensive. Kim et al. (2022) propose a decompress-on-the-fly approach to address this issue: tiles of M are unpacked that are small enough for the result to fit in L1 cache, and the matrix multiplication is performed on these tiles, taking advantage of the fast access times of the L1 cache.
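The reformulation and the tile-wise processing can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the implementation from the paper: M is held as an uncompressed `uint8` array, "decompression" is emulated by converting one small tile at a time to double precision, and the names `centered_product`, `tile_rows`, and `tile_cols` are hypothetical.

```python
import numpy as np

def centered_product(M, p, L, tile_rows=64, tile_cols=64):
    """Compute (M - 1 p') L without ever forming the centered matrix.

    M : (n, s) integer genotype matrix with entries 0, 1, 2
    p : (s,) centering vector
    L : (s, k) real-valued matrix of low width k
    """
    n, s = M.shape
    out = np.zeros((n, L.shape[1]))
    for r in range(0, n, tile_rows):
        for c in range(0, s, tile_cols):
            # Expand only the current tile to float64 (stand-in for unpacking).
            tile = M[r:r + tile_rows, c:c + tile_cols].astype(np.float64)
            out[r:r + tile_rows] += tile @ L[c:c + tile_cols]
    # Rank-one correction: subtract the row vector p'L from every row.
    out -= p @ L
    return out

# Usage example with random genotypes.
rng = np.random.default_rng(1)
M = rng.integers(0, 3, size=(50, 37)).astype(np.uint8)
p = M.mean(axis=0)
L = rng.standard_normal((37, 4))
ZL = centered_product(M, p, L, tile_rows=16, tile_cols=8)
```

The centered matrix Z would need floating-point storage for every entry, whereas this formulation touches M only in its compact integer form.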

Accepted Answer

The 5codes algorithm is a novel approach for CPU computations that aims to reduce the data streamed through the cache hierarchy. It treats the storage of SNP data as a combinatorial problem: a vector of five SNP genotypes with three possible states each has 3^5 = 243 realizations, so each realized vector can be stored in one 8-bit unsigned integer while preserving the order of the SNPs. During preprocessing, the input data is converted to this more compressed format, called 5codes. At multiplication time, the algorithm loads a vector and stores all possible results of the scalar product in a hash table, so that each product is obtained by a lookup instead of being recomputed. This reduces both the number of memory accesses and the accumulation of floating-point rounding errors. The computations are also parallelized across the available processor cores.
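The idea can be sketched as follows. This is a simplified illustration, not the optimized routine from the paper: it assumes a base-3 encoding of five genotypes per byte and uses a plain NumPy array as the lookup table in place of a hash table. The names `encode_5codes`, `multiply_5codes`, and `PATTERNS` are hypothetical.

```python
import numpy as np

POWERS = 3 ** np.arange(5)  # place values 1, 3, 9, 27, 81

# Row c holds the five genotypes (0, 1, 2) whose base-3 code is c.
PATTERNS = np.array(
    [[(code // 3 ** k) % 3 for k in range(5)] for code in range(243)],
    dtype=np.float64,
)

def encode_5codes(M):
    """Pack each run of five SNPs in a row of M into one uint8 code."""
    n, s = M.shape
    pad = (-s) % 5  # zero-pad so the SNP count is a multiple of five
    Mp = np.pad(M.astype(np.int64), ((0, 0), (0, pad))).reshape(n, -1, 5)
    return (Mp @ POWERS).astype(np.uint8)

def multiply_5codes(codes, v):
    """Compute M @ v from the packed codes via table lookups."""
    g = codes.shape[1]
    vp = np.pad(v, (0, 5 * g - v.shape[0])).reshape(g, 5)
    # table[c, j] = scalar product of pattern c with the j-th group of v
    table = PATTERNS @ vp.T
    return table[codes, np.arange(g)].sum(axis=1)

# Usage example.
rng = np.random.default_rng(2)
M = rng.integers(0, 3, size=(7, 23))
v = rng.standard_normal(23)
codes = encode_5codes(M)
result = multiply_5codes(codes, v)
```

The table has 243 entries per group of five SNPs and is built once per vector, after which every animal reuses it; this is where the saving over recomputing each scalar product comes from.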

Accepted Answer

The GPU implementation extends the HPC capabilities by running the genotype matrix multiplication on Nvidia® GPUs. A matrix multiplication routine is implemented in CUDA by extending the CUTLASS library for high-performance matrix operations. The Single Instruction Multiple Threads (SIMT) part of CUTLASS's warp-level API distributes scalar products of four-dimensional vectors to the GPU cores. The established PLINK 1 storage format, in which four SNP values occupy one byte, is retained, and a new CUTLASS-compatible interleaved data type for double-precision vectors of size four is introduced. A new scalar-product microkernel fits the genotype matrix multiplication into the CUTLASS framework, and the CUTLASS interfaces upstream are adjusted accordingly. Efficient memory access iterators in CUTLASS allow fast data movement between device memory, shared memory, and cores. Memory for the matrix L is preallocated to reduce allocations and data movement between host and device. The implementation is described as the first to perform matrix multiplication of 2-bit integers with double-precision floating-point values; it was designed in parallel to the work of Kim et al., who extended CUTLASS for 4-bit integers with half-precision floating-point values.
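What such a microkernel computes can be emulated on the CPU in a few lines. The sketch below is not CUDA or CUTLASS code; it only shows the arithmetic of one scalar product between a PLINK 1 byte (four 2-bit genotypes, lowest-order bits first) and four double-precision values. The mapping of the "missing" code to 0 and the names `BED_TO_COUNT` and `microkernel` are assumptions for illustration.

```python
import numpy as np

# PLINK 1 .bed 2-bit codes mapped to the count of the first allele:
# 0b00 -> 2 (homozygous first allele), 0b10 -> 1 (heterozygous),
# 0b11 -> 0 (homozygous second allele).
# 0b01 means "missing"; it is mapped to 0 here purely as an assumption.
BED_TO_COUNT = np.array([2.0, 0.0, 1.0, 0.0])

SHIFTS = np.array([0, 2, 4, 6])

def microkernel(byte, quad):
    """Scalar product of four packed genotypes with four doubles."""
    codes = (int(byte) >> SHIFTS) & 3   # unpack, lowest-order pair first
    return float(BED_TO_COUNT[codes] @ quad)

# Usage example: bit pairs, read from the low end, are 10, 00, 10, 11,
# i.e. genotypes 1, 2, 1, 0.
value = microkernel(0b11100010, np.array([1.0, 10.0, 100.0, 1000.0]))
```

On the GPU, many such four-wide products run concurrently, which is why keeping the double-precision operands in an interleaved layout of size four matches the byte-sized genotype storage.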

Accepted Answer

A memory-efficient implementation transposes the Z matrix chunk by chunk, iterating over small blocks, such as 16 by 16, that are compatible with the compressed storage format. This allows the transposed matrix product Z'L to be calculated efficiently without becoming memory-bound. Alternatively, Z can be transposed as a whole during start-up and stored separately, which reduces computation time at the cost of holding a second copy. For the dataset used in the article, 2.61 million animals with about 47,000 SNP markers, the reported requirement is about 57 gigabytes of random access memory.
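A chunk-wise transposed product can be sketched as below, together with a back-of-the-envelope memory estimate. Both are illustrations under stated assumptions rather than the paper's code: M is an uncompressed integer array, the centering is applied through the identity Z'L = M'L - p (1'L), and the function names are hypothetical.

```python
import numpy as np

def transposed_product(M, p, L, chunk=16):
    """Compute (M - 1 p')' L by transposing one small chunk at a time.

    M : (n, s) genotype matrix, p : (s,) centering vector, L : (n, k)
    """
    n, s = M.shape
    out = np.zeros((s, L.shape[1]))
    for r in range(0, n, chunk):
        for c in range(0, s, chunk):
            # Transpose only the current chunk, never the whole matrix.
            block_t = M[r:r + chunk, c:c + chunk].T.astype(np.float64)
            out[c:c + chunk] += block_t @ L[r:r + chunk]
    # Centering correction: p times the column sums of L.
    out -= np.outer(p, L.sum(axis=0))
    return out

def packed_size_gib(n_animals, n_snps, bits_per_genotype=2):
    """Size of one packed genotype matrix in GiB (2 bits per genotype assumed)."""
    return n_animals * n_snps * bits_per_genotype / 8 / 2 ** 30

# Usage example.
rng = np.random.default_rng(3)
M = rng.integers(0, 3, size=(40, 33))
p = M.mean(axis=0)
L = rng.standard_normal((40, 3))
ZtL = transposed_product(M, p, L)
one_copy = packed_size_gib(2_610_000, 47_006)
```

Under the 2-bit assumption, one packed copy of the genotype matrix comes to roughly 28.6 GiB, so two copies (the matrix and its stored transpose) would total roughly 57 GiB. This is consistent with the figure quoted above, although the source does not spell out that breakdown.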

Accepted Answer

The new routines are substantially faster than the Intel MKL library for multiplications involving genotype data. In the benchmark, the consecutive multiplications ZL and Z'L were timed, as both are required in each iteration of a PCG solver. For the small and medium populations, computation times fell by more than 99.7% on the GPU and by more than 98.1% on the CPU, where the 5codes algorithm is used. The previous technique for multiplying compressed SNP matrices proposed by Vandenplas et al. (2020) took about 19 times longer than the GPU implementation and about 3 times longer than the CPU implementation across all population sizes. These results indicate that the new routines are both efficient and scalable.
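Because the comparison above mixes percentage reductions with "times longer" factors, a small conversion helps to put the numbers on one scale. The helper below is only arithmetic on the quoted figures; the function name is hypothetical.

```python
def speedup_from_reduction(reduction_percent):
    """Convert a reduction in run time (in percent) to a speed-up factor."""
    return 1.0 / (1.0 - reduction_percent / 100.0)

gpu_vs_mkl = speedup_from_reduction(99.7)   # reduction quoted for the GPU
cpu_vs_mkl = speedup_from_reduction(98.1)   # reduction quoted for the CPU
```

A reduction of 99.7% corresponds to a speed-up factor of about 333, and 98.1% to a factor of about 53, so the quoted gains over MKL are larger than the factors of about 19 and 3 quoted relative to the earlier method.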

Accepted Answer

The routine six-trait calving-difficulty evaluation covers Irish dairy and beef cattle and was performed by the Irish Cattle Breeding Federation (ICBF) in March 2022. The six traits are not named individually in the source; all relate to calving difficulty. The evaluation was based on a multi-trait animal model and variance components as described in Evans et al. (2019) and Vandenplas et al. (2023). The data file included 16.59 million records across the six traits, and the pedigree included 26.46 million animals. Genotypes were available for 2.61 million animals and comprised 47,006 SNP markers from the 29 bovine autosomes, each with a minor allele frequency greater than or equal to 0.01. The genotype data came from 30 different arrays ranging from 3k to 850k SNPs. Missing SNP genotypes were imputed with FImpute (Sargolzaei et al., 2014) to a 50k SNP set based on version 3 of the IDB chip.

Accepted Answer

For the ssSNPBLUP equation system on GPUs, the average wall clock time was reduced by approximately 72% (11%) for the multiplication ZL and by 95% (76%) for Z'L. The reduction stems from the efficient use of GPUs for the genotype matrix multiplication, which accounts for a significant portion of the time per iteration in the PCG solver. The total time for solving the ssSNPBLUP model fell from approximately 6.47 hours (4.03 hours) to 2.49 hours (1.98 hours) on the GPU and to 4.22 hours (3.30 hours) on the CPU.
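The total solve times can be turned into relative reductions with a one-line calculation. The snippet below is only arithmetic on the first set of hours quoted above; the function name is hypothetical.

```python
def reduction_percent(before_hours, after_hours):
    """Relative reduction in wall clock time, in percent."""
    return 100.0 * (1.0 - after_hours / before_hours)

gpu_total = reduction_percent(6.47, 2.49)   # baseline vs. GPU solve time
cpu_total = reduction_percent(6.47, 4.22)   # baseline vs. CPU solve time
```

With these figures, the total solve time drops by roughly 62% on the GPU and roughly 35% on the CPU. The overall gain is smaller than the per-multiplication gains because the genotype products are only one part of each PCG iteration.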

Accepted Answer

The multiplication of centered genotype matrices is needed in single-step BLUP models, which compute breeding value estimates when only a subset of the animals is genotyped and/or phenotyped. It is also used in genome-wide association studies, for example for genotype-phenotype correlations and for correcting population stratification, where it can likewise become a computational bottleneck. The optimized algorithms sped up the critical operations by a factor of ca. 3 on CPUs and ca. 20 on GPUs compared to previous methods. This acceleration allows researchers and practitioners to estimate breeding values in large populations within a reasonable time frame. However, as breeding populations grow and genotyping costs decrease, further research is needed to use computing resources more efficiently, for example by reducing system memory requirements and improving scalability across GPUs.

Most frequently asked questions

1. How can computational strategies be used to solve mixed model equations efficiently in large animal breeding programs?

2. What is the ssSNPBLUP approach and how is it represented in the equation system?

3. What computational bottleneck exists in equation systems?

4. How does the 5codes algorithm reduce data streams?

5. How does GPU implementation enhance HPC capabilities?

6. How can memory-efficient implementation transpose chunks of Z matrix?

7. How does the 5codes algorithm compare to the Intel MKL library in terms of computation time for genotype data?

8. What traits were evaluated in the routine six-trait calving-difficulty assessment?

9. What is the average wall clock time reduction for ssSNPBLUP equation system on GPUs?

10. What are the applications of multiplying centered genotype matrices?
