Optimizing the Linear Fascicle Evaluation Algorithm for Multi-Core and Many-Core Systems

Open Access • Posted Content

- 14 May 2019

- arXiv: Distributed, Parallel, and Cluste...

2

TL;DR: This article proposes target-independent optimizations, followed by target-dependent optimizations for CPUs and GPUs, for the sparse matrix-vector multiplication (SpMV) operations that are the key bottleneck of the LiFE algorithm. The performance of SpMV often depends on exploiting regularity patterns in the matrix.

Abstract: Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one used for very large but sparse matrices. After sparse-tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-cores and GPUs. Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The LiFE application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed herein involve the Sparse Tucker Decomposition (STD), a widely used tensor decomposition method. Using this decomposition leads to irregular array references, making it very difficult to optimize for both CPUs and GPUs. Recent codes of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this work, we first propose target-independent optimizations to optimize these SpMV operations, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations, (2) data restructuring methods, and (3) methods to partition computations among threads. Then we present the optimizations for CPUs and GPUs to exploit platform-specific speed. Our highly optimized CPU code obtains a speedup of 27.12x over the original sequential CPU code running on a 16-core Intel Xeon (Skylake-based) system, and our optimized GPU code achieves a speedup of 5.2x over a reference optimized GPU code version on NVIDIA's GeForce RTX 2080 Ti GPU.
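To make the abstract's terms concrete, here is a minimal Python/NumPy sketch of an SpMV with indirect accesses, a data restructuring step, and a partitioning of the work into independent chunks. It is not the paper's code and does not model the Sparse Tucker Decomposition used by LiFE. It assumes a plain coordinate (COO) layout, and every function name and the toy matrix are hypothetical choices made for illustration.

```python
import numpy as np

def spmv_coo(rows, cols, vals, x, n_rows):
    """Compute y = A @ x for A stored as coordinate (COO) triples."""
    y = np.zeros(n_rows, dtype=vals.dtype)
    # x[cols] is an indirect (gather) read; np.add.at is an indirect
    # (scatter) update that accumulates correctly when rows repeat.
    np.add.at(y, rows, vals * x[cols])
    return y

def restructure_by_row(rows, cols, vals):
    """Illustrative restructuring: sort nonzeros by row index so that
    all updates to the same y[i] become contiguous in memory."""
    order = np.argsort(rows, kind="stable")
    return rows[order], cols[order], vals[order]

def partition_rows(sorted_rows, n_parts):
    """Split row-sorted nonzeros into chunks of roughly equal size
    such that no two chunks write to the same output row."""
    nnz = len(sorted_rows)
    bounds = [0]
    for k in range(1, n_parts):
        c = (k * nnz) // n_parts
        # Push the cut forward until it sits on a row boundary.
        while 0 < c < nnz and sorted_rows[c] == sorted_rows[c - 1]:
            c += 1
        if c > bounds[-1]:
            bounds.append(c)
    if bounds[-1] < nnz:
        bounds.append(nnz)
    return list(zip(bounds[:-1], bounds[1:]))

def spmv_partitioned(rows, cols, vals, x, n_rows, n_parts):
    """Same result as spmv_coo, computed chunk by chunk. The chunks own
    disjoint output rows, so each could go to a separate thread without
    atomic updates; here they run sequentially for simplicity."""
    r, c, v = restructure_by_row(rows, cols, vals)
    y = np.zeros(n_rows, dtype=v.dtype)
    for lo, hi in partition_rows(r, n_parts):
        np.add.at(y, r[lo:hi], v[lo:hi] * x[c[lo:hi]])
    return y

# Toy 3x3 matrix with six nonzeros, deliberately stored out of row order.
rows = np.array([2, 0, 1, 0, 2, 1])
cols = np.array([0, 1, 2, 2, 1, 0])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x = np.array([1.0, 2.0, 3.0])

y_ref = spmv_coo(rows, cols, vals, x, 3)
y_par = spmv_partitioned(rows, cols, vals, x, 3, 2)
```

In this sketch the sort trades a one-time preprocessing cost for contiguous writes. Cutting only at row boundaries is one simple way to keep the chunks from conflicting, at the price of some load imbalance when a few rows hold most of the nonzeros.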

Citations

•Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the various optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

54

Proceedings Article•10.1145/3330345.3332469

Optimizing the linear fascicle evaluation algorithm for many-core systems

Karan Aggarwal, +1 more

- 26 Jun 2019

TL;DR: The paper proposes data restructuring techniques to minimize the effects of irregular accesses, optimizations to map threads at the granularity of warps, thread blocks, and the grid, and methods to partition the computation among thread blocks to obtain fine-grained parallelism and data reuse.

7

References

Journal Article•10.1137/07070111X

Tensor Decompositions and Applications

Tamara G. Kolda, +1 more

- 01 Aug 2009

- SIAM Review

TL;DR: This survey provides an overview of higher-order tensor decompositions, their applications, and available software.

11.5K

Journal Article•10.1007/BF02289464

Some mathematical notes on three-mode factor analysis

Ledyard R Tucker

- 01 Sep 1966

- Psychometrika

TL;DR: The model for three-mode factor analysis is discussed in terms of newer applications of mathematical processes including a type of matrix process termed the Kronecker product and the definition of combination variables.

4.5K

Journal Article•10.1002/1531-8249(199902)45:2<265::AID-ANA21>3.0.CO;2-3

Three-dimensional tracking of axonal projections in the brain by magnetic resonance imaging.

Susumu Mori, +3 more

- 01 Feb 1999

- Annals of Neurology

TL;DR: It is shown that neuronal pathways in the rat brain can be probed in situ using high‐resolution three‐dimensional diffusion magnetic resonance imaging and a newly designed tracking approach.

3.7K

•Journal Article•10.1371/JOURNAL.PCBI.0010042

The Human Connectome: A Structural Description of the Human Brain

Olaf Sporns, +2 more

- 30 Sep 2005

- PLOS Computational Biology

TL;DR: A research strategy for obtaining the connection matrix of the human brain (the human “connectome”) is proposed, and its potential impact is discussed.

3.4K

•Journal Article•10.1002/1522-2594(200010)44:4<625::AID-MRM17>3.0.CO;2-O

In vivo fiber tractography using DT-MRI data

Peter J. Basser, +4 more

- 01 Oct 2000

- Magnetic Resonance in Medicine

TL;DR: Fiber tract trajectories in coherently organized brain white matter pathways were computed from in vivo diffusion tensor magnetic resonance imaging (DT‐MRI) data, and the method holds promise for elucidating architectural features in other fibrous tissues and ordered media.

3.3K