Implementing a parallel matrix factorization library on the Cell Broadband Engine

doi:10.1155/2009/710321

Open Access • Journal Article • 10.1155/2009/710321

B. C. Vishwas, +2 more

- 01 Jan 2009

- Vol. 17, Iss: 1, pp 3-29

10

TL;DR: This paper presents a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR and SVD, on the STI Cell broadband engine, and explains how to achieve near-linear speedup on most of the factorization methods for a range of matrix sizes.

Abstract: Matrix factorization (often called decomposition) is a frequently used kernel in a large number of applications ranging from linear solvers to data clustering and machine learning. The central contribution of this paper is a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR and SVD, on the STI Cell broadband engine. The paper explores algorithmic as well as implementation challenges related to the Cell chip-multiprocessor and explains how we achieve near-linear speedup on most of the factorization techniques for a range of matrix sizes. For each of the factorization routines, we identify the bottleneck kernels and explain how we have attempted to resolve the bottleneck and to what extent we have been successful. Our implementations, for the largest data sets that we use, running on a two-node 3.2 GHz Cell BladeCenter (exercising a total of sixteen SPEs), deliver on average 203.9, 284.6, 81.5, 243.9 and 54.0 GFLOPS for dense LU, dense Cholesky, sparse Cholesky, QR and SVD, respectively. The implementations achieve speedups of 11.2, 12.8, 10.6, 13.0 and 6.2, respectively, for dense LU, dense Cholesky, sparse Cholesky, QR and SVD, when running on sixteen SPEs. We discuss the interesting interactions that result from parallelization of the factorization routines on a two-node non-uniform memory access (NUMA) Cell Blade cluster.
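To put the reported speedups in context, the short Python sketch below converts them into parallel efficiency (speedup divided by the number of SPEs). It takes the sixteen-SPE speedups as 11.2, 12.8, 10.6, 13.0 and 6.2, which is our reading of the abstract's figures; the names `speedup`, `parallel_efficiency` and `efficiency` are illustrative and do not come from the paper.

```python
# Parallel efficiency = speedup / number of processing elements.
# Assumption: the speedup values are the abstract's figures read as
# 11.2, 12.8, 10.6, 13.0 and 6.2 on sixteen SPEs.
SPES = 16  # sixteen SPEs across the two-node blade, per the abstract

speedup = {
    "dense LU": 11.2,
    "dense Cholesky": 12.8,
    "sparse Cholesky": 10.6,
    "QR": 13.0,
    "SVD": 6.2,
}


def parallel_efficiency(speedup_value, workers=SPES):
    """Return the fraction of ideal linear speedup that was achieved."""
    return speedup_value / workers


efficiency = {name: parallel_efficiency(s) for name, s in speedup.items()}

for name, eff in efficiency.items():
    print(f"{name:16s} speedup {speedup[name]:5.1f}  efficiency {eff:6.1%}")
```

Under this reading, the three dense routines and sparse Cholesky all retain roughly two thirds or more of ideal scaling, while SVD falls below 40%, which is consistent with the abstract's claim of near-linear speedup on most, but not all, of the factorizations.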

Citations

Journal Article•10.1016/J.PARCO.2011.08.006

Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture

Roman Wyrzykowski, +2 more

- 01 Apr 2012

TL;DR: A model-driven approach to adaptation of the double-precision matrix multiplication to architectures of blade systems based on two types of Cell processors is proposed, followed by a series of systematic optimization steps.

13

Journal Article•10.1016/J.CAM.2011.07.023

Parallel direct methods for solving the system of linear equations with pipelining on a multicore using OpenMP

Panagiotis D. Michailidis, +1 more

- 01 Sep 2011

- Journal of Computational and Applied Mat...

TL;DR: This paper describes and analyzes three parallel versions of the dense direct methods that are used in linear system solving on a multicore using an OpenMP interface, and proposes an implementation of the pipelining technique in OpenMP.

10

Patent

Memory-aware matrix factorization

Liangliang Cao, +2 more

- 18 Jun 2015

TL;DR: Methods, systems and computer program products for performing memory-aware matrix factorization on a graphics processing unit are presented.

3

Posted Content•10.1101/2020.06.07.138271

scASK: A novel ensemble framework for classifying cell types based on scRNA-seq data

Bo Liu, +2 more

- 08 Jun 2020

- bioRxiv

TL;DR: In this article, the authors combined high-resolution atomic force microscopy (AFM) imaging and force spectroscopy to gain insight into the interaction forces between the individual protomers of the hexagonally packed intermediate (HPI) layer of Deinococcus radiodurans.

1

Journal Article•10.1007/S10287-012-0137-3

IPM based sparse LP solver on a heterogeneous processor

Mujahed Eleyat, +1 more

- 18 Jan 2012

- Computational Management Science

TL;DR: The block fan-out algorithm is extended so that small blocks are aggregated into large blocks without adding extra zeros, reducing the overhead of small blocks; the suggested block aggregation method speeds up the whole LP solver by a factor of up to 2.5.

1

References

Proceedings Article•10.1145/223982.223990

The SPLASH-2 programs: characterization and methodological considerations

Steven Cameron Woo, +4 more

- 01 May 1995

TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.

4.1K

Journal Article•10.1147/RD.494.0589

Introduction to the cell multiprocessor

J. A. Kahle, +5 more

- 01 Jul 2005

- IBM Journal of Research and Development

TL;DR: This paper discusses the history of the project, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation of the Cell multiprocessor.

1.1K

Journal Article•10.1109/MM.2006.41

Synergistic Processing in Cell's Multicore Architecture

Michael K. Gschwind, +5 more

- 01 Mar 2006

- IEEE Micro

TL;DR: The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading edge architecture and compiler optimizations.

463

Journal Article•10.1287/OPRE.50.1.3.17780

Solving Real-World Linear Programs: A Decade and More of Progress

Robert E. Bixby

- 01 Jan 2002

- Operations Research

TL;DR: One person's perspective on the development of computational tools for linear programming is described, followed by historical remarks covering the some 40 years of linear-programming developments that predate my own involvement in this subject.

414

Journal Article•10.1016/S0024-3795(99)00134-2

Computing the Singular Value Decomposition with High Relative Accuracy

James Demmel, +5 more

- 01 Feb 1997

- Linear Algebra and its Applications

TL;DR: This paper analyzes when it is possible to compute the singular values and singular vectors of a matrix with high relative accuracy, which means that each computed singular value is guaranteed to have some correct digits, even if the singular values have widely varying magnitudes.

220

Related Papers (5)

On mapping data and computation for parallel sparse Cholesky factorization

A Scalable Parallel Block Algorithm for Band Cholesky Factorization.

Parallel sparse Cholesky factorization on a shared memory multiprocessor

QR factorization for the Cell Broadband Engine

An efficient block-oriented approach to parallel sparse Cholesky factorization