Implementing a parallel matrix factorization library on the cell broadband engine
B. C. Vishwas, Abhishek Gadia, Mainak Chaudhuri +2 more
- 01 Jan 2009
- Vol. 17, Iss: 1, pp 3-29
TL;DR: This paper presents a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR and SVD, on the STI Cell broadband engine and explains how to achieve near-linear speedup on most of them for a range of matrix sizes.
Abstract: Matrix factorization (often called decomposition) is a frequently used kernel in a large number of applications ranging from linear solvers to data clustering and machine learning. The central contribution of this paper is a thorough performance study of four popular matrix factorization techniques, namely LU, Cholesky, QR and SVD, on the STI Cell broadband engine. The paper explores algorithmic as well as implementation challenges related to the Cell chip-multiprocessor and explains how we achieve near-linear speedup on most of the factorization techniques for a range of matrix sizes. For each of the factorization routines, we identify the bottleneck kernels and explain how we have attempted to resolve the bottleneck and to what extent we have been successful. Our implementations, for the largest data sets that we use, running on a two-node 3.2 GHz Cell BladeCenter (exercising a total of sixteen SPEs), deliver on average 203.9, 284.6, 81.5, 243.9 and 54.0 GFLOPS for dense LU, dense Cholesky, sparse Cholesky, QR and SVD, respectively. The implementations achieve speedups of 11.2, 12.8, 10.6, 13.0 and 6.2 for dense LU, dense Cholesky, sparse Cholesky, QR and SVD, respectively, when running on sixteen SPEs. We discuss the interesting interactions that result from parallelization of the factorization routines on a two-node non-uniform memory access (NUMA) Cell Blade cluster.
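For readers unfamiliar with the kernels named in the abstract, the following is a minimal sketch of one of them, dense Cholesky factorization, in plain Python. It is a textbook unblocked, single-threaded version meant only to show what the factorization computes; it is not the paper's Cell implementation, and the function name `cholesky` and the example matrix are illustrative assumptions.

```python
import math


def cholesky(a):
    """Return lower-triangular L such that L * L^T equals a.

    Assumes a is a symmetric positive-definite matrix given as a list of rows.
    Illustrative sketch only: no blocking, SIMD, or parallelism.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of columns already computed.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l


if __name__ == "__main__":
    example = [[4.0, 12.0, -16.0],
               [12.0, 37.0, -43.0],
               [-16.0, -43.0, 98.0]]
    for row in cholesky(example):
        print(row)
```

Each column depends on the columns to its left, which is one reason parallel implementations typically work on blocks of the matrix rather than on individual entries.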