Efficient Nonserial Polyadic Dynamic Programming on the Cell Processor

doi:10.1109/IPDPS.2011.186

Proceedings Article•10.1109/IPDPS.2011.186

Li Liu, +4 more

- 16 May 2011

- pp 460-471

16

TL;DR: This paper proposes to improve one family of DP, nonserial polyadic DP (NPDP), targeting a heterogeneous multi-core architecture, the Cell Broadband Engine, and designs a new data layout which efficiently utilizes the on-chip memory system of the Cell processor.

Abstract: Dynamic programming (DP) is an effective technique for many search and optimization problems. However, its high arithmetic complexity limits its wider use. Although modern processor architectures with multiple cores and SIMD (single instruction multiple data) instructions provide increasingly high computing power, even the state-of-the-art fully optimized algorithm still largely underutilizes modern multi-core processors. In this paper we propose to improve one family of DP, nonserial polyadic DP (NPDP), targeting a heterogeneous multi-core architecture, the Cell Broadband Engine. We first design a new data layout which efficiently utilizes the on-chip memory system of the Cell processor. Next we devise a CellNPDP algorithm with two tiers. The first tier is an SPE (a co-processor on the Cell processor) procedure which efficiently computes a block of data that can fit into one SPE's local store. The second tier is a parallel procedure which enables all SPEs to efficiently compute all blocks of data. To evaluate CellNPDP, we use both performance modeling and experiments. The performance model reveals that the processor utilization of NPDP can be independent of the problem size. To empirically evaluate CellNPDP, we use two platforms: the IBM QS20 dual-Cell blade and a CPU platform with two of the latest quad-core CPUs. On both platforms, the processor utilization of CellNPDP is larger than 60%, which demonstrates that our optimizations and CellNPDP can be architecture-independent. Compared to the state-of-the-art fully optimized algorithm on the CPU platform, CellNPDP is 44-fold faster for single-precision and 28-fold faster for double-precision, which is a significant improvement for NPDP.
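The abstract does not spell out the recurrence or the block schedule, so the following is only a minimal Python sketch under stated assumptions. It assumes the interval recurrence commonly used as the NPDP kernel, m[i][j] = min(m[i][j], m[i][k] + m[k][j]) for i < k < j, and shows that the upper triangle can be computed block by block, with a per-block procedure inside an outer loop over blocks. The names `npdp_naive`, `npdp_blocked`, and the block size `bs` are hypothetical, and the sketch illustrates only the order of computation, not the paper's data layout, SIMD code, or SPE scheduling.

```python
INF = float("inf")


def npdp_naive(d):
    """Reference triple loop: m[i][j] = min(m[i][j], m[i][k] + m[k][j]), i < k < j."""
    n = len(d)
    m = [row[:] for row in d]
    for span in range(2, n):  # shorter intervals are final before longer ones
        for i in range(n - span):
            j = i + span
            for k in range(i + 1, j):
                m[i][j] = min(m[i][j], m[i][k] + m[k][j])
    return m


def npdp_blocked(d, bs):
    """Same recurrence, computed one bs-by-bs block at a time (assumed schedule)."""
    n = len(d)
    m = [row[:] for row in d]
    nb = (n + bs - 1) // bs  # number of block rows and block columns

    def cells(b):
        # Index range covered by block row/column b (the last block may be short).
        return range(b * bs, min((b + 1) * bs, n))

    # Outer loop: visit blocks by block diagonal, so every block that
    # block (bi, bj) depends on is already final when it is reached.
    for diag in range(nb):
        for bi in range(nb - diag):
            bj = bi + diag
            # Step 1: k lies in a block strictly between bi and bj. Both operand
            # blocks are final, so this is a plain min-plus block product.
            for bk in range(bi + 1, bj):
                for i in cells(bi):
                    for k in cells(bk):
                        for j in cells(bj):
                            m[i][j] = min(m[i][j], m[i][k] + m[k][j])
            # Step 2: k lies in block bi or bj, so operands may live in this very
            # block. Rows bottom-up and columns left-to-right keep them final.
            if bi == bj:
                edge = list(cells(bi))
            else:
                edge = list(cells(bi)) + list(cells(bj))
            for i in reversed(cells(bi)):
                for j in cells(bj):
                    for k in edge:
                        if i < k < j:
                            m[i][j] = min(m[i][j], m[i][k] + m[k][j])
    return m
```

For a 3-by-3 input with m[0][1] = 1, m[1][2] = 2 and m[0][2] = 10, both functions lower m[0][2] to 3. In this sketch, blocks on the same block diagonal do not depend on each other, which is the property a parallel outer procedure could exploit; whether the paper schedules blocks this way is not stated in the abstract.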

Citations

Dissertation•10.15480/882.1184

Mapping dynamic programming algorithms on graphics processing units

Muhammad Kashif Hanif

- 01 Jan 2014

TL;DR: The parallelism exhibited by the compute-intensive tasks is studied, a parallel solution based on the matrix-matrix product method is devised for the GPU, and the opportunity to use the optimized BLAS library provided by CUDA is explored.

8

Journal Article•10.1016/J.JPDC.2019.12.007

Parallel tiled cache and energy efficient codes for O(n^4) RNA folding algorithms

Marek Palkowski, +1 more

- 01 Mar 2020

- Journal of Parallel and Distributed Comp...

TL;DR: This paper applies the previously published approach to automatically tile and parallelize each loop in the studied algorithms by means of the polyhedral model, and compares the performance and energy consumption of generated code with those of code obtained with the state-of-the-art PluTo compiler based on the affine transformation framework.

3

Book Chapter•10.1007/978-3-030-87897-9_42

RNA Folding Codes Optimization Using the Intel SDK for OpenCL.

Mateusz Gruzewski, +1 more

- 20 Jun 2021

TL;DR: In this paper, the authors applied cache efficient strategies developed within the TRACO and Pluto compilers as well as manual codes of transpose and classical loop skewing, and implemented these codes using Intel SDK for OpenCL.

3

Book Chapter•10.1007/978-3-030-43222-5_3

Parallel Tiled Cache and Energy Efficient Code for Zuker’s RNA Folding

Marek Palkowski, +1 more

- 08 Sep 2019

TL;DR: This paper applies a previously published approach to automatically tile and parallelize each loop in the Zuker RNA Folding loop nest, which is within the polyhedral model, and compares code performance and energy consumption with those obtained with the state-of-the-art PluTo compiler based on the affine transformation framework.

3

Book Chapter•10.1007/978-3-319-67229-8_18

A Practical Approach to Tiling Zuker’s RNA Folding Using the Transitive Closure of Loop Dependence Graphs

Marek Palkowski, +1 more

- 17 Sep 2017

TL;DR: This paper describes the application of a previously published approach to automatically tiling Zuker’s RNA Folding loop nest using the exact polyhedral representation of dependences exposed for this nest, and presents a significant speed-up of the tiled code on a single core of a modern processor.

3

References

Journal Article•10.1109/MM.2006.41

Synergistic Processing in Cell's Multicore Architecture

Michael K. Gschwind, +5 more

- 01 Mar 2006

- IEEE Micro

TL;DR: The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading edge architecture and compiler optimizations.

463

Journal Article•10.1147/RD.515.0559

Cell broadband engine architecture and its first implementation: a performance view

Tong Chen, +3 more

- 01 Sep 2007

- IBM Journal of Research and Development

TL;DR: It is shown that the Cell/B.E., or Cell Broadband Engine, processor can outperform other modern processors by approximately an order of magnitude and by even more in some cases.

432

Proceedings Article•10.1145/1128022.1128027

The potential of the cell processor for scientific computing

Samuel Williams, +5 more

- 03 May 2006

TL;DR: This work introduces a performance model for Cell and applies it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs, and proposes modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.

373

The Potential of the Cell Processor for Scientific Computing

Samuel Williams, +5 more

- 14 Oct 2005

TL;DR: In this article, the authors examined the potential of using the STI Cell processor as a building block for future high-end computing systems and proposed modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.

347

Journal Article•10.1093/BIOINFORMATICS/15.6.440

Fast evaluation of internal loops in RNA secondary structure prediction.

Rune B. Lyngsø, +2 more

- 01 Jun 1999

- Bioinformatics

TL;DR: A new method to evaluate internal loops is presented that works with the currently used free energy rules for stacking and loop formation. It reduces the time complexity of this part of the structure prediction from O(n^4) to O(n^3), thus reducing the overall complexity to O(n^3).

164

Related Papers (5)

A practical automatic polyhedral parallelizer and locality optimizer

GTfold: a scalable multicore code for RNA secondary structure prediction

Rapid RNA Folding: Analysis and Acceleration of the Zuker Recurrence

Cache and energy efficient algorithms for Nussinov’s RNA Folding

Algorithms for Loop Matchings