Proceedings Article, DOI: 10.1109/IPDPS.2011.186
Efficient Nonserial Polyadic Dynamic Programming on the Cell Processor
Li Liu, Mu Wang, Jinlei Jiang, Ruizhe Li, Guangwen Yang +4 more
- 16 May 2011
- pp 460-471
TL;DR: This paper proposes to improve one family of DP, nonserial polyadic DP (NPDP), targeting a heterogeneous multi-core architecture, the Cell Broadband Engine, and designs a new data layout which efficiently utilizes the on-chip memory system of the Cell processor.
Abstract: Dynamic programming (DP) is an effective technique for many search and optimization problems. However, its high arithmetic complexity limits its use. Although modern processor architectures with multiple cores and SIMD (single instruction, multiple data) instructions provide increasingly high computing power, even the state-of-the-art fully optimized algorithm still largely underutilizes modern multi-core processors. In this paper we propose to improve one family of DP, nonserial polyadic DP (NPDP), targeting a heterogeneous multi-core architecture, the Cell Broadband Engine. We first design a new data layout which efficiently utilizes the on-chip memory system of the Cell processor. Next we devise a CellNPDP algorithm with two tiers. The first tier is an SPE (a co-processor on the Cell processor) procedure which efficiently computes a block of data that can fit into one SPE's local store. The second tier is a parallel procedure which enables all SPEs to efficiently compute all blocks of data. To evaluate CellNPDP, we use both performance modeling and experiments. The performance model reveals that the processor utilization of NPDP can be independent of the problem size. To empirically evaluate CellNPDP, we use two platforms: the IBM QS20 dual-Cell blade and a CPU platform with two latest quad-core CPUs. On both platforms, the processor utilization of CellNPDP is larger than 60%, which demonstrates that our optimizations and CellNPDP can be architecture-independent. Compared to the state-of-the-art fully optimized algorithm on the CPU platform, CellNPDP is 44-fold faster for single-precision and 28-fold faster for double-precision, which is a significant improvement for NPDP.
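To make the two-tier idea in the abstract concrete, here is a minimal Python sketch under stated assumptions. It assumes a generic NPDP recurrence of the form d[i][j] = min(d[i][j], d[i][k] + d[k][j]) for i < k < j over an upper-triangular table. The abstract does not spell out the recurrence, the data layout, or the SPE procedure, so none of those are reproduced here. The function names `npdp_reference` and `npdp_blocked` and the block size `b` are hypothetical. The blocked version only illustrates that the table can be cut into square blocks, and that blocks on the same block diagonal do not depend on each other, so they could in principle be handed to different workers.

```python
def npdp_reference(d):
    """Plain triple loop for the assumed NPDP recurrence (upper triangle only)."""
    n = len(d)
    d = [row[:] for row in d]  # work on a copy
    for j in range(n):
        for i in range(j - 1, -1, -1):
            best = d[i][j]
            # d[i][k] lies in an earlier column, d[k][j] in a lower row of
            # the same column; both are final by the time we get here.
            for k in range(i + 1, j):
                best = min(best, d[i][k] + d[k][j])
            d[i][j] = best
    return d


def npdp_blocked(d, b):
    """Same recurrence, computed block by block with block size b (assumed)."""
    n = len(d)
    d = [row[:] for row in d]
    nb = (n + b - 1) // b  # number of blocks per dimension
    # Outer tier: walk block diagonals; blocks on one diagonal are independent.
    for diag in range(nb):
        for bi in range(nb - diag):
            bj = bi + diag
            row_lo, row_hi = bi * b, min((bi + 1) * b, n)
            col_lo, col_hi = bj * b, min((bj + 1) * b, n)
            # Inner tier: finish one block, rows bottom-up, columns left-to-right.
            for i in range(row_hi - 1, row_lo - 1, -1):
                for j in range(max(col_lo, i + 1), col_hi):
                    best = d[i][j]
                    for k in range(i + 1, j):
                        best = min(best, d[i][k] + d[k][j])
                    d[i][j] = best
    return d
```

For example, with `d = [[0, 1, 10], [0, 0, 2], [0, 0, 0]]` both functions lower `d[0][2]` from 10 to 3, since 1 + 2 is smaller. In this sketch the inner block loop stands in for the per-SPE tier and the diagonal sweep for the parallel tier; a real implementation would also need the SIMD and local-store considerations the abstract mentions.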
Citations
Mapping dynamic programming algorithms on graphics processing units
Muhammad Kashif Hanif
- 01 Jan 2014
TL;DR: The parallelism exhibited by the compute-intensive tasks is studied, a parallel solution for the GPU based on the matrix-matrix product method is devised, and the opportunity to use the optimized BLAS library provided by CUDA is explored.
Parallel tiled cache and energy efficient codes for O(n^4) RNA folding algorithms
TL;DR: This paper applies the previously published approach to automatically tile and parallelize each loop in the studied algorithms by means of the polyhedral model, and compares the performance and energy consumption of generated code with those of code obtained with the state-of-the-art PluTo compiler based on the affine transformation framework.
RNA Folding Codes Optimization Using the Intel SDK for OpenCL.
Mateusz Gruzewski, Marek Palkowski +1 more
- 20 Jun 2021
TL;DR: In this paper, the authors applied cache efficient strategies developed within the TRACO and Pluto compilers as well as manual codes of transpose and classical loop skewing, and implemented these codes using Intel SDK for OpenCL.
Parallel Tiled Cache and Energy Efficient Code for Zuker’s RNA Folding
Marek Palkowski, Wlodzimierz Bielecki +1 more
- 08 Sep 2019
TL;DR: This paper applies a previously published approach to automatically tile and parallelize each loop in the Zuker RNA Folding loop nest, which is within the polyhedral model, and compares code performance and energy consumption with those obtained with the state-of-the-art PluTo compiler based on the affine transformation framework.
A Practical Approach to Tiling Zuker’s RNA Folding Using the Transitive Closure of Loop Dependence Graphs
Marek Palkowski, Wlodzimierz Bielecki +1 more
- 17 Sep 2017
TL;DR: This paper describes the application of a previously published approach to automatically tiling Zuker’s RNA Folding loop nest using the exact polyhedral representation of dependences exposed for this nest, and presents a significant speed-up of the tiled code on a single core of a modern processor.
References
Synergistic Processing in Cell's Multicore Architecture
TL;DR: The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading edge architecture and compiler optimizations.
Cell broadband engine architecture and its first implementation: a performance view
TL;DR: It is shown that the Cell/B.E., or Cell Broadband Engine, processor can outperform other modern processors by approximately an order of magnitude and by even more in some cases.
The potential of the cell processor for scientific computing
Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick +5 more
- 03 May 2006
TL;DR: This work introduces a performance model for Cell and applies it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs, and proposes modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
The Potential of the Cell Processor for Scientific Computing
Samuel Williams, John Shalf, Leonid Oliker, Parry Husbands, Shoaib Kamil, Katherine Yelick +5 more
- 14 Oct 2005
TL;DR: In this article, the authors examined the potential of using the STI Cell processor as a building block for future high-end computing systems and proposed modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Fast evaluation of internal loops in RNA secondary structure prediction.
TL;DR: A new method to evaluate internal loops utilizing currently used energy rules based on free energy rules for stacking and loop formation is presented, which reduces the time complexity of this part of the structure prediction from O(n^4) to O(n^3), thus reducing the overall complexity to O(n^3).