Topic

Loop tiling

About: Loop tiling is a research topic. Over its lifetime, 683 publications have appeared within this topic, receiving 21,494 citations.

Papers

Proceedings Article•10.1145/2684746.2689060•

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Chen Zhang¹, Peng Li², Guangyu Sun¹, Yijin Guan¹, Bingjun Xiao², Jason Cong²•Institutions (2)

Peking University¹, University of California, Los Angeles²

22 Feb 2015

TL;DR: This work implements a CNN accelerator on a VC707 FPGA board and compares it to previous approaches, achieving a peak performance of 61.62 GFLOPS at a 100 MHz working frequency, which significantly outperforms previous approaches.

Abstract: Convolutional neural network (CNN) has been widely employed for image recognition because it can achieve high accuracy by emulating behavior of optic nerves in living creatures. Recently, rapid growth of modern applications based on deep learning algorithms has further improved research and implementations. Especially, various accelerators for deep CNN have been proposed based on FPGA platform because it has advantages of high performance, reconfigurability, and fast development round, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve best performance due to under-utilization of either logic resource or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. In order to overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under 100 MHz working frequency, which outperforms previous approaches significantly.

2,406 citations
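
The roofline bound that the abstract above relies on can be sketched in a few lines. This is a minimal illustration, not the paper's tool: the candidate designs, their numbers, and the names `attainable_gflops` and `pick_design` are hypothetical. It assumes the standard roofline formulation, in which attainable performance is the smaller of the computational roof and the computation-to-communication (CTC) ratio multiplied by the memory bandwidth.

```python
def attainable_gflops(compute_roof, ctc_ratio, bandwidth):
    """Roofline bound: min(compute roof, CTC ratio * memory bandwidth).

    compute_roof: peak GFLOPS the design's compute units can deliver
    ctc_ratio:    FLOPs performed per byte moved to or from off-chip memory
    bandwidth:    off-chip memory bandwidth in GB/s
    """
    return min(compute_roof, ctc_ratio * bandwidth)


def pick_design(candidates, bandwidth):
    """Return the candidate with the highest attainable performance.

    Ties are broken in favour of the lower (hypothetical) resource count.
    """
    return max(
        candidates,
        key=lambda c: (
            attainable_gflops(c["compute_roof"], c["ctc_ratio"], bandwidth),
            -c["resources"],
        ),
    )


# Made-up design points, e.g. from different tile sizes and unroll factors.
CANDIDATES = [
    {"name": "A", "compute_roof": 80.0, "ctc_ratio": 10.0, "resources": 2000},
    {"name": "B", "compute_roof": 60.0, "ctc_ratio": 20.0, "resources": 1500},
    {"name": "C", "compute_roof": 60.0, "ctc_ratio": 30.0, "resources": 1800},
]

if __name__ == "__main__":
    best = pick_design(CANDIDATES, bandwidth=4.0)
    print(best["name"])
```

In this made-up set, design A has the highest compute roof but is bandwidth-bound at 4 GB/s, so a design with a lower roof and a higher CTC ratio wins. Loop tiling matters here because the tile sizes determine how much data is reused on chip, and therefore the CTC ratio.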

Proceedings Article•10.1145/113445.113449•

A data locality optimizing algorithm

Michael Wolf¹, Monica S. Lam¹•Institutions (1)

Stanford University¹

1 May 1991

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.

Abstract: This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations. The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.

1,423 citations
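
As an illustration of the tiling transformation named in the abstract above, here is a hand-written sketch of tiled matrix multiplication. It is not the paper's algorithm, which chooses the transformations automatically; the function names and the default tile size are assumptions. Pure Python will not show the cache benefit, so the sketch only demonstrates the reordered iteration space and that the result is unchanged.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with all three loops tiled.

    The outer ii/jj/kk loops step over tiles; the inner i/j/k loops stay
    inside one tile, so a small block of A, B and C is reused before
    the computation moves on. min() handles partial tiles at the edges.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b) == matmul_naive(a, b))
```

In a compiled language the tile size would be chosen so that the working set of one tile fits in the targeted cache level.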

Proceedings Article•10.1145/106972.106981•

The cache performance and optimizations of blocked algorithms

Monica S. Lam, Edward E. Rothberg, Michael E. Wolf

1 Apr 1991

TL;DR: It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.

Abstract: Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.

1,058 citations
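
The sensitivity to stride described in the abstract above can be seen in a toy simulation. This is a deliberately simplified sketch, not the paper's analytical model: it assumes a direct-mapped cache, and the cache geometry, access pattern, and the name `count_misses` are all hypothetical.

```python
def count_misses(stride, n_elems=8, passes=2, num_lines=64, line_words=4):
    """Count misses in a toy direct-mapped cache.

    The access pattern touches n_elems words that are `stride` words apart
    and repeats that sweep `passes` times. The cache holds num_lines lines
    of line_words words each (256 words with the defaults).
    """
    tags = [None] * num_lines  # tag currently stored in each cache line
    misses = 0
    for _ in range(passes):
        for e in range(n_elems):
            addr = e * stride
            block = addr // line_words   # memory block holding this word
            index = block % num_lines    # cache line the block maps to
            tag = block // num_lines     # distinguishes blocks sharing a line
            if tags[index] != tag:
                misses += 1
                tags[index] = tag        # evict whatever was there
    return misses


if __name__ == "__main__":
    for stride in (1, 256, 260):
        print(stride, count_misses(stride))
```

With a stride equal to the cache size (256 words), all eight elements map to the same cache line and evict each other, so the second sweep reuses nothing. Padding the stride to 260 words spreads them over eight different lines and the second sweep hits every time. Copying the reused elements into consecutive locations, as the abstract recommends, has a similar effect to the unit-stride case.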

Journal Article•10.1109/71.97902•

A loop transformation theory and an algorithm to maximize parallelism

Michael Wolf¹, Monica S. Lam¹•Institutions (1)

Stanford University¹

01 Oct 1991-IEEE Transactions on Parallel and Distributed Systems

TL;DR: The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest and it is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests.

Abstract: An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. The efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five.

727 citations
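
The legality test stated in the abstract above (a transformation is legal if it leaves every dependence vector lexicographically positive) can be sketched as follows. The sketch assumes constant distance vectors and 2-deep loop nests represented by 2x2 integer matrices; the dependence set and the helper names are hypothetical, chosen only to show interchange being rejected and skewing being accepted.

```python
def apply(t, d):
    """Multiply transformation matrix t by dependence vector d."""
    return tuple(sum(t[r][c] * d[c] for c in range(len(d))) for r in range(len(t)))


def compose(t2, t1):
    """Matrix product t2 * t1: apply t1 first, then t2."""
    n = len(t1)
    return [[sum(t2[r][k] * t1[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def lex_positive(v):
    """True if the first non-zero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def is_legal(t, deps):
    """A transformation is legal if every transformed dependence stays positive."""
    return all(lex_positive(apply(t, d)) for d in deps)


INTERCHANGE = [[0, 1], [1, 0]]   # swap the two loops
SKEW = [[1, 0], [1, 1]]          # inner index becomes i + j
DEPS = [(1, -1), (0, 1)]         # hypothetical distance vectors

if __name__ == "__main__":
    print(is_legal(INTERCHANGE, DEPS))                  # interchange alone
    print(is_legal(SKEW, DEPS))                         # skewing alone
    print(is_legal(compose(INTERCHANGE, SKEW), DEPS))   # skew, then interchange
```

For this dependence set, interchange alone turns (1, -1) into (-1, 1) and is rejected. After skewing, the dependences become (1, 0) and (0, 1), so the loops can be interchanged, which is the fully permutable property that also makes tiling legal.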

Proceedings Article•10.1145/73560.73588•

Supernode partitioning

François Irigoin¹, R. Triolet¹•Institutions (1)

École Normale Supérieure¹

13 Jan 1988

TL;DR: A class of partitionings is presented that encompasses previous techniques and provides enough flexibility to adapt code to multiprocessors with two levels of parallelism and two levels of memory.

Abstract: Supercompilers must reschedule computations defined by nested DO-loops in order to make efficient use of supercomputer features (vector units, multiple elementary processors, cache memory, etc.). Many rescheduling techniques like loop interchange, loop strip-mining or rectangular partitioning have been described to speed up program execution. We present here a class of partitionings that encompasses previous techniques and provides enough flexibility to adapt code to multiprocessors with two levels of parallelism and two levels of memory.

635 citations
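
Two of the earlier techniques named in the abstract above, loop strip-mining and rectangular partitioning, can be sketched as iteration-space generators. This covers only those special cases, not the paper's more general class of partitionings, and the function names and sizes are hypothetical.

```python
def strip_mine(n, strip):
    """Strip-mine a single loop of n iterations into strips of length `strip`.

    The outer loop walks over strips and the inner loop walks inside one
    strip, so the iterations run in the original order.
    """
    for start in range(0, n, strip):
        for i in range(start, min(start + strip, n)):
            yield i


def rectangular_tiles(n_i, n_j, t_i, t_j):
    """Partition an n_i x n_j iteration space into t_i x t_j rectangles.

    Each yielded list holds the (i, j) points of one tile; edge tiles may
    be smaller when the tile size does not divide the loop bounds.
    """
    for ii in range(0, n_i, t_i):
        for jj in range(0, n_j, t_j):
            yield [(i, j)
                   for i in range(ii, min(ii + t_i, n_i))
                   for j in range(jj, min(jj + t_j, n_j))]


if __name__ == "__main__":
    print(list(strip_mine(7, 3)))
    for tile in rectangular_tiles(5, 4, 2, 3):
        print(tile)
```

Each tile can then be treated as one unit of work, for example assigned to a processor or sized to fit a level of the memory hierarchy, provided the dependences of the loop nest allow the tiles to be executed in that order.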

Performance Metrics

Papers: 704
Citations: 7,273

No. of papers in the topic in previous years
Year	Papers
2024	1
2023	5
2022	13
2021	12
2020	14
2019	14
