TL;DR: This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors, and proposes a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block.
Abstract: This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code.This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement be obtained.The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
TL;DR: This paper presents a practical algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine models and characterizes the algorithm in terms of the quality of the generated schedules as well the computational expense incurred.
Abstract: Module scheduling is a framework within which a wide variety of algorithms and heuristics may be defined for software pipelining innermost loops. This paper presents a practical algorithm, iterative module scheduling, that is capable of dealing with realistic machine models. This paper also characterizes the algorithm in terms of the quality of the generated schedules as well the computational expense incurred.
TL;DR: The Bulldog compiler described here uses several new compilation techniques: trace scheduling to find more parallelism, memory-reference and memorybank disambiguation to increase memory bandwidth, and new code-generation algorithms.
Abstract: "Bulldog "demonstrates that a symbiosis of new Very Long Instruction Word (VLIW) architectures and new compiling technology is practicable.VLIW architectures are reduced-instruction-set machines with a large number of parallel, pipelined functional units but only a single thread of control. These machines offer the promise of an immediate order-of-magnitude increase in speed for general purpose scientific computing. However, a traditional compiler can't find enough parallelism in scientific programs to utilize a VLIW effectively. The Bulldog compiler described here uses several new compilation techniques: trace scheduling to find more parallelism, memory-reference and memorybank disambiguation to increase memory bandwidth, and new code-generation algorithms.Although originally developed for VLIWs, many of the ideas in "Bulldog "could be applied to pipelined reduced-instruction-set architectures such as the MIPS. Ellis's experiments indicate that speed improvements of thirty to eighty percent are possible for scientific code on such machines.John R. Ellis received his doctorate from Yale University and is currently Principal Software Engineer, Digital Equipment Corporation Systems Research Center, Palo Alto. "Bulldog: A Compiler for VLIW Architectures" is winner of the 1985 ACM Doctoral Dissertation Award.
TL;DR: This work designs a novel and flexible technique, called rotation scheduling, for scheduling cyclic DFGs using loop pipelining, and provides a theoretical basis for the operations based on retiming.
Abstract: We consider the resource-constrained scheduling of loops with inter-iteration dependencies. A loop is modeled as a data flow graph (DFG), where edges are labeled with the number of iterations between dependencies. We design a novel and flexible technique, called rotation scheduling, for scheduling cyclic DFGs using loop pipelining. The rotation technique repeatedly transforms a schedule to a more compact schedule. We provide a theoretical basis for the operations based on retiming. We propose two heuristics to perform rotation scheduling, and give experimental results showing that they have very good performance.
TL;DR: A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph, to move instructions well beyond basic block boundaries.
Abstract: To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increases, it becomes evident that scheduling should be done beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph, to move instructions well beyond basic block boundaries. This novel scheduling framework is based on the parametric description of the machine architecture, which spans a range of superscakis and VLIW machines, and exploits speculative execution of instructions to further enhance the performance of the general code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on the IBM RISC System/6000 machines.