TL;DR: This paper presents the first prototype of the XMT architecture, which incorporates 64 simple in-order processors operating at 75MHz; it describes the prototype's microarchitecture and studies its performance on a set of micro-benchmarks.
Abstract: PRAM (Parallel Random Access Model) has been widely regarded as a desirable parallel machine model for many years, but it is also believed to be "impossible in reality." As the new billion-transistor processor era begins, the eXplicit Multi-Threading (XMT) PRAM-On-Chip project is attempting to design an on-chip parallel processor that efficiently supports PRAM algorithms. This paper presents the first prototype of the XMT architecture, which incorporates 64 simple in-order processors operating at 75MHz. The microarchitecture of the prototype is described and its performance is studied with respect to some micro-benchmarks. Using cycle-accurate emulation, the projected performance of an 800MHz XMT ASIC processor is compared with that of a 2.6GHz AMD Opteron, which occupies an area similar to that of a 64-processor ASIC version of the XMT prototype. The results suggest that an XMT ASIC system running at only 800MHz outperforms the 2.6GHz AMD Opteron, with speedups ranging between 1.57 and 8.56.
TL;DR: This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP), and introduces Explicit Multi-Threading (XMT).
Abstract: This paper envisions an extension to a standard instruction set that efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP). It introduces Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation; new elements are added where needed.
TL;DR: This brief announcement (BA) reports the first commitment to silicon of XMT: a 64-processor, 75MHz computer based on field-programmable gate array (FPGA) technology, built at the University of Maryland (UMD), and a significant milestone for the broad PRAM-On-Chip project at UMD.
Abstract: The eXplicit Multi-Threading (XMT) on-chip general-purpose computer architecture is aimed at the classic goal of reducing single task completion time. It is a parallel algorithmic architecture in the sense that: (i) it seeks to provide good performance for parallel programs derived from Parallel Random Access Machine/Model (PRAM) algorithms, and (ii) a methodology for advancing from PRAM algorithms to XMT programs, along with a performance metric and its empirical validation, is provided [1]. Ease of parallel programming is now widely recognized as the main stumbling block for extending commodity computer performance growth (e.g., using multicores). XMT provides a unique answer to this challenge. This brief announcement (BA) reports the first commitment to silicon of XMT. A 64-processor, 75MHz computer based on field-programmable gate array (FPGA) technology was built at the University of Maryland (UMD). XMT was introduced in SPAA’98. An architecture simulator and speed-up results on several kernels were reported in SPAA’01. The new computer is a significant milestone for the broad PRAM-On-Chip project at UMD. In fact, contributions in the current BA include several stages since SPAA’01: completion of the design using a hardware description language (HDL), synthesis into a gate-level “netlist”, as well as validation of the design in real hardware. This overall progress, its context, and the uses of the much faster hardware over a simulator are the focus of this BA. The PRAM virtual model of computation assumes that any number of concurrent accesses to a shared memory take the same time as a single access. In the Arbitrary Concurrent-Read Concurrent-Write (CRCW) PRAM, concurrent accesses to the same memory location for reads or writes are allowed. Reads are resolved before writes, and an arbitrary write, unknown in advance, succeeds.
Design of an efficient parallel algorithm for the Arbitrary CRCW PRAM model would seek to optimize the total number of operations the algorithm performs (“work”) and its parallel time (“depth”), assuming unlimited hardware. Given such an algorithm, an XMT program is written in XMTC, which is a modest single-program multiple-data (SPMD) multi-threaded extension of C that includes three commands: Spawn, Join and PS (for Prefix-Sum, a Fetch-and-Increment-like command). The program seeks to optimize: (i) the length of the (longest) sequence of round trips to memory (LSRTM), (ii) queuing delay to the
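
The Spawn/Join/PS pattern described in this abstract can be illustrated with a small sketch. It is written in Python rather than XMTC, and the names `Counter`, `ps`, `spawn`, and `compact` are hypothetical helpers introduced only for illustration; they are not XMTC syntax. Threads are emulated one after another in a shuffled order, which is an assumption standing in for real concurrency. Each thread that holds a nonzero value uses the fetch-and-increment-like `ps` step to claim a unique output slot, so the result is a dense array whose element order is arbitrary.

```python
import random


class Counter:
    """Shared base value that emulated threads update through ps()."""

    def __init__(self, value=0):
        self.value = value


def ps(base, increment):
    """Fetch-and-increment-like step: return the old value, then add increment."""
    old = base.value
    base.value += increment
    return old


def spawn(n_threads, body, seed=0):
    """Run body(tid) once per thread id, in an arbitrary (shuffled) order.

    This is a sequential emulation; returning from the function plays the
    role of the join.
    """
    order = list(range(n_threads))
    random.Random(seed).shuffle(order)
    for tid in order:
        body(tid)


def compact(values):
    """Copy the nonzero entries of values into a dense output list."""
    out = [None] * len(values)
    count = Counter(0)

    def body(tid):
        if values[tid] != 0:
            slot = ps(count, 1)  # every claiming thread receives a distinct slot
            out[slot] = values[tid]

    spawn(len(values), body)
    return out[:count.value]
```

Calling `compact([0, 5, 0, 7, 3, 0])` returns the three nonzero values in some order; because slot assignment depends on the order in which threads reach `ps`, a caller should not rely on the output order.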
TL;DR: Simulation based on full network layout demonstrates that significant throughput improvement can be achieved over the originally proposed MoT interconnection network, which was previously shown to be competitive with traditional network solutions.
Abstract: A mesh of trees (MoT) on-chip interconnection network has been proposed recently to provide high throughput between memory units and processors for single-chip parallel processing (Balkan et al., 2006). In this paper, we report our findings in bringing this concept to silicon. Specifically, we conduct cycle-accurate Verilog simulations to verify the analytical results claimed in (Balkan et al., 2006). We synthesize and obtain the layout of MoT interconnection networks of various sizes. To further improve throughput, we investigate different arbitration primitives to handle load and store, the two most common memory operations. We also study the use of pipeline registers in large networks when there are long wires. Simulation based on full network layout demonstrates that significant throughput improvement can be achieved over the originally proposed MoT interconnection network. The importance of this work lies in its validation of performance features of the MoT interconnection network, as they were previously shown to be competitive with traditional network solutions. The MoT network is currently used in an eXplicit multi-threading (XMT) on-chip parallel processor, which is engineered to support parallel programming. In that context, a 32-terminal MoT network could support up to 512 on-chip XMT processors. Our 8-terminal network, which could serve 8 processor clusters (or 128 total processors), was also accepted recently for fabrication.
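
As a quick consistency check on the sizes quoted in this abstract, the short Python sketch below divides the processor counts by the terminal counts. The helper name `processors_per_terminal` is hypothetical, and the even split of processors across terminals is an assumption drawn from the abstract's description of one cluster per terminal in the 8-terminal case.

```python
def processors_per_terminal(total_processors, terminals):
    """Processors served by each network terminal, assuming an even split."""
    if total_processors % terminals != 0:
        raise ValueError("processors do not divide evenly across terminals")
    return total_processors // terminals


# Figures quoted in the abstract.
small = processors_per_terminal(128, 8)    # 8-terminal network, 128 processors
large = processors_per_terminal(512, 32)   # 32-terminal network, 512 processors
print(small, large)  # prints "16 16"
```

Both configurations work out to 16 processors per terminal, so the two figures in the abstract are consistent with the same cluster size.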
TL;DR: Test results for a parallel gate-level logic simulator implemented on an Explicit-multi-threading (XMT) platform show potential for more than a hundred-fold speedup over a serial implementation, indicating an interesting possibility for a certain type of single-chip multicore architecture: use an existing easy-to-program API, such as VHDL or Verilog, for reduced application-software development time and better performance than serial performance-driven languages.
Abstract: Explicit-multi-threading (XMT) is a parallel programming approach for exploiting on-chip parallelism. Its fine-grained single program multiple data (SPMD) programming model is suitable for many compute-intensive applications. In this paper, we present a parallel gate-level logic simulator implemented on an XMT platform and study its performance. Test results show potential for achieving more than a hundred-fold speedup over a serial implementation. This indicates an interesting possibility for a certain type of single-chip multicore architecture: use an existing easy-to-program API, such as VHDL or Verilog, for reduced application-software development time and better performance than serial performance-driven languages, such as C.