TL;DR: OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework that leverages the industry hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore creating a flexible, modern manycore design.
Abstract: Industry is building larger, more complex, manycore processors on the back of strong institutional knowledge, but academic projects face difficulties in replicating that scale. To alleviate these difficulties and to develop and share knowledge, the community needs open architecture frameworks for simulation, synthesis, and software exploration which support extensibility, scalability, and configurability, alongside an established base of verification tools and supported software. In this paper we present OpenPiton, an open source framework for building scalable architecture research prototypes from 1 core to 500 million cores. OpenPiton is the world's first open source, general-purpose, multithreaded manycore processor and framework. OpenPiton leverages the industry hardened OpenSPARC T1 core with modifications and builds upon it with a scratch-built, scalable uncore creating a flexible, modern manycore design. In addition, OpenPiton provides synthesis and backend scripts for ASIC and FPGA to enable other researchers to bring their designs to implementation. OpenPiton provides a complete verification infrastructure of over 8000 tests, is supported by mature software tools, runs full-stack multiuser Debian Linux, and is written in industry standard Verilog. Multiple implementations of OpenPiton have been created including a taped-out 25-core implementation in IBM's 32nm process and multiple Xilinx FPGA prototypes.
TL;DR: BOOM is a synthesizable, parameterized, superscalar out-of-order RISC-V core designed to serve as the prototypical baseline processor for future micro-architectural studies of out-of-order processors.
Abstract: BOOM is a synthesizable, parameterized, superscalar out-of-order RISC-V core designed to serve as the prototypical baseline processor for future micro-architectural studies of out-of-order processors. Our goal is to provide a readable, open-source implementation for use in education, research, and industry. BOOM is written in roughly 9,000 lines of the hardware construction language Chisel. We leveraged Berkeley's open-source Rocket-chip SoC generator, allowing us to quickly bring up an entire multi-core processor system (including caches and uncore) by replacing the in-order Rocket core with an out-of-order BOOM core. BOOM supports atomics, IEEE754-2008 floating-point, and page-based virtual memory. We have demonstrated BOOM running Linux, SPEC CINT2006, and CoreMark.
TL;DR: A new timing simulator is presented that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion).
Abstract: For academic computer architecture research, a large number of publicly available simulators make use of relatively simple abstractions for the microarchitecture of the processor pipeline. For some types of studies, such as those for multi-core cache coherence designs, a simple pipeline model may suffice. For detailed microarchitecture research, such as those that are sensitive to the exact behavior of out-of-order scheduling, ALU and bypass network contention, and resource management (e.g., RS and ROB entries), an over-simplified model is not representative of modern processor organizations. We present a new timing simulator that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion, microcode lookup overhead for long/complex x86 instructions).
TL;DR: This paper describes a dual-core 64-b Xeon MP processor implemented in a 65-nm eight-metal process that implements both sleep and shut-off leakage reduction modes and employs multiple voltage and clock domains to reduce power.
Abstract: This paper describes a dual-core 64-b Xeon MP processor implemented in a 65-nm eight-metal process. The 435-mm² die has 1.328-B transistors. Each core has two threads and a unified 1-MB L2 cache. The 16-MB shared, 16-way set-associative L3 cache implements both sleep and shut-off leakage reduction modes. Long channel transistors are used to reduce subthreshold leakage in cores and uncore (all portions of the die that are outside the cores) control logic. Multiple voltage and clock domains are employed to reduce power.
TL;DR: The interconnect features, as well as the capabilities built into the processor’s system interconnect logic (also known as “uncore”), work together to deliver the performance, scalability, and reliability demanded in larger scale systems.
Abstract: Single processor performance has exhibited substantial growth over the last three decades [1] as shown in Figure 1. What is also desired are techniques which enable connecting together multiple processors in order to create scalable, modular and resilient multiprocessor systems. Beginning with the production of the Intel® Xeon® processor 5500 series (previously codenamed “Nehalem-EP”), the Intel® Xeon® processor 7500 series (previously codenamed “Nehalem-EX”), and the Intel® Itanium™ processor 9300 series (previously codenamed “Tukwila-MC”), Intel Corporation has introduced a series of multi-core processors that can be easily interconnected to create server systems scaling from 2 to 8 sockets. In addition, OEM platforms are currently available that extend this up to 256-socket server designs. This scalable system architecture is built upon the foundation of the Intel® QuickPath Interconnect (Intel QPI). These Intel micro-architectures provide multiple high-speed (currently up to 25.6 GB/s), point-to-point connections between processors, I/O hubs and third party node controllers. The interconnect features, as well as the capabilities built into the processor’s system interconnect logic (also known as “uncore”), work together to deliver the performance, scalability, and reliability demanded in larger scale systems.