TL;DR: The Intel Pentium 4's unique performance-monitoring features overcome many limitations and problems found in previous processors.
Abstract: The Intel Pentium 4's unique performance-monitoring features overcome many limitations and problems found in previous processors. Pentium 4 Xeon performance monitoring supports simultaneous multithreaded execution features.
TL;DR: The performance of an Intel Xeon processor enabled with Hyper-Threading Technology is compared to that of a dual Xeon processor that does not have Hyper-Threading Technology on a range of compute-intensive, data-parallel applications threaded with OpenMP.
Abstract: Intel’s recently introduced Hyper-Threading Technology promises to increase application- and system-level performance through increased utilization of processor resources. It achieves this goal by allowing the processor to simultaneously maintain the context of multiple instruction streams and execute multiple instruction streams or threads. These multiple streams afford the processor added flexibility in internal scheduling, lowering the impact of external data latency, raising utilization of internal resources, and increasing overall performance. We compare the performance of an Intel Xeon processor enabled with Hyper-Threading Technology to that of a dual Xeon processor that does not have Hyper-Threading Technology on a range of compute-intensive, data-parallel applications threaded with OpenMP. The applications include both real-world codes and hand-coded “kernels” that illustrate performance characteristics of Hyper-Threading Technology. The results demonstrate that, in addition to functionally decomposed applications, the technology is effective for many data-parallel applications. Using hardware performance counters, we identify some characteristics of applications that make them especially promising candidates for high performance on threaded processors. Finally, we explore some of the issues involved in threading codes to exploit Hyper-Threading Technology, including a brief survey of both existing and still-needed tools to support multi-threaded software development. OpenMP is an industry-standard specification for multithreading data-intensive and other highly structured applications in C, C++, and Fortran. See www.openmp.org for more information. Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Xeon is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Intel Technology Journal Q1, 2002. Vol. 6 Issue 1. Hyper-Threading Technology: Impact on Compute-Intensive Workloads.
INTRODUCTION
While the most visible indicator of computer performance is its clock rate, overall system performance is also proportional to the number of instructions retired per clock cycle. Ever-increasing demand for processing speed has driven an impressive array of architectural innovations in processors, resulting in substantial improvements in clock rates and instructions per cycle. One important innovation, super-scalar execution, exploits multiple execution units to allow more than one operation to be in flight simultaneously. While the performance potential of this design is enormous, keeping these units busy requires super-scalar processors to extract independent work, or instruction-level parallelism (ILP), directly from a single instruction stream. Modern compilers are very sophisticated and do an admirable job of exposing parallelism to the processor; nonetheless, ILP is often limited, leaving some internal processor resources unused. This can occur for a number of reasons, including long latency to main memory, branch misprediction, or data dependences in the instruction stream itself. Achieving additional performance often requires tedious performance analysis, experimentation with advanced compiler optimization settings, or even algorithmic changes. Feature sets, rather than performance, drive software economics. As a result, most applications never undergo performance tuning beyond default compiler optimization. An Intel processor with Hyper-Threading Technology offers a different approach to increasing performance. By presenting itself to the operating system as two logical processors, it is afforded the benefit of simultaneously scheduling two potentially independent instruction streams [1]. This explicit parallelism complements ILP to increase instructions retired per cycle and increase overall system utilization.
This approach is known as simultaneous multi-threading, or SMT. Because the operating system treats an SMT processor as two separate processors, Hyper-Threading Technology is able to leverage the existing base of multi-threaded applications and deliver immediate performance gains. To assess the effectiveness of this technology, we first measure the performance of existing multi-threaded applications on systems containing the Intel Xeon processor with Hyper-Threading Technology. We then examine the system’s performance characteristics more closely using a selection of hand-coded application kernels. Finally, we consider the issues and challenges application developers face in creating new threaded applications, including existing and needed tools for efficient multi-threaded development.
APPLICATION SCOPE
While many existing applications can benefit from Hyper-Threading Technology, we focus our attention on single-process, numerically intensive applications. By numerically intensive, we mean applications that rarely wait on external inputs, such as remote data sources or network requests, and instead work out of main system memory. Typical examples include mechanical design analysis, multi-variate optimization, electronic design automation, genomics, photo-realistic rendering, weather forecasting, and computational chemistry. A fast turnaround of results normally provides significant value to the users of these applications through better quality products delivered more quickly to market. The data-intensive nature of these codes, paired with the demand for better performance, makes them ideal candidates for multi-threaded speed-up on shared memory multi-processor (SMP) systems.
We considered a range of applications, threaded with OpenMP, that show good speed-up on SMP systems. The applications and their problem domains are listed in Table 1. Each of these applications achieves 100% processor utilization from the operating system’s point of view. Despite external appearances, however, internal processor resources often remain underutilized. For this reason, these applications appeared to be good candidates for additional speed-up via Hyper-Threading Technology. Table 1: Applications type
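As a rough illustration of the data-parallel threading style described above, the following Python sketch splits a loop's iteration range into contiguous, near-equal blocks and hands one block to each worker, similar in spirit to a statically scheduled parallel loop. It is not the authors' code and does not use OpenMP; the names `static_chunks` and `parallel_sum_of_squares` are hypothetical, and the sum-of-squares workload is an assumed stand-in for a real kernel.

```python
from concurrent.futures import ThreadPoolExecutor


def static_chunks(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal (lo, hi) blocks."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional iteration each.
        size = base + (1 if worker < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def parallel_sum_of_squares(data, n_workers=2):
    """Data-parallel reduction: each worker sums the squares of its own block."""
    def work(bounds):
        lo, hi = bounds
        return sum(x * x for x in data[lo:hi])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, static_chunks(len(data), n_workers)))
    # Combine the per-worker partial results.
    return sum(partials)


if __name__ == "__main__":
    print(static_chunks(10, 4))
    print(parallel_sum_of_squares(list(range(10)), n_workers=2))
```

The sketch shows only the decomposition pattern. In CPython, threads running pure-Python arithmetic do not execute in parallel, so it should not be used to estimate the speed-ups discussed in the text.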
TL;DR: This paper describes how Hyper-Threading Technology impacts pre-silicon validation, the new validation challenges created by this technology, and the strategy for pre-silicon validation.
Abstract: Hyper-Threading Technology delivers significantly improved architectural performance at a lower-than-traditional power consumption and die size cost. However, increased logic complexity is one of the trade-offs of this technology. Hyper-Threading Technology exponentially increases the micro-architectural state space, decreases validation controllability, and creates a number of new and interesting micro-architectural boundary conditions. On the Intel Xeon processor family, which implements two logical processors per physical processor, there are multiple, independent logical processor selection points that use several algorithms to determine logical processor selection. Four types of resources (Duplicated, Fully Shared, Entry Tagged, and Partitioned) are used to support the technology. This complexity adds to the pre-silicon validation challenge. Not only is the architectural state space much larger (see “Hyper-Threading Technology Architecture and Microarchitecture” in this issue of the Intel Technology Journal), but also a temporal factor is involved. Testing an architectural state may not be effective if one logical processor is halted before the other logical processor is halted. The multiple, independent, logical processor selection points and interference from simultaneously executing instructions reduce controllability. This in turn increases the difficulty of setting up precise boundary conditions to test. Supporting four resource types creates new validation conditions such as cross-logical processor corruption of the architectural state. Moreover, Hyper-Threading Technology provides support for inter- and intra-logical processor store-to-load forwarding, greatly increasing the challenge of memory ordering and memory coherency validation.
This paper describes how Hyper-Threading Technology impacts pre-silicon validation, the new validation challenges created by this technology, and our strategy for pre-silicon validation. Bug data are then presented and used to demonstrate the effectiveness of our pre-silicon Hyper-Threading Technology validation. Intel and Pentium are registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. NetBurst is a trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Intel Technology Journal Q1, 2002. Vol. 6 Issue 1. Pre-Silicon Validation of Hyper-Threading Technology.
INTRODUCTION
Intel IA-32 processors that feature the Intel NetBurst microarchitecture can also support Hyper-Threading Technology or simultaneous multi-threading (SMT). Pre-silicon validation of Hyper-Threading Technology was successfully accomplished in parallel with the Pentium 4 processor pre-silicon validation, and it leveraged the Pentium 4 processor pre-silicon validation techniques of Formal Verification (FV), Cluster Test Environments (CTEs), Architecture Validation (AV), and Coverage-Based Validation.
THE CHALLENGES OF PRE-SILICON HYPER-THREADING TECHNOLOGY VALIDATION
The main validation challenge presented by Hyper-Threading Technology was an increase in complexity that manifested itself in these major ways:
• Project management issues.
• An increase in the number of operating modes: MT-mode, ST0-mode, and ST1-mode, each described in “Hyper-Threading Technology Architecture and Microarchitecture” in this issue of the Intel Technology Journal.
• A squaring of the architectural state space.
• A decrease in controllability.
• An increase in the number and complexity of microarchitectural boundary conditions.
• New validation concerns for logical processor starvation and fairness.
Microprocessor validation was already an exercise in the intractable engineering problem of ensuring the correct functionality of an immensely complex design with a limited budget and on a tight schedule. Hyper-Threading Technology made it even more intractable. Hyper-Threading Technology did not demand entirely new validation methods, and it did fit within the already planned Pentium 4 processor validation framework of formal verification, cluster testing, architectural validation, and coverage-based microarchitectural validation. What Hyper-Threading Technology did require, however, was an increase in validation staffing and a significant increase in computing capacity.
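The claim that two logical processors square the architectural state space, and that the order of events adds a temporal factor, can be made concrete with a toy enumeration. This is only a sketch under assumed inputs: the three per-processor states and the function names `joint_states` and `halt_orderings` are hypothetical and are not taken from the validation environment described in the paper.

```python
from itertools import permutations, product


def joint_states(per_processor_states):
    """All combined (LP0, LP1) states when each logical processor
    independently occupies one of the given states."""
    return list(product(per_processor_states, repeat=2))


def halt_orderings():
    """The distinct orders in which two logical processors can halt.
    Both orders end in the same (halted, halted) state, yet they are
    different paths, so they count as different test scenarios."""
    return list(permutations(["LP0 halts", "LP1 halts"]))


if __name__ == "__main__":
    states = ["active", "halted", "stalled"]  # hypothetical per-processor states
    pairs = joint_states(states)
    print(len(states), "single-processor states ->", len(pairs), "joint states")
    for order in halt_orderings():
        print(" then ".join(order))
```

With N states per logical processor the joint space has N squared entries, which is why even a modest per-processor state count grows quickly once a second logical processor is added.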
TL;DR: This work demonstrates the speedup achieved by the MPI (Message Passing Interface) parallel implementation of the Steepest Descent Fast Multipole Method (SDFMM), which has already been optimized to take advantage of the structure of the physics of scattering problems.
Abstract: The computational solution of large-scale linear systems of equations necessitates the use of fast algorithms but is also greatly enhanced by employing parallelization techniques. The objective of this work is to demonstrate the speedup achieved by the MPI (Message Passing Interface) parallel implementation of the Steepest Descent Fast Multipole Method (SDFMM). Although this algorithm has already been optimized to take advantage of the structure of the physics of scattering problems, there is still the opportunity to speed up the calculation by dividing tasks into components and solving them in parallel on multiple processors. The SDFMM has three bottlenecks, ordered as (1) filling the sparse impedance matrix associated with the near-field Method of Moments (MoM) interactions, (2) the matrix-vector multiplications associated with this sparse matrix, and (3) the far-field interactions associated with the fast multipole method. The parallel implementation task is accomplished using a thirty-one node Intel Pentium Beowulf cluster and is also validated on a 4-processor Alpha workstation. The Beowulf cluster consists of thirty-one nodes of 350 MHz Intel Pentium IIs with 256 MB of RAM and one node of a 4x450 MHz Intel Pentium II Xeon shared memory processor with 2 GB of RAM, with all nodes connected to a 100BASE-TX Ethernet network. The Alpha workstation has a maximum of four 667 MHz processors. Our numerical results show significant linear speedup in filling the sparse impedance matrix. Using the 32 processors on the Beowulf cluster yields an overall speedup of 7.2, while using the 4 processors on the Alpha workstation yields an overall speedup of 2.5.
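To put the two reported overall speedups on a common footing, one can compute parallel efficiency (speedup divided by processor count). The sketch below also computes the serial fraction implied by Amdahl's law, often called the Karp-Flatt metric; this is a standard diagnostic brought in here for illustration and is not something the abstract itself reports. The function names are hypothetical.

```python
def parallel_efficiency(speedup, n_processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_processors


def implied_serial_fraction(speedup, n_processors):
    """Serial fraction implied by Amdahl's law for a measured speedup
    (the Karp-Flatt metric): (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / n_processors) / (1.0 - 1.0 / n_processors)


if __name__ == "__main__":
    # Overall speedups quoted in the abstract above.
    for name, speedup, procs in [("Beowulf cluster", 7.2, 32),
                                 ("Alpha workstation", 2.5, 4)]:
        eff = parallel_efficiency(speedup, procs)
        frac = implied_serial_fraction(speedup, procs)
        print(f"{name}: efficiency {eff:.3f}, implied serial fraction {frac:.3f}")
```

Under these definitions the cluster reaches an efficiency of 0.225 and the workstation 0.625. The implied serial fraction lumps communication overhead together with genuinely serial work, so it should be read as a rough indicator only.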
TL;DR: The Hyper-Threading Technology architecture is described, and the microarchitecture details of Intel's first implementation on the Intel Xeon processor family are discussed; the technology is an important addition to Intel's enterprise product line and will be integrated into a wide variety of products.
Abstract: Intel’s Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture. Hyper-Threading Technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources. This paper describes the Hyper-Threading Technology architecture, and discusses the microarchitecture details of Intel's first implementation on the Intel Xeon processor family. Hyper-Threading Technology is an important addition to Intel’s enterprise product line and will be integrated into a wide variety of products.
INTRODUCTION
The amazing growth of the Internet and telecommunications is powered by ever-faster systems demanding increasingly higher levels of processor performance. To keep up with this demand we cannot rely entirely on traditional approaches to processor design. Microarchitecture techniques used to achieve past processor performance improvement (super-pipelining, branch prediction, super-scalar execution, out-of-order execution, caches) have made microprocessors increasingly complex, with more transistors and higher power consumption. In fact, transistor counts and power are increasing at rates greater than processor performance.
Processor architects are therefore looking for ways to improve performance at a greater rate than transistor counts and power dissipation. Intel’s Hyper-Threading Technology is one solution.
Processor Microarchitecture
Traditional approaches to processor design have focused on higher clock speeds, instruction-level parallelism (ILP), and caches. Techniques to achieve higher clock speeds involve pipelining the microarchitecture to finer granularities, also called super-pipelining. Higher clock frequencies can greatly improve performance by increasing the number of instructions that can be executed each second. Because there will be far more instructions in flight in a super-pipelined microarchitecture, handling of events that disrupt the pipeline, e.g., cache misses, interrupts, and branch mispredictions, can be costly. ILP refers to techniques to increase the number of instructions executed each clock cycle. For example, a super-scalar processor has multiple parallel execution units that can process instructions simultaneously. With super-scalar execution, several instructions can be executed each clock cycle. However, with simple in-order execution, it is not enough to simply have multiple execution units. The challenge is to find enough instructions to execute. One technique is out-of-order execution, where a large window of instructions is simultaneously evaluated and sent to execution units based on instruction dependencies rather than program order. Accesses to DRAM memory are slow compared to execution speeds of the processor. One technique to reduce this latency is to add fast caches close to the processor. Caches can provide fast memory access to frequently accessed data or instructions. However, caches can only be fast when they are small.
For this reason, processors are often designed with a cache hierarchy in which fast, small caches are located close to the processor core and operate at access latencies very close to its own, while progressively larger caches, which handle less frequently accessed data or instructions, are implemented with longer access latencies. However, there will always be times when the data needed will not be in any processor cache. Handling such cache misses requires accessing memory, and the processor is likely to quickly run out of instructions to execute before stalling on the cache miss. The vast majority of techniques to improve processor performance from one generation to the next are complex and often add significant die-size and power costs. These techniques increase performance but not with 100% efficiency; i.e., doubling the number of execution units in a processor does not double the performance of the processor, due to limited parallelism in instruction flows. Similarly, simply doubling the clock rate does not double the performance due to the number of processor cycles lost to branch mispredictions.
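The point that adding execution units does not scale performance linearly can be sketched with a toy dependency-driven scheduler. This is a simplified model under stated assumptions, not a description of any real pipeline: every instruction takes one cycle, up to `issue_width` ready instructions issue per cycle, and the eight-instruction program and the name `cycles_to_execute` are hypothetical.

```python
def cycles_to_execute(deps, issue_width):
    """Count cycles needed to run a set of unit-latency instructions.

    deps maps each instruction name to the set of instructions it depends on.
    Each cycle, up to issue_width instructions whose dependencies completed
    in an earlier cycle are issued, regardless of program order.
    """
    remaining = dict(deps)
    done = set()
    cycles = 0
    while remaining:
        ready = [name for name, needs in remaining.items() if needs <= done]
        issued = ready[:issue_width]
        if not issued:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for name in issued:
            del remaining[name]
        # Results become visible to dependents only in the next cycle.
        done.update(issued)
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Hypothetical program: the chain a -> c -> d -> g -> h is the critical path.
    program = {
        "a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"},
        "e": set(), "f": {"e"}, "g": {"d", "f"}, "h": {"g"},
    }
    for width in (1, 2, 4):
        print(f"issue width {width}: {cycles_to_execute(program, width)} cycles")
```

For this program, going from one unit to two reduces the count from 8 cycles to 5, but going from two to four gives no further gain because the five-instruction dependency chain sets a floor. That is the limited parallelism in a single instruction flow that the paragraph above refers to.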