About: DEC Alpha is a research topic. Over its lifetime, 79 publications have appeared within this topic, receiving 1467 citations. The topic is also known as: Alpha processor and alpha.
TL;DR: The CRAY T3E is a scalable shared-memory multiprocessor based on the DEC Alpha 21164 microprocessor, which includes a number of architectural features designed to tolerate latency and enhance scalability.
Abstract: The CRAY T3E is a scalable shared-memory multiprocessor based on the DEC Alpha 21164 microprocessor. The system includes a number of novel architectural features designed to tolerate latency, enhance scalability, and deliver high performance on scientific and engineering codes. Included among these are stream buffers, which detect and prefetch down small-stride reference streams, E-registers, which provide latency hiding and non-unit-stride access capabilities, barrier and fetch_and_op synchronization support, and a scalable, high-bandwidth interconnection network. This paper reports our experiences with the CRAY T3E and presents a variety of performance measurements. Section 2 provides a brief overview of the system architecture. Section 3 describes the latency-hiding features (caches, stream buffers and E-registers) in more detail, assesses their performance impact, and discusses coding techniques for using them. Section 4 presents single-processor performance results. Finally, Section 5 discusses system scalability.
TL;DR: Simulation-based functional verification was performed on the logic design using implementation-directed, pseudo-random exercisers, supplemented with implementation-specific, hand-generated tests, and extensive functional coverage analysis was performed to grade and direct the verification effort.
Abstract: DIGITAL's Alpha 21264 processor is a highly out-of-order, superpipelined, superscalar implementation of the Alpha architecture, capable of a peak execution rate of six instructions per cycle and a sustainable rate of four per cycle. The 21264 also features a 500 MHz clock speed and a high-bandwidth system interface that channels up to 5.3 Gbytes/second of cache data and 2.6 Gbytes/second of main-memory data into the processor. Simulation-based functional verification was performed on the logic design using implementation-directed, pseudo-random exercisers, supplemented with implementation-specific, hand-generated tests. Extensive functional coverage analysis was performed to grade and direct the verification effort. The success of the verification effort was underscored by first prototype chips which were used to boot multiple operating systems across several different prototype systems.
TL;DR: It is found that the memory system---which has long been known to dominate network throughput---is also a key factor in protocol latency, and improving instruction cache effectiveness can greatly reduce protocol processing overheads.
Abstract: This paper describes several techniques designed to improve protocol latency, and reports on their effectiveness when measured on a modern RISC machine employing the DEC Alpha processor. We found that the memory system---which has long been known to dominate network throughput---is also a key factor in protocol latency. As a result, improving instruction cache effectiveness can greatly reduce protocol processing overheads. An important metric in this context is the memory cycles per instruction (mCPI), which is the average number of cycles that an instruction stalls waiting for a memory access to complete. The techniques presented in this paper reduce the mCPI by a factor of 1.35 to 5.8. In analyzing the effectiveness of the techniques, we also present a detailed study of the protocol processing behavior of two protocol stacks---TCP/IP and RPC---on a modern RISC processor.
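The mCPI definition above can be illustrated with a short calculation. This is only a sketch: the function names `mcpi` and `reduction_factor` and the cycle and instruction counts are hypothetical and are not taken from the paper.

```python
def mcpi(memory_stall_cycles: int, instructions: int) -> float:
    """Average number of cycles an instruction stalls waiting on memory."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return memory_stall_cycles / instructions


def reduction_factor(mcpi_before: float, mcpi_after: float) -> float:
    """Factor by which an optimization reduces the mCPI."""
    if mcpi_after <= 0:
        raise ValueError("mcpi_after must be positive")
    return mcpi_before / mcpi_after


# Hypothetical counts, chosen for illustration only (not measured data).
before = mcpi(memory_stall_cycles=2_000_000, instructions=1_000_000)
after = mcpi(memory_stall_cycles=800_000, instructions=1_000_000)
factor = reduction_factor(before, after)
```

With these made-up counts the factor falls inside the 1.35 to 5.8 range quoted in the abstract, but that is by construction and says nothing about any particular technique.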
TL;DR: A technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols using a software counter to compute the number of instructions between nondeterministic events in normal operation is presented.
Abstract: We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
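The idea of logging instruction counts between nondeterministic events and replaying the events at the same execution points can be sketched with a toy simulation. Everything here is an assumption made for illustration: the "program" is a simple deterministic loop, asynchronous events are drawn from a seeded random generator, and the names `run`, `normal_run`, and `replay` are hypothetical. It does not reproduce the paper's software instruction counter on the Alpha.

```python
import random


def run(n_instructions, deliver):
    """Execute a toy deterministic program for n_instructions steps.

    deliver(count) returns an event value to handle just before
    instruction number `count`, or None if no event arrives there.
    """
    state = 0
    for count in range(n_instructions):
        event = deliver(count)
        if event is not None:
            state = state * 31 + event       # toy event handler
        state = (state + count) % 1_000_003  # one toy "instruction"
    return state


def normal_run(n_instructions, seed):
    """Run with asynchronous events, logging (gap, event) pairs.

    `gap` is the number of instructions executed since the previous event.
    """
    rng = random.Random(seed)
    log = []
    last = 0

    def deliver(count):
        nonlocal last
        if rng.random() < 0.01:              # an event arrives at this point
            event = rng.randrange(1000)
            log.append((count - last, event))
            last = count
            return event
        return None

    return run(n_instructions, deliver), log


def replay(n_instructions, log):
    """Re-execute, forcing each logged event at its original execution point."""
    schedule = {}
    position = 0
    for gap, event in log:
        position += gap
        schedule[position] = event
    return run(n_instructions, schedule.get)


final_state, event_log = normal_run(5000, seed=1)
recovered_state = replay(5000, event_log)
```

In this sketch `recovered_state` equals `final_state` because the only nondeterminism, the arrival point and value of each event, is fully captured by the log.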
TL;DR: Basic architecture; instruction formats; instruction descriptions; system architecture and programming implications; common PALcode architecture; console subsystem overview; input/output overview; DEC OSF/1 PALcode instruction descriptions; DEC OSF/1 memory management; exceptions, interrupts and machine checks.
Abstract: Basic architecture; instruction formats; instruction descriptions; system architecture and programming implications; common PALcode architecture; console subsystem overview; input/output overview; DEC OSF/1 PALcode instruction descriptions; DEC OSF/1 memory management; DEC OSF/1 process structure; DEC OSF/1 exceptions and interrupts; processor, process and thread structures and registers; memory management; exceptions, interrupts and machine checks; Windows NT AXP PALcode instruction descriptions; initialization and firmware transitions; console interface to operating system software; system bootstrapping.