TL;DR: This article discusses the design of the Slot 2 Single Edge Contact (S.E.C.) cartridge and highlights a method to enhance the L2 cache power handling capability by changing the L2 cache layouts and using conductive coupling to the thermal plate.
Abstract: The Pentium® II Xeon processor utilizes Intel-developed Slot 2 Single Edge Contact (S.E.C.) cartridge packaging technology. The thermal design of the S.E.C. cartridge poses challenges because it must meet the thermal performance requirements of multiple packages within one enclosure. The complexity arises from the need to accommodate different core packages and varying numbers and locations of L2 cache packages on the processor substrate. The paper illustrates the substrate design, thermal plate design, thermal bus bar design, and thermal interface material selection for different core packages. It highlights a method to enhance the L2 cache power handling capability by changing the L2 cache layouts and using conductive coupling to the thermal plate. The Slot 2 S.E.C. cartridge design can be extended to all future products within this technology envelope.
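The abstract does not spell out a thermal model. As a rough sketch, assume each package (core or L2 cache) sheds heat through a one-dimensional chain of thermal resistances in series, for example package internals, the gap or interface material, and the thermal plate to air. Under that assumption the junction temperature is T_j = T_ambient + P * sum(R_i), and the power a package can handle is its temperature budget divided by the total resistance. The function names and every numeric value below are hypothetical, chosen only to show why conductively coupling a cache package to the thermal plate (replacing a high-resistance air gap with an interface material) raises its power handling limit.

```python
# Sketch: 1-D series thermal-resistance model (an assumption, not the paper's model).
# All numeric values are hypothetical and for illustration only.


def junction_temperature(power_w, ambient_c, resistances_c_per_w):
    """T_j = T_ambient + P * sum(R_i) for thermal resistances in series (deg C/W)."""
    return ambient_c + power_w * sum(resistances_c_per_w)


def max_power(tj_limit_c, ambient_c, resistances_c_per_w):
    """Largest power (W) that keeps T_j at or below tj_limit_c."""
    total_r = sum(resistances_c_per_w)
    if total_r <= 0:
        raise ValueError("total thermal resistance must be positive")
    return (tj_limit_c - ambient_c) / total_r


if __name__ == "__main__":
    # Hypothetical L2 cache package path: [package internal, gap/interface, plate to air]
    air_gap = [2.0, 9.5, 0.5]   # package not conductively coupled to the thermal plate
    coupled = [2.0, 0.5, 0.5]   # interface material couples package to the plate
    for name, path in (("air gap", air_gap), ("coupled", coupled)):
        print(name, max_power(105.0, 45.0, path), "W")
```

With these illustrative numbers, coupling the package to the plate lowers the total path resistance from 12 to 3 °C/W and raises the allowable power fourfold. A real cartridge also has heat spreading and parallel paths that this one-dimensional model ignores.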
TL;DR: The design of a distributed system for Concise Cross Correlation (CCC), a previously unparallelized application, is presented, along with experiences implementing a master-slave distributed version of CCC using MPI.
Abstract: Recently, the GEMINI Holographic Particle Image Velocimetry (HPIV) system developed in the Laser Flow Diagnostics (LFD) lab at Kansas State University has been successfully applied to volumetric 3D flow velocity measurement. Because of its 3D nature, this application imposes very large computation and communication requirements. An innovative algorithm, Concise Cross Correlation (CCC), is employed in the system to extract the velocity field from the hologram of the test flows. With CCC we achieved a compression ratio of 10^4 and a processing speed 1000 times faster than with traditional 3D FFT-based correlation. To further accelerate processing for fully time- and space-resolved measurement, parallel processing is necessary. We present our design for a distributed system supporting this previously unparallelized application, and comment on our experiences implementing a master-slave distributed version of CCC utilizing MPI. Brief experimental results on Gigabit Ethernet and multiprocessor Pentium Xeon systems are given.
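The abstract does not describe how CCC works internally, so the sketch below is not CCC. It illustrates the general idea behind correlation-based PIV under a simplifying assumption: each frame is reduced to a sparse list of integer 3D particle positions. For binary particle fields, the cross-correlation at shift s equals the number of particle pairs (p, q) with q - p = s, so the displacement estimate is the most frequently observed pair difference inside a search window. The function name `estimate_displacement` and the `max_shift` parameter are hypothetical.

```python
from collections import Counter

# Illustrative sparse cross-correlation for one interrogation volume (not CCC).
# Frames are lists of integer (x, y, z) particle positions; this representation
# is an assumption made for the sketch.


def estimate_displacement(frame_a, frame_b, max_shift):
    """Return the integer 3D shift s that maximizes the cross-correlation of two
    binary particle fields, i.e. the most common difference q - p over pairs
    (p in frame_a, q in frame_b) with every |s_i| <= max_shift.
    Returns None if no pair falls inside the search window."""
    votes = Counter()
    for p in frame_a:
        for q in frame_b:
            s = tuple(qi - pi for pi, qi in zip(p, q))
            if all(abs(c) <= max_shift for c in s):
                votes[s] += 1  # one correlation "hit" at shift s
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    a = [(1, 2, 3), (4, 4, 4), (7, 1, 0), (2, 8, 5)]
    b = [(x + 1, y, z - 1) for x, y, z in a]  # every particle moved by (1, 0, -1)
    print(estimate_displacement(a, b, max_shift=2))
```

One reason a master-slave layout suits this kind of workload is that interrogation volumes can be processed independently: a master process could hand volumes to workers (over MPI, in the system described above) and gather the per-volume displacement vectors. The brute-force pair loop costs O(|A| * |B|) per volume and is written for clarity, not speed.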
TL;DR: This paper examines four commercial DBMSs running on an Intel Xeon under NT 4.0, introduces a framework for analyzing query execution time, and finds that database developers should not expect overall execution time to decrease significantly without addressing stalls related to subtle implementation issues.
Abstract: Recent high-performance processors employ sophisticated techniques to overlap and simultaneously execute multiple computation and memory operations. Intuitively, these techniques should help database applications, which are becoming increasingly compute and memory bound. Unfortunately, recent studies report that faster processors do not improve database system performance to the same extent as scientific workloads. Recent work on database systems focusing on minimizing memory latencies, such as cache-conscious algorithms for sorting and data placement, is one step toward addressing this problem. However, to best design high performance DBMSs we must carefully evaluate and understand the processor and memory behavior of commercial DBMSs on today’s hardware platforms. In this paper we answer the question “Where does time go when a database system is executed on a modern computer platform?” We examine four commercial DBMSs running on an Intel Xeon and NT 4.0. We introduce a framework for analyzing query execution time on a DBMS running on a server with a modern processor and memory architecture. To focus on processor and memory interactions and exclude effects from the I/O subsystem, we use a memory resident database. Using simple queries we find that database developers should (a) optimize data placement for the second level of data cache, and not the first, (b) optimize instruction placement to reduce first-level instruction cache stalls, but (c) not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues (e.g., branch prediction).
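One simple way to reason about a query-time analysis framework like the one described is stall accounting. The sketch below assumes query time is computation time plus three kinds of stall time (memory hierarchy, branch misprediction, and resource stalls) minus the stall time the processor hides by overlapping work. The class `QueryTime`, its fields, and all numbers are hypothetical; the point is the arithmetic behind finding (c): in this model, halving one stall component shrinks the total only by the amount removed, so optimizing a single cache level may not cut overall execution time much.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryTime:
    """Hypothetical stall-accounting breakdown of one query's execution time."""
    computation: float      # useful work
    memory_stalls: float    # stalls from data and instruction cache misses
    branch_stalls: float    # branch misprediction penalties
    resource_stalls: float  # functional-unit and dependency stalls
    overlap: float          # stall time hidden by overlapping execution

    def total(self) -> float:
        """Net execution time: sum of components minus overlapped time."""
        return (self.computation + self.memory_stalls + self.branch_stalls
                + self.resource_stalls - self.overlap)

    def share(self, component: str) -> float:
        """Fraction of the gross (pre-overlap) time spent in one component."""
        gross = self.total() + self.overlap
        return getattr(self, component) / gross


if __name__ == "__main__":
    # Made-up numbers in arbitrary time units, for illustration only.
    q = QueryTime(computation=40.0, memory_stalls=40.0, branch_stalls=10.0,
                  resource_stalls=20.0, overlap=10.0)
    halved = replace(q, memory_stalls=q.memory_stalls / 2)
    print(q.total(), halved.total())
    print(round(q.share("memory_stalls"), 3))
```

Using a frozen dataclass with `dataclasses.replace` keeps each "what if" scenario as a separate value, which makes it easy to compare the baseline against a hypothetical optimization side by side.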