TL;DR: This paper provides an overview of the TBB task scheduler and discusses three manual optimizations that users can make to improve its performance: continuation passing, scheduler bypass, and task recycling. It also describes the TBB scalable memory allocator and compares its performance with that of several commercial and non-commercial allocators.
Abstract: This paper describes two features of Intel Threading Building Blocks (Intel TBB) [1] that provide the foundation for its robust performance: a work-stealing task scheduler and a scalable memory allocator. Work-stealing task schedulers efficiently balance load while maintaining the natural data locality found in many applications. The Intel TBB task scheduler is available to users directly through an API and is also used in the implementation of the algorithms included in the library. In this paper, we provide an overview of the TBB task scheduler and discuss three manual optimizations that users can make to improve its performance: continuation passing, scheduler bypass, and task recycling. In the Experimental Results section of this paper, we provide performance results for several benchmarks that demonstrate the potential scalability of applications threaded with TBB, as well as the positive impact of these manual optimizations on the performance of fine-grain tasks. The task scheduler is complemented by the Intel TBB scalable memory allocator. Memory allocation can often be a limiting bottleneck in parallel applications. Using the TBB scalable memory allocator eliminates this bottleneck and also improves cache behavior. We discuss details of the design and implementation of the TBB scalable allocator and evaluate its performance relative to several commercial and non-commercial allocators, showing that the TBB allocator is competitive with these other allocators. INTRODUCTION Performance-oriented developers now face the daunting task of threading their applications. Introducing parallelism into an application is a large investment. It is therefore imperative to implement a scalable solution, one that continues to increase performance, as the number of available cores and threads increases. Intel TBB is a C++ template library that is designed to assist developers in porting their applications to multicore platforms. 
The TBB library provides generic parallel algorithms [18] and concurrent containers [19] that enable users to write parallel programs without directly creating and managing threads. These algorithms are tested and tuned for the current generation of multi-core processors, and they are designed to scale as the core count continues to increase. To provide efficient performance today and continued scalability tomorrow, the library is designed to support fine-grain parallelism through tasks. Tasks are user-level objects that are scheduled for execution by the TBB task scheduler. The task scheduler maintains a pool of native threads and a set of per-thread ready pools of tasks. At initialization, the TBB scheduler creates an appropriate number of threads in the pool (by default, 1 per hardware thread) and maintains the ready pools using a randomized work-stealing algorithm [2, 3]. In this paper, we describe the design of the TBB task scheduler and several scheduling optimizations users can keep in mind while coding their applications. In the Results section, we explore the scalability of TBB applications and highlight the impact of these scheduling optimizations on performance. The task scheduler is complemented by the Intel TBB scalable memory allocator. In this paper, we provide an overview of its design and look at the tradeoffs. We compare its performance to several other commercial and non-commercial allocators.
RELATED WORK
The Intel TBB task scheduler is inspired by the early Cilk scheduler [2, 3]. Cilk is a parallel extension of the C programming language that defines additional keywords and constructs. The Cilk project was a descendant of the Parallel Continuation Machine (PCM)/Threaded-C [13]. Both Cilk and Intel TBB schedule lightweight tasks onto user threads.
[Intel Technology Journal, Volume 11, Issue 4, 2007: The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks]
The Chare Kernel [14] is a portable set of functions that allows users to express parallelism in terms of small tasks (chares) with the runtime transparently managing resources. Unlike Intel TBB and Cilk, however, the Chare Kernel is targeted toward message passing machines. Mainstream languages, such as those supported by the .NET CLR also recognize the need for thread pools, where users can submit tasks without the need to explicitly manage threads [15]. However, in the .NET CLR these thread pools are targeted at general-purpose applications and are not tuned for compute-intensive applications. The McRT research program at Intel presented a software prototype of an integrated runtime library for large-scale chip-level multiprocessing (CMP) platforms [17], including a highly configurable, user-level scheduler. It can be used to realize a variety of co-operative scheduling strategies, including work stealing. The design of the Intel TBB scalable allocator is based on contemporary research in scalable memory allocation [8, 9] and utilizes best-known design solutions; it has common roots with Hoard [8], LFMalloc, Vam [10], Streamflow [11] and other state-of-the-art concurrent and sequential allocators. The TBB scalable allocator is a productization of the scalable memory allocator developed as part of the McRT research program [7, 17]. THE TBB TASK SCHEDULER The Intel TBB task scheduler is a work-stealing scheduler. The design of the TBB scheduler is inspired by the early Cilk scheduler, which Blumofe and Leiserson [2, 3] proved has optimal space, time, and communication bounds for well-structured (“fully strict”) programs. In a system that uses work-stealing, each thread maintains a local pool of tasks that are ready to run. Using local pools avoids the contention that may arise with the use of a global task queue. When executed, a task performs work and also may create additional tasks that are placed in the local pool. 
If a thread’s pool becomes empty, it attempts to steal a task from another random thread’s pool. This approach is in contrast to static scheduling methods, where threads are assigned work up-front, and to other dynamic scheduling methods, where a central pool of tasks (or iterations) is maintained. Blumofe and Leiserson [2, 3] showed that the expected parallel runtime on $P$ processors of applications scheduled by the Cilk scheduler is $E[T_P] = O(T_1/P + T_\infty)$, where $T_1$ is the “work” or sequential time of the application, and $T_\infty$ is the critical path length. This optimal bound shows that as $P \to \infty$, the expected time is only limited by the critical path length (the sequential part) of the application. To achieve these same optimal bounds, the TBB task scheduler also uses a randomized work-stealing algorithm. An overview of its implementation is provided in the following section.
An Overview of the Task Scheduler Design
The TBB task scheduler evaluates task graphs. A task graph is a directed graph where nodes are tasks, and each node points to its parent, which is another task that is waiting on it to complete, or NULL. Each task has a refcount that counts the number of tasks that have it as their parent. Each task also has a depth, which is usually one more than the depth of its parent. The work of the task is performed by a user-defined function execute that is encapsulated within the task object. To assist in providing an overview of the Intel TBB task scheduler, we use calculation of the nth Fibonacci number as a running example. A serial implementation of our Fibonacci example is shown below:
long SerialFib( long n ) {
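For readers following along, here is a minimal serial sketch in C++. It keeps the SerialFib signature from the text; the body is an assumption based on the usual recurrence F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1, and is not reproduced from the original listing.

```cpp
// Serial Fibonacci sketch. The signature follows the text; the body is an
// assumed, conventional recursive definition (F(0) = 0, F(1) = 1).
long SerialFib(long n) {
    if (n < 2) {
        return n;  // base cases: F(0) = 0, F(1) = 1
    }
    // Two independent recursive calls; in a task-based version these are
    // the natural candidates to become child tasks.
    return SerialFib(n - 1) + SerialFib(n - 2);
}
```

With this definition, SerialFib(10) evaluates to 55. The two recursive calls do very little work each, which is presumably why the example is useful for studying fine-grain task overhead.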
TL;DR: The evolution in packaging technology with each processor generation to meet increasing memory bandwidth needs and the revolution in package technology required for tera-scale computing needs are described.
Abstract: Tera-scale computing stresses the platform architecture with memory bandwidth being a likely bottleneck to processor performance that presents unique challenges to CPU packaging. This paper describes the evolution in packaging technology with each processor generation to meet increasing memory bandwidth needs and the revolution in package technology required for tera-scale computing needs. The scope and focus of the paper are primarily design and electrical performance challenges. We discuss a potential roadmap of transitions in package architecture and technology that evolves from today’s off-package memory scenario to increasingly complex on-package integrated memory architectures. An overall treatment of memory hierarchy, including off-die memory approaches, is not within the scope of this paper, but relevant to the overall challenge of enabling higher bandwidth. Again, the focus of this paper is on the CPU package itself. In this context, we discuss the memory bandwidth limitations, technology challenges, and tradeoffs of each package architecture.
INTRODUCTION
With a potential transition to tera-scale computing with multi- and many-core microprocessors and integrated memory controllers on the CPU, memory bandwidth becomes a bottleneck to processor performance [1]. This presents unique challenges to CPU packaging. Previous memory bandwidth requirements have scaled steadily, but fairly slowly, from one microprocessor generation to the next. This has driven a fairly steady but slow increase in pin count growth for chipset packages, which have traditionally provided the link to system memory between the microprocessor and memory modules. With a transition to multi- and many-core architectures, however, there is a large increase in the memory bandwidth requirement. This transition occurs at the same time as a shift to an integrated memory controller architecture for the CPU.
These fairly simultaneous architecture transitions result in a tremendous burden on CPU packaging requirements, driving pin count growth and driving up routing density due to the large increase in interconnects that must be routed from the CPU through the package to off-package memory modules. In this paper we describe the evolution in packaging technology with each processor generation to meet increasing memory bandwidth needs. We focus on the revolution in package technology required for tera-scale computing needs. The scope and focus of this paper are primarily design and electrical performance challenges. We propose a roadmap of transitions in package architecture and technology that evolves from today’s off-package memory to increasingly complex on-package integrated memory architectures. We discuss the memory bandwidth limitations, technology challenges, and tradeoffs of each package architecture. In the first section of this paper we look at memory bandwidth fundamentals. Next, we review the past trends in memory bandwidth requirements and the package technology impact. We follow this with sections describing the memory bandwidth needs for tera-scale computing and the resulting package technology impact and response.
[Intel Technology Journal, Volume 11, Issue 3, 2007: Package Technology to Address the Memory Bandwidth Challenge for Tera-scale Computing]
MEMORY BANDWIDTH FUNDAMENTALS
It is useful to review several fundamental concepts as an introduction to the topic of memory bandwidth. First, it is important to understand the definition of memory bandwidth, the key elements related to bandwidth, and the role that the package interconnect plays. Very basically, memory bandwidth is defined as the product of the number of data bits in the memory bus and the speed of a single bit in the bus. This can be expressed as
$\mathrm{BW} = (\text{number of bits}) \times (\text{bit rate})$    (1)
For example, if a memory bus is 8 bits wide (or 1 byte wide) and each bit transmits data at 1Gb/s (gigabits per second), then the memory bandwidth is 1 byte (1B) x 1Gb/s, or 1GB/s. A more realistic example is that of a typical DDR2 bus that is 16 bytes (128 bits) wide and operating at 800Mb/s. The memory bandwidth of that bus is 16 bytes x 800Mb/s, which is 12.8GB/s. Besides the actual memory bandwidth, other key elements of memory bandwidth are latency and capacity. Latency is the roundtrip time that it takes to receive a response after a request has been sent. Latency is typically measured in nanoseconds (ns). Capacity refers to the size of the memory and is typically measured in MBs. The memory subsystem hierarchy of a computer architecture consists of many levels. Memory can be located at the chip level, the package level, the board level, and in separate devices off the board (such as the hard disk). There is a tradeoff among the types and the key elements of memory (bandwidth, latency, and capacity) depending upon the location in the memory subsystem hierarchy. Very simply, faster, lower capacity memory is typically located on-chip, while slower, higher capacity memory is located off-chip. On-chip memory usually uses Static Random Access Memory (SRAM) technology, which is fast but expensive, and it is low-density compared to other memory technologies. On-chip memory usually serves as a cache and can be further divided into levels of cache, e.g., L1 cache, L2 cache, etc. [2]. Off-chip memory typically uses Dynamic Random Access Memory (DRAM) technology, which is slower but cheaper, and it is higher-density than SRAM. Off-chip memory located on the system board serves as the main memory for the computer system. Today’s typical computer architecture consists of the microprocessor (CPU), the chipset, and the main memory. Busses connect the various components of the system.
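The two worked examples for Eq. (1) can be checked with a few lines of C++. This is only an illustrative sketch: the function and parameter names are hypothetical, and it assumes decimal units (1 GB = 10^9 bytes), which is what the 12.8GB/s figure in the text implies.

```cpp
#include <cstdint>

// Memory bandwidth in bytes per second, following Eq. (1):
// BW = (number of data bits in the bus) x (bit rate of a single bit).
// bus_width_bits and bit_rate_bps are hypothetical parameter names.
double BandwidthBytesPerSec(std::uint32_t bus_width_bits, double bit_rate_bps) {
    const double bits_per_sec = static_cast<double>(bus_width_bits) * bit_rate_bps;
    return bits_per_sec / 8.0;  // convert bits to bytes
}
```

Calling BandwidthBytesPerSec(8, 1e9) gives 1e9 bytes/s (1GB/s), and BandwidthBytesPerSec(128, 800e6) gives 1.28e10 bytes/s (12.8GB/s), matching the examples above.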
Figure 1 illustrates a typical system architecture consisting of a microprocessor connected to a chipset through the system bus. The chipset in this example is divided into a Memory Controller Hub (MCH) and a separate Graphics Processing Unit (GPU). Each has a memory bus connecting to on-board memory. The system bus connects the CPU to the on-board, main system memory.
[Figure 1; recoverable labels: System Memory, System Bus (FSB), CPU]
TL;DR: This paper describes how Ct is designed for minimal effort by the developer, while providing forward scaling on multi-core IA, and describes how a sampling of key application spaces can be easily written using Ct to achieve high performance.
Abstract: Developers face new challenges with multi-core software development. The first of these challenges is a significant productivity burden particular to parallel programming. A big contributor to this burden is the relative difficulty of tracking down data races, which manifest nondeterministically. The second challenge is parallelizing applications so that they effectively scale with new core counts and the inevitable enhancement and evolution of the instruction set. This is a new and subtle change to the benefit of backwards compatibility inherent in Intel Architecture (IA): performance may not scale forward with new micro-architectures and, in some cases, may regress. We assert that forward-scaling is an essential requirement for new programming models, tools, and methodologies intended for multi-core software development. We are implementing a programming model called the Ct API that leverages the strengths of data parallel programming to help address these challenges of multicore software development. In this paper we describe how Ct is designed for minimal effort by the developer, while providing forward scaling on multi-core IA. We describe how Ct’s design and implementation evolved from the initial prototype, based on co-traveler feedback, and we provide examples of how Ct can be used. We demonstrate how a sampling of key application spaces can be easily written using Ct to achieve high performance. Finally, we discuss how these ideas can be transitioned into mainstream software development tools.
TL;DR: The contractual and competitive aspects of creating investment and joint development programs, with the ultimate goal of improving the probability of success in delivering the right technology at the right time in high volume, are discussed.
Abstract: How do we decide to make strategic bets on multiple, sometimes competing technologies across a portfolio of technology options to maximize our potential for success? Ideally, we can minimize risk by investing in technologies that enable multiple competing technology options; however, not all critical capabilities fall into this category. Investment in orthogonal options must be judicious, as high-risk, high-reward, long lead-time developments will likely also be high cost. In some cases, these larger investments may enable the desired option or a competing option. As long as at least one technology option is available when needed, the investment is ultimately successful. Finally, there may be unique capabilities that may be under-funded, where a nominal investment can enable a technical linchpin. In this paper, we examine a method to make these strategic bets in the lithography supply chain. We start by looking at a system to assess technical and business risk for all components of the supply chain as they evolve over time. We discuss a methodology for identifying fellow travelers, including consortia, to create programs to establish a foundation of common technologies. We discuss the contractual and competitive aspects of creating investment and joint development programs, with the ultimate goal of improving our probability of success in delivering the right technology at the right time in high volume.
TL;DR: This paper optimizes and parallelizes a set of typical visual feature extraction applications in CBVIR. The underlying techniques are representative of those used in video-analysis applications and can be reused in other applications to improve their performance on multi-core systems.
Abstract: With the explosive increase in video data, automatic video management (search/retrieval) is becoming a mass market application, and Content-Based Video Information Retrieval (CBVIR) is one of the best solutions. Most CBVIR systems are based on low-level feature extractions guided by the MPEG-7 standard for high-level semantic concept indexing. It is well known that CBVIR is a very compute-intensive task, and the low-level visual feature extractions are the most time-consuming components in CBVIR. Nowadays, with the multi-core processor becoming mainstream, CBVIR can be accelerated by fully utilizing the computing power of available multi-core processors. In this paper, we optimize and parallelize a set of typical visual feature extraction applications in CBVIR. The underlying optimization and parallel techniques are representative of those used in video-analysis applications and can be further used in other applications to maximally improve their performance on multi-core systems. We conduct a detailed performance analysis of these parallel applications on a dual-socket, quad-core system. The analysis helps us identify possible causes of bottlenecks, and we suggest avenues for scalability improvement to make those applications more powerful in real-time performance.
TL;DR: The new profiling capabilities available in the Intel Performance Tuning Utility are introduced, including statistical call tree analysis based on stack sampling, profile-guided loop detection, and event-based sampling data access profiling.
Abstract: While multi-core processors are all around us, their effective use is made much easier with performance analysis tools that enable the developer to identify parallel execution opportunities and parallel execution bottlenecks. In this paper we introduce the new profiling capabilities available in the Intel Performance Tuning Utility. These include statistical call tree analysis based on stack sampling, profile-guided loop detection, and event-based sampling data access profiling. The coordinated use of these features allows the developer to achieve better multi-core application performance.
TL;DR: This paper presents several media-mining applications that require target architectures capable of delivering tera-scale computing, and presents several different parallel schemes and a general parallel video-mining framework to abstract various parallelisms.
Abstract: With the exponential increase in media data on personal computers and the Internet, it is critical for end users to efficiently manage metadata to find the information they are looking for. Media mining refers to a technique whereby a user can retrieve, organize, and manage media data. However, most media-mining applications are compute intensive, and they require tera-operations per second. This paper focuses on how tera-scale computing enables new usage models with media-mining techniques. Several representative media-mining usage examples are explored in detail. First, we look at how these new usage models are enabled by a different kind of parallelism. For maximum performance, we provide a general parallel framework to abstract various parallelisms. We also present a detailed architectural performance analysis of several representative workloads on a dual-socket, quad-core system and on a 32-core Chip Multiprocessor (CMP) simulator. The results indicate that these media-mining applications have no obvious limits on concurrency and are amenable to future large-scale, multi-core architectures. They can take full advantage of tera-scale computing power in the form of thread-level parallelism to meet users’ needs. Because the underlying techniques and fundamental algorithms in media mining are widely used in other applications, many of our findings are applicable to other emerging applications as well. INTRODUCTION Rapid advances in the hardware technology of media capture, storage, and computation power have contributed to an amazing growth in digital media content. As content generation and dissemination grows, extracting meaningful knowledge from large amounts of multimedia data becomes increasingly important. Media mining is a kind of technology that helps end users search, browse, and manage large amounts of multimedia data [1]. 
It yields a wide range of emerging applications with various mass-market segments, e.g., image/video retrieval, video summarization, scene understanding, visual surveillance, digital home entertainment, smart health care, etc. Most of these applications are very complicated and have real-time or even super-real-time processing demands, which require tera-scale computing power to make them usable. In this paper, we present several media-mining applications that require target architectures capable of delivering tera-scale computing. Our study shows that today’s single-core processor system performance is 10x–1000x slower for acceptable human interactions. To accelerate these compute-intensive applications, we exploit the inherent data and function parallelism of these workloads. Our experiments show that with proper parallelization, these workloads can scale well, achieving a speedup of up to 7.5x on a 2-socket, quad-core machine and a speedup of up to 30x on a 32-core CMP simulator. Intel Technology Journal, Volume 11, Issue 3, 2007 Media Mining—Emerging Tera-scale Computing Applications 240 This paper is organized as follows. First, we explore several media-mining usage models and their key techniques. Next, we present several different parallel schemes and a general parallel video-mining framework. Then, we show our performance analysis results of the parallelized workloads. MEDIA-MINING APPLICATIONS Media mining has a huge number of emerging applications with different usage models. We highlight three typical usage models developed at Intel. Media-Mining Usage Models • Sports video analysis: Broadcast sports videos are very popular on television. Using highlights detection, consumers can quickly retrieve specific video clips without having to browse through the whole video. Sports video analytics can be viewed from the perspective of an editor. 
Based on a predefined semantic intention, an editor combines certain multimedia content elements and their temporal layout to achieve the desired highlighted events. Hence, detecting highlighted events is similar to a reverse process of authoring. The system framework consists of three levels: low-level audio/visual feature extraction, mid-level semantic keywords generation, and high-level event detection [8]. To minimize the semantic gap between low-level features and high-level events, we use mid-level semantic “keywords” followed by a classifier to infer events of interest. Our sports video analysis system can work with a multitude of sports including soccer, hockey, badminton, tennis, and diving. Given a video in a specific domain with predefined semantic intentions, the system can extract the desired events and features and interpret a summarization output video in terms of high-level semantics. • Personal video editing: Home videos are increasingly popular as digital video cameras become more user friendly and portable. However, because home videos for the most part are shot by amateurs, shaking, blurring, under-exposure artifacts, and redundant content are always present. Therefore, the demand for an automated home video editing system [2] is high. Such a system has to be able to recognize how many people and how many scenes are involved, mine the relationship between various people and scenes, and synthesize a short artistic video clip from a long raw video. A typical personal video editing system includes three key modules: intelligent analysis, adaptive selection, and seamless composition. The first module extracts the multi-modal and multi-level audio-visual features; the second module selects the most interesting, important, and informative content; and the third module produces a near-professional story with incidental music. 
The overall automated home video editing system must be easily extended to the personal video recorder and digital home entertainment system.
• Personal video retrieval: A personal video retrieval system is a desktop application that works much like the Google desktop search to help end users manage more and more personal multimedia data from all kinds of mobile digital camera devices. In response to a user query, the personal video retrieval application finds the relevant video clips from a large video database drawn from sources such as movies, TV, sports games, and home videos. Generally, a retrieval system first extracts low-level audio/visual features from videos, and then detects semantic concepts (keywords) to represent the video content. Finally, a query engine returns retrieval results based on the user’s query and on a similarity model. The query can be text keywords, image examples, hand-drawn sketches, or short video clips, and the output is relevant video clips ranked not only by their content similarity to the query, but also by their importance, according to a concept-link relationship analysis. To gradually improve system performance during the query procedure, the system provides user-friendly relevance feedback and active learning modules.
Key Media-Mining Techniques
Although the above usage models are quite different from one another, the underlying technologies are common and can be extended to a broad range of media-mining applications. In this paper, four key techniques are extracted from previous usage models to show how media-mining applications are built.
• Sports keyword detection: The mid-level module generates semantic “keywords” from the previously described low-level extraction. Listed below are some keywords in sports video analysis. These keywords are used as input for high-level event detection. View type: Based on color histograms of each frame, we can obtain the dominant color to segment the playing field region.
We then classify each frame as a global view, medium view, close-up view, or out of view [5]. Play-field: A Hough transform from digital image processing is used to detect field boundaries and penalty box sections. Then a decision-tree-based classifier determines the play position according to the slope and position of the lines. Replay: In broadcast sports videos, to capture clues for significant events, there typically is a replay following an important event. At the beginning and end of each replay, there is generally a logo flying at high speed. We detect logos to identify replays by discovering repeated video segments through dynamic programming [6]. Audio keywords: There are two types of audio keywords, the commentator’s excited speech and the referee’s whistle; these have a strong correlation to key events in the game such as a foul, a goal, or player entanglements. A Gaussian Mixture Model (GMM) is used to detect keywords from low-level audio features including Mel-frequency cepstral coefficients (MFCC), energy, and pitch [7].
• Human detection and tracking: Human detection and tracking is a significant and challenging task in many application scenarios. Different from rigid objects, humans are articulated and jointed by several human-parts, which may lead to pose variance, self-occlusion, etc. In human detection, the first problem is to select the proper features to characterize human regions/parts: Haar wavelets [3] and orientation histograms are mostly used to do this. The second problem with human detection is to use a discriminator to determine whether there are humans and where they are if they are present. The Boosting learning-based detector is preferred [3]. It is an aggressive learning algorithm that produces a strong classifier by choosing features in a family of simple classifiers and combining them linearly.
Then a cascaded structure is introduced in order to quickly reject the background regions. Human tracking is essentially finding body regions or parts that correspond across successive frames by using data association and occlusion inference techniques.
• Face detection and tracking: Face detection and face tracking are important technologies and prerequisites for many person-analysis applications, such as face recognition/identificat
TL;DR: This paper presents the Intel Math Kernel Library (MKL), discusses methods used to improve its performance, which largely focus on cache utilization and minimization of translation lookaside buffer (TLB) misses, and describes how the ease of use of the library will be expanded to provide more flexibility without greatly expanding its size.
Abstract: In this paper we present the Intel Math Kernel Library (MKL) as a mathematical software package for scientific and technical computation designed for ease of use in environments that can vary greatly. Ease of use includes the build environment (use with different compilers), optimal performance on multiple platforms (automated selection of code based on the end-user system), optimal performance (optimization of an algorithm), interfaces to other libraries (FFTW), and effective use of multi-core processors through parallelization. We also discuss how this concept of ease of use will be expanded to provide more flexibility in the use of the library without greatly expanding its size. Much of the paper is devoted to the optimization and parallelization of the library, critical in this era of multi-core processors. We discuss some of the methods used to improve performance that largely focus on cache utilization and minimization of translation lookaside buffer (TLB) misses. Specifically, we look at the parallel performance of Basic Linear Algebra Subroutines [3] (BLAS), LAPACK [1], the Vector Math Library (VML), and a sparse linear solver (PARDISO). We include a brief section on a second application library, Integrated Performance Primitives (IPP), which complements the MKL in media applications.
INTRODUCTION
The Intel Math Kernel Library (MKL) is a math library for use in scientific and engineering applications supporting a number of different mathematical areas: Linear algebra. Basic Linear Algebra Subroutines (BLAS), LAPACK, ScaLAPACK, sparse BLAS, iterative sparse solvers, preconditioners, direct sparse solver (PARDISO) Signal processing. FFTs, cluster FFTs Vector math. Vector Math Library Statistics. Vector Statistics Library with random number generators PDEs. Poisson, Helmholtz solvers, trigonometric transforms Optimization. Trust region solvers Other.
Interval linear solvers, multi-precision integer arithmetic Among the key guidelines for the development of the library are using optimized math software for computationally demanding algorithms; threading and parallelizing these algorithms to make full use of multiprocessor, multi-core [2], and multi-computer systems, making the library easy to use, and maintaining a high quality. Our focus in this paper is mostly on performance but we also introduce the paper with a discussion on ease of use. A number of the features of the library do not relate to math functionality but contribute to ease of use. Some of these are: • Designing the library to be compiler-independent eliminates the need for compiler-specific versions and allows C language programs to link to the Fortran portions of the library without the usual Fortran runtime libraries. Perhaps it is more correct to state that all compiler dependencies have been isolated (as will be explained in the discussion of the layer model of the library). Intel Technology Journal, Volume 11, Issue 4, 2007 Intel Performance Libraries: Multi-Core-Ready Software for Numeric-Intensive Computation 300 • Providing competitive performance on non-Intel processors so software vendors can use a single library in their products for Intel architecture computers. • Parallelizing those parts of the library where parallelization makes sense. Most of the library functions could be parallelized but would not improve in performance if parallelized. Most of this paper deals with parallel performance on multi-core processors. • Using interface files to map FFTW to MKL FFTs, other files to map older MKL FFTs to the more recent FFTs as well as using Java interface examples for various parts of the library. To further enhance usability, future versions of MKL will introduce a “layer model” (see Figure 1). This version will have four layers: interface, threading, computational, and run-time, or compiler-specific, library layer. 
The first layer already exists for the 32-bit Windows* version but will become ubiquitous in the library. This layer allows MKL to accommodate different interfaces, including, for instance, gfortran. This and some other Fortran compilers handle complex return values differently than the Intel compiler does for Intel 64 architecture-based processors on Linux*. This difference can be dealt with through an interface file without duplicating the rest of the library. Similarly, the basic library for a 64-bit operating system (OS) will use 64-bit integers going forward, but LP64 (32-bit integers on a 64-bit OS) will be accommodated with a layer.
An area that has been problematic, and will become more difficult going forward, is the intermingling of user-threaded code with MKL when the user’s program is compiled with a non-Intel compiler. The second layer deals with this mismatch. All MKL threading is function based, so the threaded portion will be compiled with different compilers (Intel and gfortran, for instance) and provided as a layer. By turning threading off during compilation of the threaded software, a non-threaded layer will create a sequential version of the library. By linking in the appropriate threaded layer, multiple threading environments will be supported, including a sequential version of the library, with just a small increase in the size of the package.
The third layer is the computational layer. This layer does all the computations and includes processor-specific code that is chosen at run time. The fourth layer contains support files such as libguide, the threading library for Intel compilers, and the BLACS, which are specific to compilers and message passing interface (MPI) versions.
Figure 1: Layer model for MKL
In the rest of this paper we focus on performance for multi-core processors.
Fortunately, many of the methods needed to achieve scaling with multi-core processor systems are similar to those used in shared memory parallel systems, at least for many of the functions of MKL. However, because of the shared caches of multi-core processors, there are additional opportunities for threading functions such as VML, as explained in one of the performance sections. We discuss parallelization and optimization for several different areas supported by the Intel libraries in this order: BLAS, LAPACK, sparse linear solvers, VML, and codecs from IPP. Other key functions such as FFTs are not discussed. Especially in the cases of the BLAS and LAPACK, the contribution of the MKL developers is to take extant code and optimize it, including parallelizing it where that makes sense.
The fundamental problem for much mathematical software is how to structure the problem in such a way that the caches can be effectively used. Before looking at these problems, it is useful to look at the problem from a data consumption versus data supply rate point of view. Consider the Intel Core™2 Duo processor, a dual-core processor running at 3.0 GHz, performing the dot product. If we assume that one vector can be kept in cache, at what rate must the memory system supply data to keep just one dual-core processor busy? Each core can do two double-precision multiplies per clock, so the dual-core processor does four multiplies per clock, requiring 32 bytes (8 bytes per double-precision word) per clock. At 3 GHz, this is 96 GB/s. For a dual-socket system (Woodcrest) the system must provide 192 GB/s to keep all four cores busy. On a Clovertown system the number of cores doubles again and the demand, at the same frequency, goes to 384 GB/s.
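The bandwidth figures quoted above follow from a short calculation, reproduced here as a minimal sketch. It assumes, as the text does, that one vector of the dot product stays in cache, so each multiply pulls one 8-byte double from memory. The function name and its parameters are illustrative and not part of MKL.

```python
BYTES_PER_DOUBLE = 8  # one double-precision word


def required_bandwidth_gb_s(cores, multiplies_per_clock_per_core=2, clock_ghz=3.0):
    """Memory bandwidth (GB/s) needed to feed a dot product on `cores` cores.

    Assumes one operand of every multiply is already in cache, so each
    multiply consumes exactly one double from memory.
    """
    bytes_per_clock = cores * multiplies_per_clock_per_core * BYTES_PER_DOUBLE
    # bytes per clock times 1e9 clocks per second per GHz gives GB/s
    return bytes_per_clock * clock_ghz


if __name__ == "__main__":
    systems = [
        ("one dual-core processor", 2),
        ("dual-socket Woodcrest", 4),
        ("dual-socket Clovertown", 8),
    ]
    for name, cores in systems:
        print(f"{name}: {required_bandwidth_gb_s(cores):.0f} GB/s")
```

The demand grows linearly with the core count at a fixed clock frequency, which is why cache reuse rather than raw memory bandwidth has to carry the load.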
TL;DR: The design and implementation of a new threadizer and vectorizer inside the Intel 10.1 compilers are outlined and an overview of the enhanced high-level loop optimizations and the low-level code generation used to obtain higher performance on platforms based on Intel Core 2 Duo and Quad processors are provided.
Abstract: The fast introduction of the Intel Core™2 Duo and Quad processors to the mass market has drawn attention to threadization (a.k.a. parallelization) and vectorization of the existing code in many application domains. In fact, multi-core processor vendors are eager to enable their users to exploit various levels of parallelism in order to harness the additional compute resources of multi-core processors. The Intel C++/Fortran compiler provides an essential tool for unleashing the power of Intel Core 2 Duo and Quad processors. This is accomplished by means of high-level loop optimizations and scalar optimizations to exploit multi-core processors and single-instruction-multiple-data (SIMD) instructions, combined with advanced code generation that is built on an intimate knowledge of micro-architectural performance aspects. In this paper we outline the design and implementation of a new threadizer and vectorizer inside the Intel 10.1 compilers, and we also provide an overview of the enhanced high-level loop optimizations and the low-level code generation used to obtain higher performance on platforms based on Intel Core 2 Duo and Quad processors. Significant performance gains are shown using the SPEC CPU2006 suite running on a system configured with two Intel quad-core processors.
INTRODUCTION
The aggressive delivery of Intel multi-core processors to the mass computer market shows that, as the performance improvements from continuously increasing clock frequencies start to taper off, other architectural advances that reduce latency or increase memory bandwidth are gaining importance [9]. In particular, since packaging densities are still growing, integrating multiple processors on a single die and using SIMD extensions are becoming more widespread [1].
The Intel Core 2 Duo and Quad processors are equipped with a rich set of microarchitectural and architectural features to boost performance:
• dual-core or quad-core on a single chip
• wider execution units for Streaming SIMD Extensions (SSE, SSE2, SSE3)
• a set of new instructions referred to as Supplemental Streaming SIMD Extensions 3 (SSSE3)
• advanced smart shared L2 cache among cores on the same chip
Due to the complexity of modern processors, compiler support has become an important part of obtaining higher performance. Most importantly, to assist programmers in leveraging all parallel capabilities of Intel’s new processors, the Intel C++/Fortran compiler provides an essential tool for unleashing the power of Intel multi-core processors and SIMD instructions by means of high-level optimizations and advanced code generation. The Intel compilers perform automatic optimizations of programs using threadization [10], vectorization [1, 2, 5], classical loop transformations (e.g., distribution, unrolling, interchange, fusion) [7, 11, 12], scalar optimizations such as constant propagation, Partial Dead Store Elimination (PDSE), Partial Redundancy Elimination (PRE), copy propagation, Inter-Procedural Optimizations (IPO) [7], and advanced machine code generation techniques that together yield a significant performance gain compared to the default level of optimization.
Intel Technology Journal, Volume 11, Issue 4, 2007 Inside the Intel 10.1 Compilers: New Threadizer and New Vectorizer for Intel Core™2 Processors 264
The contributions of the new threadizer and vectorizer are as follows:
• The new threadizer yields up to 4.63x speedup (with 8 cores) by exploiting thread-level parallelism from a serial program in the SPEC CPU2006 benchmark suites. Overall, the auto-threadization delivers a 15.45% gain (geomean with 8 cores) for SPEC CFP2006 suite and a 12.17% gain (geomean with 8 cores) for SPEC CINT2006 suite.
• The new vectorizer yields up to 1.28x performance speedup by exploiting SIMD-type vector parallelism from a serial program in the SPEC CPU2006 suites. Overall, the auto-vectorization delivers a 5.11% gain (geomean) for SPEC CFP2006 suite and a 2.01% gain (geomean) for SPEC CINT2006 suite.
The rest of this paper is organized as follows. First, we provide some basics on the Intel Core™ microarchitecture. Then, we discuss the design and implementation of the new threadizer and vectorizer, respectively, inside the Intel 10.1 compilers. Subsequently, we discuss the loop optimizations and enhancements made to support efficient threadization and vectorization. We also present an overview of advanced code generation for the Intel Core 2 Duo and Quad processors. Finally, we provide performance results using the SPEC CPU2006 industry-standard benchmark suite built with the Intel 10.1 C++ and Fortran compilers.
INTEL CORE™ MICROARCHITECTURE
Intel Core micro-architecture is the foundation for all new Intel architecture-based desktop, mobile, and server multi-core processors. This state-of-the-art multi-core processor with optimized micro-architecture delivers a number of innovative features that have set new standards for energy-efficient performance. In this section we outline a few innovations relevant to this paper. A more detailed description can be found in the Intel literature [4].
Figure 1: Quad-core processor schematic
Figure 1 shows a schematic of the Intel Core 2 Quad processor. Two independent cores with their own private L1 caches reside on a single die. Two shared Level 2 (L2) caches, referred to as the Intel Advanced Smart Cache, work by sharing the L2 cache between cores so that data are stored in one place accessible by the cores. Sharing the L2 cache enables a core to dynamically use up to 100% of the available L2 cache, thus optimizing cache resources.
The quad-core processor is equipped with Intel Smart Memory Access techniques that boost system performance by optimizing available data bandwidth from the memory subsystem and hiding the latency of memory accesses through two techniques: memory disambiguation and an instruction pointer-based prefetcher that fetches memory contents to the shared L2 cache and then into each private L1 cache before they are requested. The data prefetcher can detect strided memory access patterns to make accurate predictions about future load addresses.
Another key feature of Intel Core micro-architecture is the Intel Advanced Digital Media Boost that can issue 128-bit SSE instructions with a throughput of one per clock cycle. Previous-generation Intel processors had a sustained throughput of one instruction per two clock cycles, typically one cycle for the lower 64 bits followed by another cycle for the upper 64 bits. By widening execution units to the full 128 bits, the Intel processor effectively doubles the performance of a series of 128-bit SSE instructions relative to previous-generation Intel processors. In addition, the latency of various individual 128-bit SSE instructions has been reduced, and SSSE3 has been added to extend the instruction set. As a result, more overall performance improvements can be expected from vectorization (i.e., transforming sequential code into SIMD instructions).
REVAMPING THE THREADIZER
In this section, we present our new threadizer framework that is highly integrated with our classical high-level loop optimizations, and we describe its main components. The strengths of the new threadizer include the following:
• A new Abstract Thread Representation (ATR), based on the concept of virtual threads, is designed to bridge the semantic gap between high-level representation and physical (hardware or OS) threads.
• Better interaction with other high-level loop-related optimizations gives better performance.
• The new threadizer is moved downstream to take advantage of scalar optimizations such as global constant propagation and Static Single Assignment (SSA) PRE, and some loop optimizations.
• A table-driven cost model simplifies maintenance and future extensibility.
• Effective runtime threadization control and multiple schedule types such as static, dynamic, guided, and runtime are supported.
The threadizer in the Intel compiler serves as a single module that covers different languages (C++ and Fortran), architectures (IA-32, Intel 64, and IA-64), and operating systems (Microsoft Windows*, Linux*, and MacOS*).
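The schedule types named in the list above share their names with the OpenMP loop schedule kinds. The sketch below illustrates how a static and a guided schedule might divide loop iterations among threads, assuming OpenMP-like semantics. It is not the Intel threadizer's implementation, and the function names are hypothetical.

```python
import math


def static_chunks(n_iters, n_threads):
    """Static schedule: one contiguous, near-equal chunk per thread, fixed up front."""
    base, extra = divmod(n_iters, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # first `extra` threads get one more
        if size:
            chunks.append(range(start, start + size))
        start += size
    return chunks


def guided_chunks(n_iters, n_threads, min_chunk=1):
    """Guided schedule (assumed OpenMP-like): chunks shrink as work runs out.

    Each chunk covers roughly remaining / n_threads iterations, never fewer
    than `min_chunk` unless fewer iterations remain.
    """
    chunks, start = [], 0
    while start < n_iters:
        remaining = n_iters - start
        size = min(remaining, max(min_chunk, math.ceil(remaining / n_threads)))
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    print([len(c) for c in static_chunks(10, 4)])
    print([len(c) for c in guided_chunks(100, 4)])
```

A dynamic schedule would instead hand out fixed-size chunks to whichever thread becomes idle first. A runtime schedule, in OpenMP terms, defers the choice among these kinds until the program runs.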
TL;DR: By threading and tuning a typical multiple pattern matching algorithm, this work shows how to apply parallel principles during each phase of a generic development cycle while utilizing Intel Threading Analysis Tools to pinpoint the bottlenecks and thread-safety errors and to improve overall performance.
Abstract: While multi-core processors are designed for greater performance with optimal power consumption, the parallel algorithm design and software development that is needed to maximize the performance potential of multi-core systems are much more complicated than those associated with serial computing. Even though parallel computing has long been studied by researchers, there is no general framework to implement parallel programming for different software applications. Programmers face three immediate challenges when applying parallelism to software development: scalability, correctness, and maintainability. Applicable parallel methodology and new software development tools are greatly needed by programmers working in this environment. To keep software applications in sync with the multi-core processors that are becoming mainstream in the marketplace, Intel provides a whole set of threading software products. A multiple pattern matching algorithm is the core algorithm of the detection engine in the rule-based Intrusion Detection System (IDS). Most of the research on improving the performance of this algorithm is based on serial computing. In fact, the performance of the algorithm can be improved greatly through parallelization on multi-core systems. By threading and tuning a typical multiple pattern matching algorithm, we show how to apply parallel principles during each phase of a generic development cycle while utilizing Intel Threading Analysis Tools to pinpoint the bottlenecks and thread-safety errors and to improve overall performance. Moreover, we specifically compare the implementation method and performance gain of the Windows* Threading API against that of Intel Threading Building Blocks (Intel TBB) when implementing the parallel multiple pattern matching algorithm, and we present the experimental performance data.
INTRODUCTION
As multi-core processors become mainstream in the marketplace, software needs to be parallel to take advantage of multiple cores. However, there is no general framework available to implement parallel programming for different applications to achieve the highest performance gain. Generally implemented with multithreading, parallel programming is notoriously difficult for developers to design, implement, and debug. In order to make life easier for developers, Intel provides a set of threading tools targeting various phases of the development cycle. In a generic development cycle, program development can be divided into four phases [1]:
• Analysis phase: Profiling the serial version of the program to determine the areas that are suitable for parallel decomposition.
• Design/implementation phase: Examining identified threading candidates, determining the changes that have to be made to the serial version, and converting them to the actual code.
• Debug phase: Ensuring the correctness of the program. Detecting and solving common threading errors such as data races and deadlocks.
• Testing/tuning phase: Validating the correctness of the program and testing its performance. Detecting performance issues and fixing them by improved design or by eliminating bottlenecks.
Intel’s threading tools provide aids for developers from performance analysis to implementation and debugging:
• Intel VTune™ Performance Analyzer [4]. This tool helps developers tune an application to better perform on Intel architectures. It locates the performance bottlenecks and program hotspots by collecting, sampling, and displaying system-wide data down to specific functions, modules, or instructions. It is usually used during the analysis and tuning phases of the development cycle.
• Intel Thread Profiler [5]. This tool helps to identify performance bottlenecks in Win32* and OpenMP* threaded software. It detects threading performance issues such as thread overhead and synchronization cost. The profiler is usually used in the tuning phase.
• Intel Thread Checker [6]. This tool helps to find bugs in Win32 and OpenMP threaded software. It locates threading issues such as race conditions, thread stalls, and potential thread deadlocks. The Intel Thread Checker is usually used during the design and debugging phases.
• Intel Threading Building Blocks (Intel TBB) [7]. This is a threading abstraction library that provides high-level generic implementation of parallel patterns and concurrent data structures [2]. Intel TBB is usually used in the design, implementation, and tuning phases.
Intel Technology Journal, Volume 11, Issue 4, 2007 Parallel Software Development with Intel Threading Analysis Tools 288
In the sections that follow, we first introduce the principles of parallel application design; then we show how to parallelize an application with the help of threading tools during each phase of the development cycle. A multiple pattern matching algorithm is used as an example. We use the Win32 threading API and Intel TBB to implement the parallelism, and we compare the performance of the two.
PRINCIPLES OF PARALLEL APPLICATION DESIGN
Decomposition Techniques
Dividing a computation into smaller computations and assigning these to different processors for execution are two key steps in parallel design. Two of the most common decomposition techniques [3] are functional decomposition and data decomposition [2].
• Functional decomposition is used to introduce concurrency in problems that can be solved by different independent tasks. All these tasks can run concurrently.
• Data decomposition works best on an application that has a large data structure. By partitioning the data on which the computations are performed, a task is decomposed into smaller tasks to perform computations on each data partition. The tasks performed on the data partitions are usually similar.
There are different ways to perform data partitioning: partitioning input/output data or partitioning intermediate data.
Parallel Models
These are some of the commonly used parallel models [3].
• Data parallel model. This is one of the simplest parallel models. In this model, the same or similar computations are performed on different data repeatedly. Image processing algorithms that apply a filter to each pixel are a common example of data parallelism. OpenMP is an API that is based on compiler directives that can express a data parallel model.
• Task parallel model. In this model, independent units of work are encapsulated in functions to be mapped to individual threads, which execute asynchronously. Thread libraries (e.g., the Win32 thread API or POSIX* threads) are designed to express task-level concurrency.
• Hybrid models. Sometimes, more than one model may be applied to solve one problem, resulting in a hybrid algorithm model. A database is a good example of hybrid models. Tasks like inserting records, sorting, or indexing can be expressed in a task-parallel model, while a database query uses the data-parallel model to perform the same operation on different data.
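The data decomposition and data parallel model described above can be sketched in a few lines. This is a minimal sketch, not the code from the paper: it partitions the input data into chunks and applies the same per-pixel filter to each chunk on a thread pool from the Python standard library. The threshold filter and all function names are assumptions chosen for illustration. In CPython, threads will not speed up this CPU-bound loop; the point here is the structure of the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def threshold(pixel, cutoff=128):
    """Per-pixel filter: map a grayscale value to black (0) or white (255)."""
    return 255 if pixel >= cutoff else 0


def partition(data, n_parts):
    """Split the input data into at most n_parts contiguous chunks."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def filter_chunk(chunk):
    """The same computation is applied to every data partition."""
    return [threshold(p) for p in chunk]


def parallel_filter(pixels, n_workers=4):
    """Data-parallel filter: one task per partition, results kept in input order."""
    chunks = partition(pixels, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(filter_chunk, chunks))  # map preserves chunk order
    return [p for chunk in results for p in chunk]


if __name__ == "__main__":
    image = [0, 50, 128, 200, 255, 127, 129]
    print(parallel_filter(image, n_workers=3))
```

Because every partition runs the same function and no partition depends on another, the tasks need no synchronization beyond collecting the results, which is what makes this model one of the simplest to apply.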
TL;DR: This paper reports on a success story: threading the Intel C++ Compiler, which resulted in an average 2x speedup in compiling a range of CPU2000 benchmarks, and presents the methodology and tools that enabled it.
Abstract: Multi-core processors are now mainstream, while many-core architectures are arriving. Yet getting general-purpose software ready to take full advantage of the available hardware parallelism remains a challenge. There are, in fact, very few success stories of semi-automatic parallelization of large-scale integer applications outside the high-performance computing (HPC) and transaction processing domains. In this paper, we report on such a success story: threading the Intel C++ Compiler [3], which resulted in an average 2x speedup in compiling a range of CPU2000 benchmarks. We present the methodology and tools that enabled us to achieve this success. We believe our approach is generally applicable to threading a large class of applications.
TL;DR: In this article, the authors describe the role of risk transfer and mitigation in the overall risk management process, focusing on the identification, analysis, and control of hazard loss risks in semiconductor manufacturing environments.
Abstract: Risks to physical assets from hazard events are omnipresent. Hazard risks include perils such as fire, explosions, floods, windstorms, earthquakes, typhoons, etc. Intel has unique risks related to clean room environments: it runs an ultra-clean, pristine work environment with sensitive, specialized high-value equipment, producing high volumes of product at nano geometries with uncompromising quality. The Risk Management Process for hazard risk evaluation is a disciplined approach that consists of identification, control, transfer, and mitigation. In this paper we describe how this process is applied with exceptional results in establishing specific studied controls to address the hazards of fire, flood, and windstorm. The result is the mitigation of the consequences of these risks within semiconductor manufacturing facilities. Combined with the Business Continuity and Emergency Response programs addressed elsewhere in this issue of the Intel Technology Journal, this hazard identification and mitigation process reflects one of the crucial pieces of an integrated approach to managing operational risks. Challenges and solutions are discussed. While we define the role of risk transfer and mitigation in the overall Risk Management Process, the focus of this paper is on the identification, analysis, and control of hazard loss risks.