Intel Technology Journal (Intel Corp) | 99 Publications | 1008 Citations | Top authors

The Foundations for Scalable Multicore Software in Intel Threading Building Blocks

[...]

16 Nov 2007 - Intel Technology Journal

TL;DR: This paper gives an overview of the TBB task scheduler and discusses three manual optimizations that users can apply to improve its performance: continuation passing, scheduler bypass, and task recycling. It also evaluates the TBB scalable memory allocator against several commercial and non-commercial allocators.


Abstract: This paper describes two features of Intel Threading Building Blocks (Intel TBB) [1] that provide the foundation for its robust performance: a work-stealing task scheduler and a scalable memory allocator. Work-stealing task schedulers efficiently balance load while maintaining the natural data locality found in many applications. The Intel TBB task scheduler is available to users directly through an API and is also used in the implementation of the algorithms included in the library. In this paper, we provide an overview of the TBB task scheduler and discuss three manual optimizations that users can make to improve its performance: continuation passing, scheduler bypass, and task recycling. In the Experimental Results section of this paper, we provide performance results for several benchmarks that demonstrate the potential scalability of applications threaded with TBB, as well as the positive impact of these manual optimizations on the performance of fine-grain tasks. The task scheduler is complemented by the Intel TBB scalable memory allocator. Memory allocation can often be a limiting bottleneck in parallel applications. Using the TBB scalable memory allocator eliminates this bottleneck and also improves cache behavior. We discuss details of the design and implementation of the TBB scalable allocator and evaluate its performance relative to several commercial and non-commercial allocators, showing that the TBB allocator is competitive with these other allocators.

INTRODUCTION

Performance-oriented developers now face the daunting task of threading their applications. Introducing parallelism into an application is a large investment. It is therefore imperative to implement a scalable solution, one that continues to increase performance as the number of available cores and threads increases. Intel TBB is a C++ template library that is designed to assist developers in porting their applications to multicore platforms.
The TBB library provides generic parallel algorithms [18] and concurrent containers [19] that enable users to write parallel programs without directly creating and managing threads. These algorithms are tested and tuned for the current generation of multi-core processors, and they are designed to scale as the core count continues to increase. To provide efficient performance today and continued scalability tomorrow, the library is designed to support fine-grain parallelism through tasks. Tasks are user-level objects that are scheduled for execution by the TBB task scheduler. The task scheduler maintains a pool of native threads and a set of per-thread ready pools of tasks. At initialization, the TBB scheduler creates an appropriate number of threads in the pool (by default, 1 per hardware thread) and maintains the ready pools using a randomized work-stealing algorithm [2, 3].

In this paper, we describe the design of the TBB task scheduler and several scheduling optimizations users can keep in mind while coding their applications. In the Results section, we explore the scalability of TBB applications and highlight the impact of these scheduling optimizations on performance.

The task scheduler is complemented by the Intel TBB scalable memory allocator. In this paper, we provide an overview of its design and look at the tradeoffs. We compare its performance to several other commercial and non-commercial allocators.

RELATED WORK

The Intel TBB task scheduler is inspired by the early Cilk scheduler [2, 3]. Cilk is a parallel extension of the C programming language that defines additional keywords and constructs. The Cilk project was a descendant of the Parallel Continuation Machine (PCM)/Threaded-C [13]. Both Cilk and Intel TBB schedule lightweight tasks onto user threads.

Intel Technology Journal, Volume 11, Issue 4, 2007
The Chare Kernel [14] is a portable set of functions that allows users to express parallelism in terms of small tasks (chares) with the runtime transparently managing resources. Unlike Intel TBB and Cilk, however, the Chare Kernel is targeted toward message passing machines.

Mainstream languages, such as those supported by the .NET CLR, also recognize the need for thread pools, where users can submit tasks without the need to explicitly manage threads [15]. However, in the .NET CLR these thread pools are targeted at general-purpose applications and are not tuned for compute-intensive applications.

The McRT research program at Intel presented a software prototype of an integrated runtime library for large-scale chip-level multiprocessing (CMP) platforms [17], including a highly configurable, user-level scheduler. It can be used to realize a variety of co-operative scheduling strategies, including work stealing.

The design of the Intel TBB scalable allocator is based on contemporary research in scalable memory allocation [8, 9] and utilizes best-known design solutions; it has common roots with Hoard [8], LFMalloc, Vam [10], Streamflow [11] and other state-of-the-art concurrent and sequential allocators. The TBB scalable allocator is a productization of the scalable memory allocator developed as part of the McRT research program [7, 17].

THE TBB TASK SCHEDULER

The Intel TBB task scheduler is a work-stealing scheduler. The design of the TBB scheduler is inspired by the early Cilk scheduler, which Blumofe and Leiserson [2, 3] proved has optimal space, time, and communication bounds for well-structured (“fully strict”) programs. In a system that uses work-stealing, each thread maintains a local pool of tasks that are ready to run. Using local pools avoids the contention that may arise with the use of a global task queue. When executed, a task performs work and also may create additional tasks that are placed in the local pool.
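The per-thread ready pool, together with the stealing step covered in the next paragraph, can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the TBB implementation: `ReadyPool` and `NextTask` are hypothetical names, tasks are reduced to plain integers, each pool is guarded by a simple mutex, and the choice that the owner takes the newest task while a thief takes the oldest is borrowed from Cilk-style schedulers rather than taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <random>
#include <vector>

// Hypothetical per-thread ready pool; a "task" is just an int in this sketch.
class ReadyPool {
public:
    // The owning thread places newly created tasks in its own pool.
    void push(int task) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(task);
    }
    // Owner side: take the most recently added task (assumed LIFO order).
    bool pop(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        out = tasks_.back();
        tasks_.pop_back();
        return true;
    }
    // Thief side: take the oldest task (assumption, Cilk-style).
    bool steal(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        out = tasks_.front();
        tasks_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<int> tasks_;
};

// Fetch the next task for worker `self`: try the local pool first, then make
// one steal attempt against a uniformly random other worker's pool.
bool NextTask(std::vector<ReadyPool>& pools, std::size_t self,
              std::mt19937& rng, int& out) {
    if (pools[self].pop(out)) return true;
    if (pools.size() < 2) return false;  // nobody to steal from
    std::uniform_int_distribution<std::size_t> pick(0, pools.size() - 2);
    std::size_t victim = pick(rng);
    if (victim >= self) ++victim;        // skip our own index
    return pools[victim].steal(out);
}
```

With two pools and tasks 1, 2, 3 pushed to pool 1, a call to `NextTask` for worker 0 steals task 1 (the oldest), while a call for worker 1 pops task 3 (the newest). A production scheduler would typically avoid a lock on the owner's fast path; the mutex here only keeps the sketch short.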
If a thread’s pool becomes empty, it attempts to steal a task from the pool of another, randomly chosen thread. This approach is in contrast to static scheduling methods, where threads are assigned work up-front, and to other dynamic scheduling methods, where a central pool of tasks (or iterations) is maintained.

Blumofe and Leiserson [2, 3] showed that the expected parallel runtime of applications scheduled by the Cilk scheduler is $E[T_P] = O(T_1/P + T_\infty)$, where $P$ is the number of processors, $T_1$ is the “work” or sequential time of the application, and $T_\infty$ is the critical path length. This optimal bound shows that as $P \to \infty$, the expected time is only limited by the critical path length (the sequential part) of the application. To achieve these same optimal bounds, the TBB task scheduler also uses a randomized work-stealing algorithm. An overview of its implementation is provided in the following section.

An Overview of the Task Scheduler Design

The TBB task scheduler evaluates task graphs. A task graph is a directed graph where nodes are tasks, and each node points to its parent, which is another task that is waiting on it to complete, or NULL. Each task has a refcount that counts the number of tasks that have it as their parent. Each task also has a depth, which is usually one more than the depth of its parent. The work of the task is performed by a user-defined function execute that is encapsulated within the task object.

To assist in providing an overview of the Intel TBB task scheduler, we use calculation of the nth Fibonacci number as a running example. A serial implementation of our Fibonacci example is shown below:

long SerialFib( long n ) {
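The task-graph fields described above (parent pointer, refcount, depth) can be made concrete with a small single-threaded sketch that evaluates Fibonacci as a graph of tasks. Everything here is an assumption made for illustration: `FibTask` and `TaskGraphFib` are hypothetical names and do not correspond to the TBB `task` API, a single LIFO vector stands in for one thread's ready pool, the root is given depth 0, and `SerialFib` is an ordinary recursive definition supplied for comparison, not necessarily the paper's listing.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical task node mirroring the fields named in the text.
struct FibTask {
    long n = 0;                 // argument of this task
    FibTask* parent = nullptr;  // task waiting on this one, or nullptr (root)
    long* out = nullptr;        // where this task stores its result
    int refcount = 0;           // number of children that have not finished
    int depth = 0;              // one more than the parent's depth
    long x = 0, y = 0;          // result slots written by the two children
};

// Plain recursive reference implementation (assumed, for comparison only).
long SerialFib(long n) {
    return n < 2 ? n : SerialFib(n - 1) + SerialFib(n - 2);
}

long TaskGraphFib(long n) {
    std::vector<std::unique_ptr<FibTask>> owned;  // keeps all nodes alive
    std::vector<FibTask*> ready;                  // one local ready pool (LIFO)
    long result = 0;

    // Create a task, link it to its parent, and place it in the ready pool.
    auto spawn = [&](long k, FibTask* parent, long* out) {
        owned.push_back(std::make_unique<FibTask>());
        FibTask* t = owned.back().get();
        t->n = k;
        t->parent = parent;
        t->out = out;
        t->depth = parent ? parent->depth + 1 : 0;
        ready.push_back(t);
    };

    spawn(n, nullptr, &result);
    while (!ready.empty()) {
        FibTask* t = ready.back();
        ready.pop_back();
        if (t->n >= 2) {
            // Interior task: create two children and wait for both.
            t->refcount = 2;
            spawn(t->n - 1, t, &t->x);
            spawn(t->n - 2, t, &t->y);
            continue;
        }
        // Leaf task: finish, then walk up while each parent's refcount hits 0.
        *t->out = t->n;
        for (FibTask* p = t->parent; p && --p->refcount == 0; p = p->parent) {
            *p->out = p->x + p->y;  // both children done, so the parent completes
        }
    }
    return result;
}
```

For example, `TaskGraphFib(10)` returns 55, the same value as `SerialFib(10)`. In a multithreaded scheduler the refcount decrement would need to be atomic, since children of the same parent may finish on different threads.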


144 citations

Year	Papers
2014	2
2012	1
2008	9
2007	33
2006	29
2005	23


Papers published on a yearly basis

Papers

Intel® Virtualization Technology for Directed I/O

Power and Thermal Management in the Intel Core Duo Processor

Intel® Virtualization Technology: Hardware Support for Efficient Processor Virtualization

Nano and Micro Technology-Based Next-Generation Package-Level Cooling Solutions

The Foundations for Scalable Multicore Software in Intel Threading Building Blocks
