Work stealing

Papers published on a yearly basis

Papers

Journal Article•10.1006/JPDC.1996.0107•

Cilk: An Efficient Multithreaded Runtime System

Robert D. Blumofe¹, Christopher F. Joerg¹, Bradley C. Kuszmaul¹, Charles E. Leiserson¹, Keith H. Randall¹, Yuli Zhou¹•Institutions (1)

Massachusetts Institute of Technology¹

25 Aug 1996-Journal of Parallel and Distributed Computing

TL;DR: It is shown that on real and synthetic applications, the “work” and “critical-path length” of a Cilk computation can be used to model performance accurately, and it is proved that for the class of “fully strict” (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal.

1,751 citations
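
The two quantities named in the TL;DR can be made concrete with a small sketch. The Python below computes the work (total task time) and critical-path length (longest dependency chain) of a toy task graph. The function names, the dictionary-based graph encoding, and the example graph are illustrative assumptions, not taken from the Cilk paper.

```python
from functools import lru_cache

def work_and_span(cost, deps):
    """Return (work, span) for a task DAG.

    cost: dict mapping task name -> execution time
    deps: dict mapping task name -> list of tasks it must wait for
    (hypothetical encoding chosen for this sketch)
    """
    work = sum(cost.values())  # T_1: time on one processor

    @lru_cache(maxsize=None)
    def finish(task):
        # Earliest finish time of `task` with unlimited processors.
        return cost[task] + max((finish(d) for d in deps.get(task, ())), default=0)

    span = max(finish(t) for t in cost)  # T_inf: critical-path length
    return work, span

def lower_bound(work, span, p):
    """No schedule on p processors can beat max(T_1/p, T_inf)."""
    return max(work / p, span)

# Toy diamond-shaped graph: a -> (b, c) -> d
cost = {"a": 1, "b": 4, "c": 2, "d": 1}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
work, span = work_and_span(cost, deps)
print(work, span, lower_bound(work, span, 2))
```

For this toy graph the work is 8 and the critical-path length is 6 (the chain a, b, d), so adding processors beyond two cannot shorten the run.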

Journal Article•10.1145/324133.324234•

Scheduling multithreaded computations by work stealing

Robert D. Blumofe¹, Charles E. Leiserson²•Institutions (2)

University of Texas at Austin¹, Massachusetts Institute of Technology²

01 Sep 1999-Journal of the ACM

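TL;DR: This paper gives the first provably good work-stealing scheduler for multithreaded computations with dependencies, and shows that the expected time to execute a fully strict computation on $P$ processors using this scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time and $T_\infty$ is the minimum execution time with an infinite number of processors.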

Abstract: This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing,” in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

1,643 citations
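
To illustrate the stealing discipline described in the abstract, here is a minimal single-threaded simulation in Python. It is a sketch under stated assumptions, not the paper's algorithm: every task takes one time step, tasks form a Fibonacci-shaped spawn tree with no synchronization back to the parent, a steal attempt costs the thief one step, and the name `simulate` and its parameters are hypothetical.

```python
import random
from collections import deque

def simulate(n, p, seed=0):
    """Simulate randomized work stealing on a fib-shaped spawn tree.

    Returns (steps, work, span, steals), where work is the number of
    unit-time tasks executed and span is the depth of the deepest task.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append((n, 1))  # (problem size, depth); root starts on worker 0
    steps = work = span = steals = 0
    while any(deques):  # stop once every deque is empty
        steps += 1
        for w in range(p):
            dq = deques[w]
            if dq:
                size, depth = dq.pop()  # owner takes from the bottom
                work += 1
                span = max(span, depth)
                if size >= 2:  # spawn two children onto the bottom
                    dq.append((size - 1, depth + 1))
                    dq.append((size - 2, depth + 1))
            elif p > 1:
                # Idle worker: pick a random victim, steal from the top.
                victim = rng.choice([v for v in range(p) if v != w])
                if deques[victim]:
                    dq.append(deques[victim].popleft())
                    steals += 1
    return steps, work, span, steals

steps, work, span, steals = simulate(10, 4)
print(steps, work, span, steals, work / 4 + span)
```

The last printed value is $T_1/P + T_\infty$ with the hidden constant taken as 1, which gives a rough point of comparison for the simulated step count. With one worker no steals occur and the step count equals the work.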

Proceedings Article•10.1145/1094811.1094852•

X10: an object-oriented approach to non-uniform cluster computing

Philippe Charles¹, Christian Grothoff², Vijay Saraswat¹, Christopher Michael Donawa¹, Allan H. Kielstra¹, Kemal Ebcioglu¹, Christoph von Praun¹, Vivek Sarkar¹•Institutions (2)

IBM¹, University of California, Los Angeles²

12 Oct 2005

TL;DR: X10, a modern object-oriented programming language, is designed for high-performance, high-productivity programming of NUCC systems; the paper presents an overview of the X10 programming model and language, experience with the reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages.

Abstract: It is now well established that the device scaling predicted by Moore's Law is no longer a viable option for increasing the clock frequency of future uniprocessor systems at the rate that had been sustained during the last two decades. As a result, future systems are rapidly moving from uniprocessor to multiprocessor configurations, so as to use parallelism instead of frequency scaling as the foundation for increased compute capacity. The dominant emerging multiprocessor structure for the future is a Non-Uniform Cluster Computing (NUCC) system with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in horizontally scalable cluster configurations such as blade servers. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. Current OO language facilities for concurrent and distributed programming are inadequate for addressing the needs of NUCC systems because they do not support the notions of non-uniform data access within a node, or of tight coupling of distributed nodes. We have designed a modern object-oriented programming language, X10, for high performance, high productivity programming of NUCC systems. A member of the partitioned global address space family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; a construct for termination detection (finish); the use of lock-free synchronization (atomic blocks); and the manipulation of cluster-wide global data structures. We present an overview of the X10 programming model and language, experience with our reference implementation, and results from some initial productivity comparisons between the X10 and Java™ languages.

1,540 citations

Proceedings Article•10.1145/277650.277725•

The implementation of the Cilk-5 multithreaded language

Matteo Frigo¹, Charles E. Leiserson¹, Keith H. Randall¹•Institutions (1)

Massachusetts Institute of Technology¹

1 May 1998

TL;DR: Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler are presented.

Abstract: The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.

1,450 citations
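
The work-first principle in the abstract can be illustrated with a short arithmetic sketch. It assumes a running-time model of the form $T_P \approx c_1 T_S/P + c_\infty T_\infty$, where $T_S$ is the serial running time, $c_1$ is the overhead factor on the work, and $c_\infty$ is the overhead factor on the critical path. The model form, the function name `run_time`, and all numbers below are assumptions made for illustration, not measurements from the paper.

```python
def run_time(t_serial, t_inf, p, c1, c_inf):
    """Assumed model: T_P ~ c1 * T_S / P + c_inf * T_inf."""
    return c1 * t_serial / p + c_inf * t_inf

# Hypothetical program: serial time 1,000,000, critical path 100, 8 processors.
baseline = run_time(1_000_000, 100, 8, c1=1.25, c_inf=10)

# Work-first trade: cut work overhead by a tenth, double critical-path overhead.
work_first = run_time(1_000_000, 100, 8, c1=1.125, c_inf=20)

print(baseline, work_first)
```

When the parallelism is much larger than the processor count, as in this made-up case, the work term dominates, so lowering $c_1$ pays off even though $c_\infty$ doubles.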

Journal Article•10.1007/S00224-001-0004-Z•

Thread Scheduling for Multiprogrammed Multiprocessors

Nimar S. Arora¹, Robert D. Blumofe¹, C. G. Plaxton¹•Institutions (1)

University of Texas at Austin¹

01 Jan 2001-Theory of Computing Systems / Mathematical Systems Theory

TL;DR: This work presents a user-level thread scheduler for shared-memory multiprocessors, and it achieves linear speedup whenever $P$ is small relative to the parallelism $T_1/T_\infty$.

Abstract: We present a user-level thread scheduler for shared-memory multiprocessors, and we analyze its performance under multiprogramming. We model multiprogramming with two scheduling levels: our scheduler runs at user-level and schedules threads onto a fixed collection of processes, while below this level, the operating system kernel schedules processes onto a fixed collection of processors. We consider the kernel to be an adversary, and our goal is to schedule threads onto processes such that we make efficient use of whatever processor resources are provided by the kernel. Our thread scheduler is a non-blocking implementation of the work-stealing algorithm. For any multithreaded computation with work $T_1$ and critical-path length $T_\infty$, and for any number $P$ of processes, our scheduler executes the computation in expected time $O(T_1/P_A + T_\infty P/P_A)$, where $P_A$ is the average number of processors allocated to the computation by the kernel. This time bound is optimal to within a constant factor, and achieves linear speedup whenever $P$ is small relative to the parallelism $T_1/T_\infty$.

513 citations
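
As a worked example of the time bound in the abstract, the sketch below evaluates the two leading terms $T_1/P_A + T_\infty P/P_A$ for made-up inputs. The constant hidden by the $O(\cdot)$ is taken as 1, and the function name `multiprogrammed_bound` and the numbers are assumptions for illustration only.

```python
def multiprogrammed_bound(t1, t_inf, p, p_avg):
    """Leading terms of O(T_1/P_A + T_inf * P/P_A), hidden constant omitted."""
    if p_avg <= 0:
        raise ValueError("average processor allocation must be positive")
    return t1 / p_avg + t_inf * p / p_avg

# Hypothetical computation: work 1,000,000, critical path 1,000, 8 processes.
dedicated = multiprogrammed_bound(1_000_000, 1_000, p=8, p_avg=8)
shared = multiprogrammed_bound(1_000_000, 1_000, p=8, p_avg=4)

print(dedicated, shared)
```

Here the parallelism $T_1/T_\infty$ is 1,000, far above $P = 8$, so the first term dominates and halving the average processor allocation roughly doubles the bound.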

Year	Papers
2021	19
2020	15
2019	23
2018	31
2017	37
2016	40

Related Topics (5)

Performance Metrics