Using processor affinity in loop scheduling on shared-memory multiprocessors

doi:10.1109/71.273046

Journal Article•10.1109/71.273046

Evangelos P. Markatos, +1 more

- 01 Apr 1994

- IEEE Transactions on Parallel and Distri...

- Vol. 5, Iss: 4, pp 379-400

226

TL;DR: The authors propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds.

Abstract: Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds.
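The abstract does not describe the mechanics of the proposed algorithm, so the following Python sketch is only an illustration of one way the three goals could be combined, under stated assumptions. It assumes each processor owns a contiguous block of iterations (so repeated executions of the loop touch the same data), takes a fraction 1/k of its own remaining iterations per queue access (few synchronizations), and takes a fraction 1/P from the most loaded queue once its own is empty (load balance). The names `make_queues` and `next_chunk`, the parameter `k`, and the exact fractions are hypothetical choices for this sketch, not details taken from this page.

```python
from math import ceil


def make_queues(n_iters, n_procs):
    """Split iterations 0..n_iters-1 into contiguous per-processor ranges [lo, hi)."""
    chunk = ceil(n_iters / n_procs)
    return [
        [min(p * chunk, n_iters), min((p + 1) * chunk, n_iters)]
        for p in range(n_procs)
    ]


def next_chunk(queues, p, k):
    """Return (lo, hi, owner) of the next chunk for processor p, or None if no work is left.

    Assumption of this sketch: a real implementation would guard each
    queue with its own lock; here the calls are made sequentially.
    """
    lo, hi = queues[p]
    if hi > lo:
        # Local work remains: take 1/k of it from the front of our own queue.
        take = ceil((hi - lo) / k)
        queues[p][0] = lo + take
        return lo, lo + take, p

    # Local queue is empty: pick the queue with the most remaining iterations.
    victim = max(range(len(queues)), key=lambda q: queues[q][1] - queues[q][0])
    vlo, vhi = queues[victim]
    if vhi == vlo:
        return None  # every queue is empty
    # Take 1/P of the victim's remaining iterations from the back of its queue.
    take = ceil((vhi - vlo) / len(queues))
    queues[victim][1] = vhi - take
    return vhi - take, vhi, victim


if __name__ == "__main__":
    queues = make_queues(16, 4)
    # Processor 0 drains its own queue, then starts taking work from others.
    while (chunk := next_chunk(queues, 0, k=4)) is not None:
        lo, hi, owner = chunk
        print(f"processor 0 runs iterations [{lo}, {hi}) owned by processor {owner}")
```

Because the initial ranges depend only on the iteration count and the processor count, a processor that is not overloaded keeps executing the same iterations each time the loop runs, which is what lets data stay in its cache or local memory.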

Citations

Book Chapter•10.1007/978-3-540-89740-8_2

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

John A. Stratton, +2 more

- 28 Nov 2008

TL;DR: A framework called MCUDA is described, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs and argues that CUDA can be an effective data-parallel programming model for more than just GPU architectures.

240

Journal Article•10.1006/JPDC.1997.1339

Customized Dynamic Load Balancing for a Network of Workstations

Mohammed J. Zaki, +2 more

- 15 Jun 1997

- Journal of Parallel and Distributed Comp...

TL;DR: It is shown that different load balancing schemes are best for different applications under varying program and system parameters, and a hybrid compile-time and run-time modeling and decision process which selects (customizes) the best scheme is presented.

156

•Journal Article

Customized Dynamic Load Balancing for a Network of Workstations

Mohammed J. Zaki, +2 more

- 01 Jan 1995

- IEEE Transactions on Reliability

TL;DR: A hybrid compile time and run time modeling and decision process which selects (customizes) the best scheme, along with automatic generation of parallel code with calls to a run time library for load balancing is presented.

139

Proceedings Article•10.1145/1772954.1772971

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

John A. Stratton, +6 more

- 24 Apr 2010

TL;DR: Techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms are described, and reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach.

91

Proceedings Article•10.1145/195473.195583

The effectiveness of multiple hardware contexts

Radhika Thekkath, +1 more

- 01 Nov 1994

TL;DR: The usefulness of multiple hardware contexts depends on: program data locality, cache organization and degree of multiprocessing, and the ability of an additional processor to exploit program parallelism.

82

References

•Book

Computer Architecture: A Quantitative Approach

John L. Hennessy, +1 more

- 01 Dec 1989

TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.

12.6K

Journal Article•10.1109/TSE.1985.231547

Allocating Independent Subtasks on Parallel Processors

Clyde P. Kruskal, +1 more

- 01 Oct 1985

- IEEE Transactions on Software Engineerin...

TL;DR: It is shown that allocating an equal number of subtasks to each processor all at once has good efficiency, as a consequence of a rather general theorem which shows how some consequences of the central limit theorem hold even when one cannot prove that the central limit theorem applies.

403

Proceedings Article•10.1145/107971.107985

The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications

Anoop Gupta, +2 more

- 02 Apr 1991

TL;DR: This paper uses detailed simulation studies to evaluate the performance of several different scheduling strategies, and shows that in situations where the number of processes exceeds the number of processors, regular priority-based scheduling in conjunction with busy-waiting synchronization primitives results in extremely poor processor utilization.

241

Journal Article•10.1109/12.40843

The performance implications of thread management alternatives for shared-memory multiprocessors

Thomas Anderson, +2 more

- 01 Dec 1989

- IEEE Transactions on Computers

TL;DR: An Ethernet-style backoff algorithm is presented that largely eliminates the effect of normal methods of critical resource waiting, and can be used to improve throughput, and in some circumstances to avoid locking, improving latency as well.

201

Proceedings Article•10.1145/74850.74855

The implementation of a coherent memory abstraction on a NUMA multiprocessor: experiences with platinum

Alan L. Cox, +1 more

- 01 Nov 1989

TL;DR: The design and implementation of the PLATINUM memory management system is described, emphasizing the coherent memory, and the cost and performance of a set of application programs running on PLATINUM are measured.

144

Related Papers (5)

Allocating Independent Subtasks on Parallel Processors

Adaptively scheduling parallel loops in distributed shared-memory systems

Locality and Loop Scheduling on NUMA Multiprocessors

A dynamic scheduling method for irregular parallel programs

Load-sharing in heterogeneous systems via weighted factoring