Journal Article, DOI: 10.1109/71.273046
Using processor affinity in loop scheduling on shared-memory multiprocessors
226
Citations
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
John A. Stratton, Sam S. Stone, Wen-mei W. Hwu +2 more
- 28 Nov 2008
TL;DR: Describes MCUDA, a framework that allows CUDA programs to execute efficiently on shared-memory, multi-core CPUs, and argues that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
240
Customized Dynamic Load Balancing for a Network of Workstations
TL;DR: It is shown that different load balancing schemes are best for different applications under varying program and system parameters, and a hybrid compile-time and run-time modeling and decision process which selects (customizes) the best scheme is presented.
156
•Journal Article
Customized Dynamic Load Balancing for a Network of Workstations
TL;DR: A hybrid compile-time and run-time modeling and decision process that selects (customizes) the best scheme is presented, along with automatic generation of parallel code with calls to a run-time library for load balancing.
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
John A. Stratton, Vinod Grover, Jaydeep Marathe, Bastiaan Aarts, Michael Murphy, Ziang Hu, Wen-mei W. Hwu +6 more
- 24 Apr 2010
TL;DR: Techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms are described, and reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach.
91
The effectiveness of multiple hardware contexts
Radhika Thekkath, Susan J. Eggers +1 more
- 01 Nov 1994
TL;DR: The usefulness of multiple hardware contexts depends on: program data locality, cache organization and degree of multiprocessing, and the ability of an additional processor to exploit program parallelism.
82
References
•Book
Computer Architecture: A Quantitative Approach
John L. Hennessy, David A. Patterson +1 more
- 01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
12.6K
Allocating Independent Subtasks on Parallel Processors
Clyde P. Kruskal, A. Weiss +1 more
TL;DR: It is shown that allocating an equal number of subtasks to each processor all at once has good efficiency, as a consequence of a rather general theorem which shows how some consequences of the central limit theorem hold even when one cannot prove that the central limit theorem applies.
403
The impact of operating system scheduling policies and synchronization methods on performance of parallel applications
Anoop Gupta, Andrew Tucker, Shigeru Urushibara +2 more
- 02 Apr 1991
TL;DR: This paper uses detailed simulation studies to evaluate the performance of several different scheduling strategies, and shows that in situations where the number of processes exceeds the number of processors, regular priority-based scheduling in conjunction with busy-waiting synchronization primitives results in extremely poor processor utilization.
241
The performance implications of thread management alternatives for shared-memory multiprocessors
TL;DR: An Ethernet-style backoff algorithm is presented that largely eliminates the effect of normal methods of critical resource waiting, and can be used to improve throughput, and in some circumstances to avoid locking, improving latency as well.
201
The implementation of a coherent memory abstraction on a NUMA multiprocessor: experiences with PLATINUM
Alan L. Cox, Robert J. Fowler +1 more
- 01 Nov 1989
TL;DR: The design and implementation of the PLATINUM memory management system is described, emphasizing the coherent memory, and the cost and performance of a set of application programs running on PLATINUM are measured.
144