Conference
International Parallel Processing Symposium
About: International Parallel Processing Symposium is an academic conference. The conference publishes majorly in the area(s): Parallel algorithm & Computer science. Over the lifetime, 1102 publications have been published by the conference receiving 17022 citations.
Papers
Proceedings Article•
24 Aug 1999
TL;DR: This work describes a resource management architecture that distributes the resource management problem among distinct local manager, resource broker, and resource co-allocator components and defines an extensible resource specification language to exchange information about requirements.
Abstract: Metacomputing systems are intended to support remote and/or concurrent use of geographically distributed computational resources. Resource management in such systems is complicated by five concerns that do not typically arise in other situations: site autonomy and heterogeneous substrates at the resources, and application requirements for policy extensibility, co-allocation, and online control. We describe a resource management architecture that addresses these concerns. This architecture distributes the resource management problem among distinct local manager, resource broker, and resource co-allocator components and defines an extensible resource specification language to exchange information about requirements. We describe how these techniques have been implemented in the context of the Globus metacomputing toolkit and used to implement a variety of different resource management strategies. We report on our experiences applying our techniques in a large testbed, GUSTO, incorporating 15 sites, 330 computers, and 3600 processors.
849 citations
Proceedings Article•
1 Jan 2000TL;DR: This work proposes and evaluates several algorithms for supporting advanced reservation of resources in supercomputing scheduling systems and finds that the wait times of applications submitted to the queue increases when reservations are supported and the increase depends on how reservations aresupported.
Abstract: Some computational grid applications have very large resource requirements and need simultaneous access to resources from more than one parallel computer. Current scheduling systems do not provide mechanisms to gain such simultaneous access without the help of human administrators of the computer systems. In this work, we propose and evaluate several algorithms for supporting advanced reservation of resources in supercomputing scheduling systems. These advanced reservations allow users to request resources from scheduling systems at specific times. We find that the wait times of applications submitted to the queue increases when reservations are supported and the increase depends on how reservations are supported. Further, we find that the best performance is achieved when we assume that applications can be terminated and restarted, backfilling is performed, and relatively accurate run-time predictions are used.
311 citations
12 Apr 1999
TL;DR: ARMCI provides one-sided communication capabilities for distributed array libraries and compiler run-time systems and supports remote memory copy, accumulate, and synchronization operations optimized for non-contiguous data transfers including strided and generalized UNIX I/O vector interfaces.
Abstract: This paper introduces a new portable communication library called ARMCI. ARMCI provides one-sided communication capabilities for distributed array libraries and compiler run-time systems. It supports remote memory copy, accumulate, and synchronization operations optimized for non-contiguous data transfers including strided and generalized UNIX I/O vector interfaces. The library has been employed in the Global Arrays shared memory programming toolkit and Adlib, a Parallel Compiler Run-time Consortium run-time system.
284 citations
1 Apr 1994
TL;DR: A method to characterize the performance of proposed queue lock algorithms, and applies it to previously published algorithms conclude that the M lock is the best overall queue lock for the class of architectures studied.
Abstract: Large-scale shared-memory multiprocessors typically have long latencies for remote data accesses. A key issue for execution performance of many common applications is the synchronization cost. The communication scalability of synchronization has been improved by the introduction of queue-based spin-locks instead of Test&(Test&Set). For architectures with long access latencies for global data, attention should also be paid to the number of global accesses that are involved in synchronization. We present a method to characterize the performance of proposed queue lock algorithms, and apply it to previously published algorithms. We also present two new queue locks, the LH lock and the M lock. We compare the locks in terms of performance, memory requirements, code size and required hardware support. The LH lock is the simplest of all the locks, yet requires only an atomic swap operation. The M lock is superior in terms of global accesses needed to perform synchronization and still competitive in all other criteria. We conclude that the M lock is the best overall queue lock for the class of architectures studied. >
227 citations
1 Apr 1997
TL;DR: The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy, and complement traffic reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies.
Abstract: The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we study the communication performance of a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of an adaptive algorithm that utilize wormhole routing with one, two and four virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with one virtual channel, 55-60% with two virtual channels and around 75% with four virtual channels. The complement traffic, a representative of the class of the congestion-free communication patterns, reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies.
220 citations
Performance Metrics
| Year | Papers |
|---|---|
| 2000 | 6 |
| 1999 | 244 |
| 1998 | 118 |
| 1997 | 116 |
| 1996 | 9 |
| 1995 | 120 |