TL;DR: The experimental results indicate that the QsNET provides excellent performance in most cases, with excellent contention resolution mechanisms, and some important guidelines for applications and I/O servers mapping on large-scale clusters are given.
Abstract: In this paper we present an in-depth description of the Quadrics interconnection network (QsNET) and an experimental performance evaluation on a 64-node AlphaServer cluster. We explore several performance dimensions and scaling properties of the network by using a collection of benchmarks, based on different traffic patterns. Experiments with permutation patterns and uniform traffic are conducted to illustrate the basic characteristics of the interconnect under conditions commonly created by parallel scientific applications. Moreover, the behavior of the QsNET under I/O traffic, and the influence of the placement of the I/O servers are analyzed. The effects of using dedicated I/O nodes or shared I/O nodes are also exposed. In addition, we evaluate how background I/O traffic interferes with other parallel applications running concurrently. The experimental results indicate that the QsNET provides excellent performance in most cases, with excellent contention resolution mechanisms. Some important guidelines for applications and I/O servers mapping on large-scale clusters are also given.
TL;DR: This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations; its novel architecture is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
Abstract: This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault-tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart, and uses a novel architecture that is both flexible and portable and keeps group communication outside the critical data path, for maximum performance.
TL;DR: This paper uses a commercial game server to gain insight into interactive, multi-player game servers and proposes a methodology that deals with the related issues in benchmarking this class of applications and the requirements they impose on modern architectures.
Abstract: With the recent explosion in deployment of services to large numbers of customers over the Internet and in global services in general, issues related to the architecture of scalable servers are becoming increasingly important. However, our understanding of these types of applications is currently limited, especially on how well they scale to support large numbers of users. One such novel commercial class of applications is interactive, multi-player game servers. Multi-player games are both an important class of commercial applications (in the entertainment industry) and valuable in understanding the architectural requirements of scalable services. They impose requirements on system performance, scalability, and availability, stressing multiple aspects of the system architecture (e.g., compute cycles and network I/O). Recently there has been a lot of interest in client-side issues with respect to games. However, there has been little or no work on the server side. In this paper we use a commercial game server to gain insight into this class of applications and the requirements they impose on modern architectures. We find that: (1) In terms of the benchmarking methodology, interactive game servers are very different from scientific workloads. We propose a methodology that deals with the related issues in benchmarking this class of applications. Our methodology bears many similarities to methodologies used in benchmarking online transaction processing (OLTP) systems. (2) Current, sequential game servers can support at most a few tens of users (60–100) on existing processors. (3) The bottleneck in the server is both game-related and network-related processing (about 50–50). (4) Network bandwidth requirements are not an important issue for the numbers of players we are interested in. (5) The processor achieves a surprisingly low IPC of 0.416.
TL;DR: This paper introduces the concept of store-and-bypass for divisible load theory and shows that the model outperforms existing models such as the Cheng–Robertazzi model.
Abstract: A new model for the divisible load problem is introduced. Its characteristics are analyzed. Optimal load distribution algorithms on the new model are presented for the tree network and the linear network. Applications that fit our model are briefly described. We show that our model outperforms existing models such as the Cheng–Robertazzi model. We show that the linear model is equivalent to a single-level tree network if the intermediate processors follow the store-and-bypass model rather than the store-and-forward communication model. This paper introduces the concept of store-and-bypass for divisible load theory.
TL;DR: A new robust method is proposed to solve the problem of finding the optimal distribution of computations on star networks and on networks in which binomial trees can be embedded (meshes, hypercubes, multistage interconnections).
Abstract: In this work we consider scheduling divisible loads on a distributed computing system with limited available memory. The communication delays and heterogeneity of the system are taken into account. The problem studied consists in finding a distribution of the load for which the communication and computation time is the shortest possible. A new robust method is proposed to solve the problem of finding the optimal distribution of computations on star networks and on networks in which binomial trees can be embedded (meshes, hypercubes, multistage interconnections). We demonstrate that in many cases memory limitations do not restrict the efficiency of parallel processing as much as computation and communication speeds do.
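For readers unfamiliar with divisible load scheduling on a star network, the following is a minimal sketch of the classic single-installment formulation, not the method proposed in the paper. It assumes that the originator sends shares to the workers one after another, that communication and computation costs are linear in the share size, that the originator does not compute, and that memory is unlimited (the paper's memory constraints are not modeled). The names `star_shares`, `finish_times`, `z`, and `w` are illustrative. Under these assumptions all workers finish together when alpha[i] * w[i] = alpha[i+1] * (z[i+1] + w[i+1]).

```python
def star_shares(z, w):
    """Load fractions for workers served in order over a star network.

    z[i]: time to send one unit of load to worker i (assumed linear cost)
    w[i]: time for worker i to process one unit of load
    """
    if len(z) != len(w) or not z:
        raise ValueError("z and w must be non-empty and of equal length")
    # Unnormalised shares from alpha[i] * w[i] == alpha[i+1] * (z[i+1] + w[i+1])
    raw = [1.0]
    for i in range(1, len(z)):
        raw.append(raw[-1] * w[i - 1] / (z[i] + w[i]))
    total = sum(raw)
    return [r / total for r in raw]  # normalise so the shares sum to 1


def finish_times(alpha, z, w):
    """Finish time of each worker when shares are sent one after another."""
    times, sent = [], 0.0
    for a, zi, wi in zip(alpha, z, w):
        sent += a * zi               # the link is busy until this share arrives
        times.append(sent + a * wi)  # the worker computes after receiving it
    return times
```

Calling `finish_times(star_shares(z, w), z, w)` returns the same value for every worker, which is the equal-finish-time property this formulation relies on.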
TL;DR: This paper proposes a new strategy for scheduling moldable jobs that outperforms not only the traditional rigid scheme, but also the previous moldable scheduling policies, by doing uniformly well under different load conditions and for jobs of different scalabilities.
Abstract: Moldable job scheduling has been proved to be effective compared to traditional job scheduling policies. It is based on the observation that most jobs submitted to a space-shared parallel system could actually reduce their response times if they were allowed to take any number of processors in a user-specified range. Previous approaches to scheduling of moldable jobs focused on when and how to choose the number of processors for a moldable job. Careful experimental evaluations show that these techniques are not robust. This paper proposes a new strategy for scheduling moldable jobs that outperforms not only the traditional rigid scheme, but also the previous moldable scheduling policies, by doing uniformly well under different load conditions and for jobs of different scalabilities.
TL;DR: JavaSplit is a portable runtime for distributed execution of multithreaded Java programs; it transparently distributes the threads and objects of an application among the participating nodes without modifying the Java multithreaded programming conventions.
Abstract: This paper presents JavaSplit, a portable runtime for distributed execution of multithreaded Java programs. JavaSplit transparently distributes threads and objects of an application among the participating nodes. Thus, it gains augmented computational power and increased memory capacity without modifying the Java multithreaded programming conventions. JavaSplit works by rewriting the bytecodes of a given parallel application, transforming it into a distributed application that incorporates all the runtime logic. Each runtime node carries out its part of the resulting distributed computation using nothing but its local standard (unmodified) Java Virtual Machine (JVM). This is unlike previous Java-based distributed runtime systems, which use a specialized JVM or utilize unconventional programming constructs. Since JavaSplit is orthogonal to the implementation of a local JVM, it achieves portability across any existing platform and allows each node to locally optimize the performance of its JVM, e.g., via a just-in-time compiler (JIT).
TL;DR: The Multi-Installment Balancing Strategy (MIBS) presented in this paper addresses both release-time and buffer-capacity constraints by building on top of the analytical solutions derived by a buffer capacity-unaware approach.
Abstract: In this paper we address the problem of processing a computationally intensive divisible load with high memory requirements on a bus network. Each network node is assumed to have a limited memory capacity (buffer space), while at the same time being available for processing only after a specific time (release time). The combined influence of the release times and the limited buffer capacity available is considered in the problem formulation, with the objective of minimizing the overall processing time of the divisible load. In the existing literature, these two issues have been considered independently, although in practice they are commonly found to coexist. The Multi-Installment Balancing Strategy (MIBS) presented in this paper addresses both of these constraints by building on top of the analytical solutions derived by a buffer capacity-unaware approach. MIBS monitors the available resources and adapts the processing and communication phases according to their availability. Toward this goal, single- and/or multi-installment scheduling is used. The description of the algorithms is accompanied by simulation experiments that highlight the behavior of MIBS. It should be stressed that the use of MIBS allows the processing of loads that exceed by far the total memory capacity of the available machines, while at the same time exhibiting processing times that match the ones predicted by strategies that ignore the memory constraints.
TL;DR: In this paper, the authors present an approach used as a basis for system adaptation in which Grid jobs are maintained at runtime and a reflective technique is used to simplify the adaptation in the Grid application.
Abstract: A Grid system must integrate heterogeneous resources with varying quality and availability. For example, the load on any given resource may increase during execution of a time-constrained job. This places importance on the system’s ability to recognize the state of these resources. This paper presents an approach used as a basis for system adaptation in which Grid jobs are maintained at runtime. A reflective technique is used to simplify the adaptation in the Grid application. The design of an adaptable Resource Broker is described and experimentally evaluated. Reflection is incorporated into the broker to separate functional and non-functional aspects of the system and facilitate the implementation of non-functional properties such as job migration. Results indicate that this approach enhances the likelihood of timely job completion in a dynamic Grid system.
TL;DR: A summary of several industry and academic perspectives on this issue expressed during a panel discussion at the Workshop for Communication Architecture for Clusters, held in conjunction with the International Parallel and Distributed Processing Symposium in April 2001, in hopes of narrowing down the design space for InfiniBand-based systems.
Abstract: InfiniBand is a new industry-wide general-purpose interconnect standard designed to provide significantly higher levels of reliability, availability, performance, and scalability than alternative server I/O technologies. More than two years after its official release, many are still trying to understand what the profitable uses for this new and promising interconnect technology are, and how this technology might evolve. In this article, we provide a summary of several industry and academic perspectives on this issue expressed during a panel discussion at the Workshop for Communication Architecture for Clusters (CAC), held in conjunction with the International Parallel and Distributed Processing Symposium (IPDPS) in April 2001, in hopes of narrowing down the design space for InfiniBand-based systems.
TL;DR: The results show that the proposed algorithms offer drastic improvements in discrete request average response times, are fair, serve continuous requests without interruptions, and that the disk technology trends are such that the expected performance benefits can be even greater in the future.
Abstract: Divisible load scenarios occur in modern media server applications since most multimedia applications typically require access to continuous and discrete data. A high performance Continuous Media (CM) server greatly depends on the ability of its disk IO subsystem to serve both types of workloads efficiently. Disk scheduling algorithms for mixed media workloads, although they play a central role in this task, have been overlooked by related research efforts. These algorithms must satisfy several stringent performance goals, such as achieving low response time and ensuring fairness, for the discrete-data workload, while at the same time guaranteeing the uninterrupted delivery of continuous data, for the continuous-data workload. The focus of this paper is on disk scheduling algorithms for mixed media workloads in a multimedia information server. We propose novel algorithms, present a taxonomy of relevant algorithms, and study their performance through experimentation. Our results show that our algorithms offer drastic improvements in discrete request average response times, are fair, serve continuous requests without interruptions, and that the disk technology trends are such that the expected performance benefits can be even greater in the future.
TL;DR: A complete exchange algorithm, the Synchronous Shuffle Exchange, which is an optimal algorithm on the non-blocking network, and a contention-aware permutation scheme, which relieves the congestion build-up at the uplink ports and improves the synchronism of the traffic information exchange between cluster nodes.
Abstract: Much effort has been devoted over the past decade to the software overhead problem, which is known as the major hindrance to high-speed communication. However, this paper shows that a low-latency communication system does not guarantee high performance, as there are other communication issues that low-latency communication alone does not fully address, such as contention and the scheduling of communication events. In this paper, we use the complete exchange operation as a case study to show that with careful design of communication schedules, we can achieve efficient communication as well as prevent congestion. We have developed a complete exchange algorithm, the Synchronous Shuffle Exchange, which is an optimal algorithm on the non-blocking network. To avoid congestion loss caused by the non-deterministic delays in communication events, a global congestion control scheme is introduced. This scheme coordinates all participating nodes to monitor and regulate the traffic load, which effectively avoids congestion loss and maintains sufficient throughput to maximize the performance. To improve the effectiveness of the congestion control scheme when working on the hierarchical network, we incorporate information on the network topology to devise a contention-aware permutation. This permutation scheme generates a communication schedule that is free of both node and switch contention and distributes the network load more evenly across the hierarchy. This relieves the congestion build-up at the uplink ports and improves the synchronism of the traffic information exchange between cluster nodes. Performance results of our implementation on a 32-node cluster with various network configurations are examined and reported in this paper.
TL;DR: In this paper, an iterative fair scheduling (iFS) scheme for input buffered switches that supports fair bandwidth distribution among the flows and achieves asymptotically 100% throughput is presented.
Abstract: Input buffered switch architecture has become attractive for implementing high performance switches for workstation clusters. It is challenging to provide a scheduling technique that is both highly efficient and fair in resource allocation. In this paper, we first introduce an iterative Fair Scheduling (iFS) scheme for input buffered switches that supports fair bandwidth distribution among the flows and achieves asymptotically 100% throughput. We then apply the idea of fair scheduling to switches with multicasting capability and propose an mFS scheme which allocates bandwidth to various flows according to their reservations. We show that mFS produces throughput comparable to the existing schemes while distributing the bandwidth as per the given reservations. Extensive simulation results are presented to validate the effectiveness of our proposed schemes.
TL;DR: In this paper, an algorithm to compute the optimal and near-optimal alignments of two DNA sequences in linear space and quadratic time is presented, which can be parallelized efficiently on a PC cluster and on a computational grid in order to reduce its runtime significantly.
Abstract: Molecular biologists frequently align DNA sequences of entire genomes to detect important matched and mismatched regions. Even though efficient dynamic programming algorithms exist for this problem, the required computing time is still very high due to the size of these sequences (usually a few million base pairs in length). Because the number of sequenced organisms is increasing rapidly, fast and accurate solutions are of highest importance to research in this area. In this paper we present an algorithm to compute the optimal and near-optimal alignments of two sequences in linear space and quadratic time. We demonstrate how this algorithm can be parallelized efficiently on a PC cluster and on a computational grid in order to reduce its runtime significantly. The grid implementation uses a hierarchical approach combining inter-cluster and intra-cluster parallelism.
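To illustrate the "linear space and quadratic time" trade-off mentioned above, here is a minimal sketch of a global alignment score computed with only two rows of the dynamic programming table. It is not the paper's algorithm: it returns only the optimal score, whereas recovering the alignment itself in linear space requires an additional divide-and-conquer step. The function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score using O(len(b)) memory.

    Time is O(len(a) * len(b)); only the previous and current rows
    of the dynamic programming table are kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[-1]
```

Because row `i` depends only on row `i - 1`, independent blocks of columns can be computed by different workers in a wavefront order, which is one common starting point for parallelizing this kind of recurrence.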
TL;DR: The paper addresses critical design issues faced on the commodity clusters and possible solutions for matching the low-level network protocol with user-level interfaces and offers some indications on what additional features would be desirable in a communication library like GM to better support one-sided communication.
Abstract: This paper describes an efficient implementation of one-sided communication on top of the GM low-level message-passing library for clusters with Myrinet. This approach is compatible with shared memory, and it exploits pipelining, nonblocking communication, and overlapping of memory registration with memory copy to maximize the transfer rate. The paper addresses critical design issues faced on commodity clusters and then describes possible solutions for matching the low-level network protocol with user-level interfaces. The performance implications of the design decisions are presented and discussed in the context of a standalone communication benchmark as well as two applications. Finally, the paper offers some indications on what additional features would be desirable in a communication library like GM to better support one-sided communication.
TL;DR: The performance of gossip services employing flat and hierarchical schemes is analyzed on an experimental testbed in terms of consensus time, resource utilization and scalability.
Abstract: Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. Extending the gossip protocol such that a system reaches consensus on detected faults can be performed via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. In this paper, the performance of gossip services employing flat and hierarchical schemes is analyzed on an experimental testbed in terms of consensus time, resource utilization and scalability. Performance associated with a hierarchically arranged gossip scheme is analyzed with varying group sizes and is shown to scale well. Resource utilization of the gossip-style failure detection and consensus service is measured in terms of network bandwidth utilization and CPU utilization. Analytical models are developed for resource utilization and performance projections are made for large system sizes.
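The basic mechanism behind gossip-style failure detection can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not the service evaluated in the paper: each node keeps a heartbeat counter per known node, merges incoming tables by taking the larger counter, and suspects a node whose counter has not increased within a timeout. The class name `GossipNode`, its methods, and the integer clock are hypothetical, and the consensus step and hierarchical layering are not modeled.

```python
class GossipNode:
    """Minimal gossip-style failure detector state for one node."""

    def __init__(self, name, peers, timeout):
        self.name = name
        self.timeout = timeout
        # node -> (highest heartbeat seen, local time it last increased)
        self.table = {p: (0, 0) for p in peers}
        self.table[name] = (0, 0)

    def tick(self, now):
        """Advance this node's own heartbeat."""
        hb, _ = self.table[self.name]
        self.table[self.name] = (hb + 1, now)

    def digest(self):
        """Heartbeat table to send to a gossip partner."""
        return {n: hb for n, (hb, _) in self.table.items()}

    def merge(self, digest, now):
        """Keep the larger heartbeat for each node and note when it grew."""
        for n, hb in digest.items():
            known, _ = self.table.get(n, (-1, now))
            if hb > known:
                self.table[n] = (hb, now)

    def suspected(self, now):
        """Nodes whose heartbeat has not increased within the timeout."""
        return sorted(n for n, (_, seen) in self.table.items()
                      if n != self.name and now - seen > self.timeout)
```

In a real deployment the digest would travel over the network to a randomly chosen peer each round; here the caller passes it directly between objects.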
TL;DR: A new scheduling algorithm is proposed which distributes load in a sequence of stages across the network; each stage brings load to a set of processors located at the same distance from the load source.
Abstract: We study the problem of scheduling a divisible load in a three-dimensional mesh of processors. The objective is to find a partition of the load into shares and a distribution of the load shares among processors which minimize the load processing time subject to the communication delays involved in sending load from one processor to another. We propose a new scheduling algorithm which distributes load in a sequence of stages across the network; each stage brings load to a set of processors located at the same distance from the load source. A key feature of our solution is that sets of processors receive load in the order of decreasing processing capacities. We call this scheduling strategy Largest Layer First. A theorem about the processing time attained by the algorithm is stated. Performance of the algorithm is compared to earlier results.
TL;DR: From the results of an evaluation project on three Beowulf type clusters, answers are derived about the viability of using cluster systems routinely in a multi-user environment with comparable maintenance cost and effort to that of an integrated parallel machine.
Abstract: We report the results of an evaluation project on three Beowulf type clusters. The purpose of this study was to assess both the performance of the clusters and the availability and quality of the software for cluster management and management of the available resources. This last goal could hardly be achieved because at the time this project was undertaken much of the management software was either very immature or not yet available. However, it was possible to assess the cluster performance both from the point of view of single program execution as well as with respect to throughput by loading the systems according to a predefined schedule via the available batch systems. To this end a set of application programs, ranging from astronomy to quantum chemistry, together with a synthetic benchmark were employed. From the results we wanted to derive answers about the viability of using cluster systems routinely in a multi-user environment with comparable maintenance cost and effort to that of an integrated parallel machine.
TL;DR: Experimental results presented in this paper demonstrate that for applications that can be broken into coarse-grained, relatively independent tasks, the opportunistic adaptive parallel computing framework can provide performance gains.
Abstract: Heterogeneous networked clusters are being increasingly used as platforms for resource-intensive parallel and distributed applications. The fundamental underlying idea is to provide large amounts of processing capacity over extended periods of time by harnessing the idle and available resources on the network in an opportunistic manner. In this paper we present the design, implementation and evaluation of a framework that uses JavaSpaces to support this type of opportunistic adaptive parallel/distributed computing over networked clusters in a non-intrusive manner. The framework targets applications exhibiting coarse grained parallelism and has three key features: (1) portability across heterogeneous platforms, (2) minimal configuration overheads for participating nodes, and (3) automated system state monitoring (using SNMP) to ensure non-intrusive behavior. Experimental results presented in this paper demonstrate that for applications that can be broken into coarse-grained, relatively independent tasks, the opportunistic adaptive parallel computing framework can provide performance gains. Furthermore, the results indicate that monitoring and reacting to the current system state minimizes the intrusiveness of the framework.
TL;DR: The Genoa Active Message MAchine (GAMMA) is a lightweight communication system based on the Active Ports paradigm, originally designed for efficient implementation over low-cost Fast Ethernet interconnects.
Abstract: The Genoa Active Message MAchine (GAMMA) is a lightweight communication system based on the Active Ports paradigm, originally designed for efficient implementation over low-cost Fast Ethernet interconnects. In this paper we report on the recently completed porting of GAMMA to the Packet Engines GNIC-II and the Netgear GA620 Gigabit Ethernet adapters, and provide a comparison among GAMMA, MPI/GAMMA, TCP/IP, and MPICH on such commodity interconnects, using different performance metrics. With a combination of low end-to-end latency (9.5 μs with GNIC-II, 32 μs with GA620) and high transmission throughput (almost 97 MByte/s with GNIC-II and 125 MByte/s with GA620, the latter obtained without changing the firmware of the adapter), GAMMA demonstrates the potential for Gigabit Ethernet lightweight protocols to yield messaging performance comparable to the best Myrinet-based messaging systems. This result is of interest, given the envisaged drop in cost of Gigabit Ethernet due to the transition from fiber optic to UTP cabling and ever-increasing mass-market production of such a standard interconnect. We also report on a technique for message fragmentation that is commonly exploited to increase the throughput with short messages. When a different, though more widely used, performance metric is considered, such a technique results in a performance loss rather than an improvement.
TL;DR: This paper elaborates a performance analysis of a Myrinet-based cluster by extending Hockney's point-to-point communication model and, based on that extension, Xu and Hwang's collective communication model, and shows that the extended models estimate communication performance better than the previous models.
Abstract: In recent years, there has been a growing interest in the cluster system as an accepted form of supercomputing, due to its high performance at an affordable cost. This paper attempts to elaborate a performance analysis of a Myrinet-based cluster. The communication performance and the effect of background load on parallel applications were analyzed. For point-to-point communication, it was found that an extension to Hockney's model was required to estimate the performance. The proposed model suggested that two ranges should be used for the performance metrics to cope with the cache effect. Moreover, based on the extension of the point-to-point communication model, Xu and Hwang's model for collective communication performance was also extended. Results showed that our models can make better estimates of the communication performance than the previous models. Finally, the interference of other user processes with the cluster system is evaluated by using synthetic background load generation programs.
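As background for the point-to-point discussion above, Hockney's model expresses the time to transfer an m-byte message through a startup latency t_0 and an asymptotic bandwidth r_∞. The second expression below is only one plausible reading of a "two ranges" extension, written as a piecewise model with separate parameters on either side of a threshold; the threshold m_c and the superscripted parameters are assumed symbols, and the exact form used in the paper is not given in the abstract.

```latex
% Hockney's point-to-point model: time to send an m-byte message
T(m) = t_0 + \frac{m}{r_\infty}, \qquad n_{1/2} = t_0 \, r_\infty

% One possible two-range form (m_c and the per-range parameters are assumed symbols)
T(m) =
\begin{cases}
  t_0^{(1)} + \dfrac{m}{r_\infty^{(1)}}, & m \le m_c \\[6pt]
  t_0^{(2)} + \dfrac{m}{r_\infty^{(2)}}, & m > m_c
\end{cases}
```

Here n_{1/2} is the message length at which half of the asymptotic bandwidth is achieved.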
TL;DR: An I/O performance prediction mechanism which consists of a performance database and a prediction algorithm to help users better evaluate and schedule their applications and an Application Programming Interface that provides transparent management and access to various storage resources in the computing environment are established.
Abstract: I/O intensive applications have posed great challenges to computational scientists. A major problem of these applications is that users have to sacrifice performance requirements in order to satisfy storage capacity requirements in a conventional computing environment. Further performance improvement is impeded by the physical nature of these storage media even when state-of-the-art I/O optimizations are employed.
In this paper, we present a distributed multi-storage resource architecture, which can satisfy both performance and capacity requirements by employing multiple storage resources. Compared to a traditional single storage resource architecture, our architecture provides a more flexible and reliable computing environment. This architecture can bring new opportunities for high performance computing as well as inherit state-of-the-art I/O optimization approaches that have already been developed. It provides application users with high-performance storage access even when they do not have the availability of a single large local storage archive at their disposal. We also develop an Application Programming Interface (API) that provides transparent management and access to various storage resources in our computing environment. Since I/O usually dominates the performance in I/O intensive applications, we establish an I/O performance prediction mechanism which consists of a performance database and a prediction algorithm to help users better evaluate and schedule their applications. A tool is also developed to help users automatically generate performance data stored in databases. The experiments show that our multi-storage resource architecture is a promising platform for high performance distributed computing.
TL;DR: By adopting the new delegation profile as a kernel, a new trust framework is then proposed to enhance the security verification ability and provide more fine-grained authorizations to mobile agent platforms.
Abstract: In this paper, an instance-oriented security mechanism is proposed to counter possible security threats in grid-based mobile agent systems. The proposed delegation profile allows application systems to define their own security instances, while it provides mechanisms to delegate one’s identity on those instances, instead of on certain hosts as conventional delegation does. This can prevent the delegated host from abusing privileges. By adopting the new delegation profile as a kernel, a new trust framework is then proposed to enhance the security verification ability and provide more fine-grained authorizations to mobile agent platforms.
TL;DR: Preliminary experimental results show that the KM approach yields a greater improvement in the communication of a parallel application when the network background load increases or the computation-to-communication ratio of the application decreases.
Abstract: Over the past few years, cluster/distributed computing has been gaining popularity. The proliferation of cluster/distributed computing is due to the improved performance and increased reliability of these systems. Many parallel programming languages and related parallel programming models have become widely accepted. However, one of the major shortcomings of running parallel applications on cluster/distributed computing environments is the high communication overhead incurred. To reduce the communication overhead, and thus the completion time of a parallel application, this paper describes a simple, efficient and portable Key Message (KM) approach to support parallel computing on cluster/distributed computing environments. To demonstrate the advantage of the KM approach, a prototype runtime system has been implemented and evaluated. Our preliminary experimental results show that the KM approach yields a greater improvement in the communication of a parallel application when the network background load increases or the computation-to-communication ratio of the application decreases.
TL;DR: An implementation of the communication library and a quantitative model that is used to estimate the performance impact of priorities for a typical situation are described, and it is shown that the use of high-priority communication substantially reduces the latency of performance-critical messages over a wide range of network design parameters.
Abstract: Software Distributed Shared Memory (DSM) systems can be used to provide a coherent shared address space on multicomputers and other parallel systems without support for shared memory in hardware. The coherency software automatically translates shared memory accesses to explicit messages exchanged among the nodes in the system. Many applications exhibit good performance on such systems, but it has been shown that, for some applications, performance-critical messages can be delayed behind less important messages because of the enqueuing behavior in the communication libraries used in current systems. We present in this paper a new portable communication library that supports priorities to remedy this situation. We describe an implementation of the communication library and a quantitative model that is used to estimate the performance impact of priorities for a typical situation. Using the model, we show that the use of high-priority communication substantially reduces the latency of performance-critical messages over a wide range of network design parameters. The latency is reduced by up to 10–25% for each delaying low-priority message in the queue ahead.
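The queuing effect described in this abstract can be illustrated with a small sketch. It is not the library presented in the paper: the class name `PriorityMessageQueue`, the two priority levels, and the message payloads are all hypothetical, chosen only to show how a priority-aware send queue lets a performance-critical message overtake less important messages that were enqueued earlier.

```python
import heapq
import itertools

# Hypothetical priority levels: a lower number is served first.
HIGH, LOW = 0, 1


class PriorityMessageQueue:
    """Toy send queue that dequeues by priority, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: preserves arrival order among equal priorities
        # and ensures payloads themselves are never compared.
        self._seq = itertools.count()

    def send(self, payload, priority=LOW):
        heapq.heappush(self._heap, (priority, next(self._seq), payload))

    def receive(self):
        _priority, _seq, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self):
        return len(self._heap)


# Two bulk messages are queued before a critical one arrives.
queue = PriorityMessageQueue()
queue.send("page-diff-1")
queue.send("page-diff-2")
queue.send("lock-grant", priority=HIGH)

delivery_order = [queue.receive() for _ in range(3)]
```

With a plain FIFO queue the critical message would wait behind both bulk messages. Here it is delivered first, while the bulk messages keep their relative order.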
TL;DR: The equivalent tree network methodology presented in this paper is more general than the earlier results, because this approach can solve the scheduling problem even in a heterogeneous linear network.
Abstract: In this paper, divisible load scheduling in a linear network of processors is presented. The cases of processing load originating at the boundary and also at the interior of the network are considered. An equivalent tree network for the given linear network is derived. Using this equivalent tree network, we prove all the results obtained in the earlier studies. The equivalent tree network methodology presented in this paper is more general than the earlier results, because with this approach we can solve the scheduling problem even in a heterogeneous linear network. The earlier studies considered only homogeneous linear networks.
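As a rough illustration of divisible load scheduling on a tree, the sketch below computes load fractions for a heterogeneous single-level tree (star) under the standard equal-finish-time condition. It is not the paper's equivalent-tree derivation. The model is an assumption made for the example: the root only distributes load, it sends to one child at a time, and each child starts computing once it has received its whole share. All names (`star_load_fractions`, `finish_times`, `w`, `z`, `t_cp`, `t_cm`) are hypothetical.

```python
def star_load_fractions(w, z, t_cp=1.0, t_cm=1.0):
    """Return load fractions that make all children finish at the same time.

    w[i]: inverse computing speed of child i (time per unit load, scaled by t_cp)
    z[i]: inverse link speed to child i (time per unit load, scaled by t_cm)
    Children are served in index order by a root that does no computing.
    """
    # Equal finish times of consecutive children give the recursion
    #   alpha[i] * w[i] * t_cp = alpha[i+1] * (z[i+1] * t_cm + w[i+1] * t_cp)
    ratios = [1.0]
    for i in range(1, len(w)):
        ratios.append(ratios[-1] * w[i - 1] * t_cp / (z[i] * t_cm + w[i] * t_cp))
    total = sum(ratios)
    # Normalize so the fractions sum to the whole (unit) load.
    return [r / total for r in ratios]


def finish_times(alpha, w, z, t_cp=1.0, t_cm=1.0):
    """Finish time of each child under sequential distribution from the root."""
    times = []
    elapsed = 0.0  # time at which the root finishes sending to the current child
    for a, wi, zi in zip(alpha, w, z):
        elapsed += a * zi * t_cm
        times.append(elapsed + a * wi * t_cp)
    return times


# Heterogeneous example: three children with different speeds and links.
w = [1.0, 2.0, 3.0]
z = [0.5, 0.2, 0.1]
alpha = star_load_fractions(w, z)
times = finish_times(alpha, w, z)
```

Under these assumptions every entry of `times` is the same up to rounding, which is the optimality condition usually imposed in divisible load theory.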
TL;DR: A model designed for interactive simulation of cluster-based asynchronous soft real-time systems such as the Jambala platform of Ericsson is presented, and the flexibility of the simulation tool is demonstrated on a number of “what-if” scenarios that also pinpoint some important features of such clusters.
Abstract: In this paper, we present a model designed for interactive simulation of cluster-based asynchronous soft real-time systems such as the Jambala platform of Ericsson. To build the simulator, we selected PlasmaCORE – Ericsson's proprietary simulation framework that supports a wide range of run-time modifications to the simulated system. Based on this choice, the information model of the system was developed and a prototype tool was implemented. We describe the essential features of the model and subsequently, we demonstrate the feasibility of the tool by presenting the benchmark results that compare the model-based simulation results with measurements taken on a real Jambala cluster. The flexibility of the simulation tool is demonstrated on a number of “what-if” scenarios that also pinpoint some important features of such clusters.
TL;DR: Mac clustering is becoming the technology that will move parallel computing into the mainstream, and the ongoing dissemination of OS X, a Unix-based Mac OS, is providing the best tools of the Mac and Unix in one computing solution.
Abstract: At UCLA's Plasma Physics Group, to achieve accessible computational power for our research goals, we developed the tools to build numerically-intensive parallel computing clusters on the Macintosh platform. Our technology maximizes productivity because it is designed to allow the user, without expertise in the operating system, to most efficiently develop and run parallel code, enabling the most effective advancement of scientific research. Collaborating with USC and NASA’s JPL, our team has demonstrated the performance and scalability potential of Mac clusters by achieving over 217 Gigaflops on 33 XServes and over 233 Gigaflops on 76 Power Mac G4s. But we find that the usability and reliability of the technology is as important as its performance. The ongoing dissemination of OS X, a Unix-based Mac OS, is providing the best tools of the Mac and Unix in one computing solution. With this development, Mac clustering is becoming the technology that will move parallel computing into the mainstream. See: http://exodus.physics.ucla.edu/appleseed/ and http://daugerresearch.com/
TL;DR: This work uses an existing, large-scale hardware cache-coherent system with 64 processors to emulate a complete future cluster, and finds that system emulation is invaluable in quantifying potential benefits from changes in the technology of commodity components and reveals potential problems in future systems that are easily overlooked in simulation studies.
Abstract: Recently much effort has been spent on providing a shared address space abstraction on clusters of small-scale symmetric multiprocessors. However, advances in technology will soon make it possible to construct these clusters with larger-scale cc-NUMA nodes, connected with non-coherent networks that offer latencies and bandwidth comparable to interconnection networks used in hardware cache-coherent systems. The shared memory abstraction can be provided on these systems in software across nodes and hardware within nodes.
Recent simulation results have demonstrated that certain features of modern system area networks can be used to greatly reduce shared virtual memory (SVM) overheads [5,19]. In this work we leverage these results and use detailed system emulation to investigate building future software shared memory clusters. We use an existing, large-scale hardware cache-coherent system with 64 processors to emulate a complete future cluster. We port our existing infrastructure (communication layer and shared memory protocol) to this system and study the behavior of a set of real applications. We present results for both 32- and 64-processor system configurations.
We find that: (i) System emulation is invaluable in quantifying potential benefits from changes in the technology of commodity components. More importantly, it reveals potential problems in future systems that are easily overlooked in simulation studies. Thus, system emulation should be used along with other modeling techniques (e.g., simulation, implementation) to investigate future trends. (ii) Our work shows that current SVM protocols can only partially take advantage of faster interconnects and wider nodes due to operating system and architectural implications. We quantify the related issues and identify the areas where more research is required for future SVM clusters.