TL;DR: This book surveys parallel algorithms and architectures for arrays and trees, meshes of trees, and hypercubes and related networks, covering sorting, integer arithmetic, matrix and graph algorithms, packet routing, and systolic conversion.
Abstract: Preface Acknowledgments Notation 1 Arrays and Trees 1.1 Elementary Sorting and Counting 1.1.1 Sorting on a Linear Array Assessing the Performance of the Algorithm Sorting N Numbers with Fewer Than N Processors 1.1.2 Sorting in the Bit Model 1.1.3 Lower Bounds 1.1.4 A Counterexample-Counting 1.1.5 Properties of the Fixed-Connection Network Model 1.2 Integer Arithmetic 1.2.1 Carry-Lookahead Addition 1.2.2 Prefix Computations-Segmented Prefix Computations 1.2.3 Carry-Save Addition 1.2.4 Multiplication and Convolution 1.2.5 Division and Newton Iteration 1.3 Matrix Algorithms 1.3.1 Elementary Matrix Products 1.3.2 Algorithms for Triangular Matrices 1.3.3 Algorithms for Tridiagonal Matrices -Odd-Even Reduction -Parallel Prefix Algorithms 1.3.4 Gaussian Elimination 1.3.5 Iterative Methods -Jacobi Relaxation -Gauss-Seidel Relaxation Finite Difference Methods -Multigrid Methods 1.4 Retiming and Systolic Conversion 1.4.1 A Motivating Example-Palindrome Recognition 1.4.2 The Systolic and Semisystolic Model of Computation 1.4.3 Retiming Semisystolic Networks 1.4.4 Conversion of a Semisystolic Network into a Systolic Network 1.4.5 The Special Case of Broadcasting 1.4.6 Retiming the Host 1.4.7 Design by Systolic Conversion-A Summary 1.5 Graph Algorithms 1.5.1 Transitive Closure 1.5.2 Connected Components 1.5.3 Shortest Paths 1.5.4 Breadth-First Spanning Trees 1.5.5 Minimum Weight Spanning Trees 1.6 Sorting Revisited 1.6.1 Odd-Even Transposition Sort on a Linear Array 1.6.2 A Simple Root-N(log N + 1)-Step Sorting Algorithm 1.6.3 A (3 Root- N + o(Root-N))-Step Sorting Algorithm 1.6.4 A Matching Lower Bound 1.7 Packet Routing 1.7.1 Greedy Algorithms 1.7.2 Average-Case Analysis of Greedy Algorithms -Routing N Packets to Random Destinations -Analysis of Dynamic Routing Problems 1.7.3 Randomized Routing Algorithms 1.7.4 Deterministic Algorithms with Small Queues 1.7.5 An Off-line Algorithm 1.7.6 Other Routing Models and Algorithms 1.8 Image Analysis and Computational 
Geometry 1.8.1 Component-Labelling Algorithms -Levialdi's Algorithm -An O(Root-N)-Step Recursive Algorithm 1.8.2 Computing Hough Transforms 1.8.3 Nearest-Neighbor Algorithms 1.8.4 Finding Convex Hulls 1.9 Higher-Dimensional Arrays 1.9.1 Definitions and Properties 1.9.2 Matrix Multiplication 1.9.3 Sorting 1.9.4 Packet Routing 1.9.5 Simulating High-Dimensional Arrays on Low-Dimensional Arrays 1.10 Problems 1.11 Bibliographic Notes 2 Meshes of Trees 2.1 The Two-Dimensional Mesh of Trees 2.1.1 Definition and Properties 2.1.2 Recursive Decomposition 2.1.3 Derivation from $K_{N,N}$ 2.1.4 Variations 2.1.5 Comparison With the Pyramid and Multigrid 2.2 Elementary O(log N)-Step Algorithms 2.2.1 Routing 2.2.2 Sorting 2.2.3 Matrix-Vector Multiplication 2.2.4 Jacobi Relaxation 2.2.5 Pivoting 2.2.6 Convolution 2.2.7 Convex Hull 2.3 Integer Arithmetic 2.3.1 Multiplication 2.3.2 Division and Chinese Remaindering 2.3.3 Related Problems -Iterated Products -Root Finding 2.4 Matrix Algorithms 2.4.1 The Three-Dimensional Mesh of Trees 2.4.2 Matrix Multiplication 2.4.3 Inverting Lower Triangular Matrices 2.4.4 Inverting Arbitrary Matrices -Csanky's Algorithm -Inversion by Newton Iteration 2.4.5 Related Problems 2.5 Graph Algorithms 2.5.1 Minimum-Weight Spanning Trees 2.5.2 Connected Components 2.5.3 Transitive Closure 2.5.4 Shortest Paths 2.5.5 Matching Problems 2.6 Fast Evaluation of Straight-Line Code 2.6.1 Addition and Multiplication Over a Semiring 2.6.2 Extension to Codes with Subtraction and Division 2.6.3 Applications 2.7 Higher-Dimensional Meshes of Trees 2.7.1 Definitions and Properties 2.7.2 The Shuffle-Tree Graph 2.8 Problems 2.9 Bibliographic Notes 3 Hypercubes and Related Networks 3.1 The Hypercube 3.1.1 Definitions and Properties 3.1.2 Containment of Arrays -Higher-Dimensional Arrays -Non-Power-of-2 Arrays 3.1.3 Containment of Complete Binary Trees 3.1.4 Embeddings of Arbitrary Binary Trees -Embeddings with Dilation 1 and Load O(M over N + log N) -Embeddings with Dilation 
O(1) and Load O (M over N + 1) -A Review of One-Error-Correcting Codes -Embedding Plog N into Hlog N 3.1.5 Containment of Meshes of Trees 3.1.6 Other Containment Results 3.2 The Butterfly, Cube-Connected-Cycles , and Benes Network 3.2.1 Definitions and Properties 3.2.2 Simulation of Arbitrary Networks 3.2.3 Simulation of Normal Hypercube Algorithms 3.2.4 Some Containment and Simulation Results 3.3 The Shuffle-Exchange and de Bruijn Graphs 3.3.1 Definitions and Properties 3.3.2 The Diaconis Card Tricks 3.3.3 Simulation of Normal Hypercube Algorithms 3.3.4 Similarities with the Butterfly 3.3.5 Some Containment and Simulation Results 3.4 Packet-Routing Algorithms 3.4.1 Definitions and Routing Models 3.4.2 Greedy Routing Algorithms and Worst-Case Problems 3.4.3 Packing, Spreading, and Monotone Routing Problems -Reducing a Many-to-Many Routing Problem to a Many-to-One Routing Problem -Reducing a Routing Problem to a Sorting Problem 3.4.4 The Average-Case Behavior of the Greedy Algorithm -Bounds on Congestion -Bounds on Running Time -Analyzing Non-Predictive Contention-Resolution Protocols 3.4.5 Converting Worst-Case Routing Problems into Average-Case Routing Problems -Hashing -Randomized Routing 3.4.6 Bounding Queue Sizes -Routing on Arbitrary Levelled Networks 3.4.7 Routing with Combining 3.4.8 The Information Dispersal Approach to Routing -Using Information Dispersal to Attain Fault-Tolerance -Finite Fields and Coding Theory 3.4.9 Circuit-Switching Algorithms 3.5 Sorting 3.5.1 Odd-Even Merge Sort -Constructing a Sorting Circuit with Depth log N(log N +1)/2 3.5.2 Sorting Small Sets 3.5.3 A Deterministic O(log N log log N)-Step Sorting Algorithm 3.5.4 Randomized O(log N)-Step Sorting Algorithms -A Circuit with Depth 7.45 log N that Usually Sorts 3.6 Simulating a Parallel Random Access Machine 3.6.1 PRAM Models and Shared Memories 3.6.2 Randomized Simulations Based on Hashing 3.6.3 Deterministic Simulations using Replicated Data 3.6.4 Using Information Dispersal to 
Improve Performance 3.7 The Fast Fourier Transform 3.7.1 The Algorithm 3.7.2 Implementation on the Butterfly and Shuffle-Exchange Graph 3.7.3 Application to Convolution and Polynomial Arithmetic 3.7.4 Application to Integer Multiplication 3.8 Other Hypercubic Networks 3.8.1 Butterflylike Networks -The Omega Network -The Flip Network -The Baseline and Reverse Baseline Networks -Banyan and Delta Networks -k-ary Butterflies 3.8.2 De Bruijn-Type Networks -The k-ary de Bruijn Graph -The Generalized Shuffle-Exchange Graph 3.9 Problems 3.10 Bibliographic Notes Bibliography Index Lemmas, Theorems, and Corollaries Author Index Subject Index
TL;DR: A new hypercube-type structure, the folded hypercube (FHC), which is basically a standard hypercube with some extra links established between its nodes, is proposed and analyzed, and it is shown that this structure offers substantial improvement over existing hypercube-type networks in average distance, message traffic density, and communication time delay.
Abstract: A new hypercube-type structure, the folded hypercube (FHC), which is basically a standard hypercube with some extra links established between its nodes, is proposed and analyzed. The hardware overhead is almost 1/n, n being the dimensionality of the hypercube, which is negligible for large n. For this new design, optimal routing algorithms are developed and proven to be remarkably more efficient than those of the conventional n-cube. For one-to-one communication, each node can reach any other node in the network in at most $\lceil n/2 \rceil$ hops (each hop corresponds to the traversal of a single link), as opposed to n hops in the standard hypercube. One-to-all communication (broadcasting) can also be performed in only $\lceil n/2 \rceil$ steps, yielding a 50% improvement in broadcasting time over that of the standard hypercube. All routing algorithms are simple and easy to implement. Correctness proofs for the algorithms are given. For the proposed architecture, communication parameters such as average distance, message traffic density, and communication time delay are derived. In addition, some fault tolerance capabilities of this architecture are quantified and compared to those of the standard cube. It is shown that this structure offers substantial improvement over existing hypercube-type networks in terms of the above-mentioned network parameters.
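The hop bound (n/2, rounded up for odd n) can be checked numerically with a small sketch. It assumes the usual folded-hypercube construction, in which each node gains one extra link to the node whose address is its bitwise complement; the abstract does not spell this out, and the function names are illustrative only.

```python
from collections import deque


def hypercube_neighbors(node, n):
    """Neighbors in the standard n-cube: flip exactly one address bit."""
    return [node ^ (1 << i) for i in range(n)]


def folded_hypercube_neighbors(node, n):
    """Neighbors in the folded hypercube (assumed construction):
    the n ordinary links plus one link to the complement address."""
    mask = (1 << n) - 1
    return hypercube_neighbors(node, n) + [node ^ mask]


def eccentricity(start, n, neighbors_fn):
    """Largest breadth-first-search distance from `start`.
    Both graphs are vertex-transitive, so this equals the diameter."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbors_fn(u, n):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


if __name__ == "__main__":
    for n in range(2, 9):
        print(n,
              eccentricity(0, n, hypercube_neighbors),
              eccentricity(0, n, folded_hypercube_neighbors))
```

Breadth-first search is used here only as a verification tool; it is not the routing algorithm developed in the paper.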
TL;DR: A new interconnection structure is proposed as a basis for distributed-memory parallel computer architectures that is a variation of the hypercube and preserves many of its desirable properties, including regularity and large vertex connectivity.
Abstract: A new interconnection structure is proposed as a basis for distributed-memory parallel computer architectures. The network is a variation of the hypercube and preserves many of its desirable properties, including regularity and large vertex connectivity. It has the same node and link complexity, but has a diameter only about half of the hypercube's. Some of the basic properties of this topology are discussed. Efficient routing and broadcasting algorithms are presented.
TL;DR: A hypercube with extra connections added between pairs of nodes through otherwise unused links is investigated and achieves noticeable improvement in diameter, mean internode distance, and traffic density.
Abstract: A hypercube with extra connections added between pairs of nodes through otherwise unused links is investigated. The extra connections are made in a way that maximizes the improvement of the performance measure of interest under various traffic distributions. The resulting hypercube, called the enhanced hypercube, requires a simple routing algorithm and is guaranteed not to create any traffic-congested points or links. The enhanced hypercube achieves noticeable improvement in diameter, mean internode distance, and traffic density, and it also is more cost effective than a regular hypercube. An efficient broadcast algorithm that can considerably speed up the broadcast process in enhanced hypercubes is provided.
TL;DR: It is shown that by exchanging any two independent edges in any shortest cycle of the n-cube, its diameter decreases by one unit, which leads to the definition of a new class of n-regular graphs, denoted $TQ_n$, with $2^n$ vertices and diameter $n-1$, which has the (n-1)-cube as a subgraph.
Abstract: It is shown that by exchanging any two independent edges in any shortest cycle of the n-cube ($n \geq 3$), its diameter decreases by one unit. This leads to the definition of a new class of n-regular graphs, denoted $TQ_n$, with $2^n$ vertices and diameter $n-1$, which has the (n-1)-cube as a subgraph. Other properties of $TQ_n$, such as connectivity and the lengths of the disjoint paths, are also investigated. Moreover, it is shown that the complete binary tree on $2^n - 1$ vertices, which is not a subgraph of the n-cube, is a subgraph of $TQ_n$. How these results can be used to enhance hypercube multiprocessors is discussed.
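The edge-exchange step can be illustrated with a short sketch. It picks one particular shortest cycle (the 4-cycle 0-1-3-2-0) and one particular pair of independent edges on it, which is an arbitrary choice made here for illustration; the helper names are hypothetical.

```python
from collections import deque


def hypercube_edges(n):
    """Edge set of the n-cube: nodes are 0..2**n-1, edges join addresses
    that differ in exactly one bit."""
    return {frozenset((u, u ^ (1 << i)))
            for u in range(1 << n) for i in range(n)}


def twist(edges):
    """Exchange the independent edges {0,1} and {2,3} of the 4-cycle
    0-1-3-2-0 for the 'diagonal' edges {0,3} and {1,2}."""
    twisted = set(edges)
    twisted -= {frozenset((0, 1)), frozenset((2, 3))}
    twisted |= {frozenset((0, 3)), frozenset((1, 2))}
    return twisted


def diameter(n, edges):
    """Diameter by breadth-first search from every node."""
    adj = {u: [] for u in range(1 << n)}
    for edge in edges:
        u, v = tuple(edge)
        adj[u].append(v)
        adj[v].append(u)
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    for n in (3, 4):
        cube = hypercube_edges(n)
        print(n, diameter(n, cube), diameter(n, twist(cube)))
```

Every node keeps degree n after the exchange, since each of the four affected nodes loses one link and gains one.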
TL;DR: Three particular algorithms are described: replicated data (RD), systolic loop (SLS-G), and parallelised link-cells (PLC), all of which have good load balancing; the performance characteristics of each algorithm and the factors affecting their scaling properties are discussed.
TL;DR: It is shown that the problem of determining whether an n-cube with an arbitrary number of link faults has a Hamiltonian circuit is NP-complete.
Abstract: The problem of finding Hamiltonian circuits in faulty hypercubes is explored. There are many different Hamiltonian circuits in a nonfaulty hypercube. The question of interest here is the following: if a certain number of links are removed from the hypercube, will a Hamiltonian circuit still exist? In partial answer to this question are the following results. First, it is shown that for any n-cube $( n\geqq 3 )$ with $\leqq 2n - 5$ link faults in which each node is incident to at least two nonfaulty links, there exists a Hamiltonian circuit consisting of only nonfaulty links. Since as will be shown, there exists an n-cube with $2n - 4$ faulty links, in which each node is incident to at least two nonfaulty links, for which there is no Hamiltonian circuit, this result is optimal. Second, it is shown that the problem of determining whether an n-cube with an arbitrary number of link faults has a Hamiltonian circuit is NP-complete.
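For the fault-free case, one well-known Hamiltonian circuit is the binary reflected Gray code. The sketch below builds it and checks it; it does not implement the paper's construction for faulty cubes, and the function names are illustrative.

```python
def gray_code_circuit(n):
    """Visit all 2**n nodes of the n-cube in reflected Gray code order."""
    return [i ^ (i >> 1) for i in range(1 << n)]


def is_hamiltonian_circuit(seq, n):
    """True if `seq` visits every node once and consecutive nodes
    (including last -> first) differ in exactly one address bit."""
    if sorted(seq) != list(range(1 << n)):
        return False
    for a, b in zip(seq, seq[1:] + seq[:1]):
        diff = a ^ b
        if diff == 0 or diff & (diff - 1):  # not a single set bit
            return False
    return True


def uses_faulty_link(seq, faulty_links):
    """True if the circuit traverses any link in `faulty_links`,
    given as a set of frozenset({u, v}) pairs."""
    return any(frozenset((a, b)) in faulty_links
               for a, b in zip(seq, seq[1:] + seq[:1]))


if __name__ == "__main__":
    circuit = gray_code_circuit(3)
    print(circuit, is_hamiltonian_circuit(circuit, 3))
```

A fixed circuit like this one fails as soon as one of its own links is faulty, which is why the faulty case needs the more careful argument the abstract describes.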
TL;DR: In this article, a parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed $\log_2 P$ on P processors is discussed.
Abstract: A parallel simulated annealing algorithm that is problem-independent, maintains the serial decision sequence, and obtains speedup which can exceed $\log_2 P$ on P processors is discussed. The algorithm achieves parallelism by using the concurrency technique of speculative computation. Implementation of the parallel algorithm on a hypercube multiprocessor and application to a task assignment problem are described. The simulated annealing solutions are shown to be, on average, 28% better than the solutions produced by a random task assignment algorithm and 2% better than the solutions produced by a heuristic.
TL;DR: By rearranging some of the connections in the hypercube, the Mobius cubes are obtained which have smaller distances (as measured in communication links) between processors.
Abstract: The Mobius cubes are hypercube variants that give better performance with the same number of links and processors. We show that the diameter of the Mobius cubes is about one half the diameter of the equivalent hypercube, and that the average number of steps between processors for a Mobius cube is about two-thirds of the average for a hypercube. We give an efficient routing algorithm for the Mobius cubes. This routing algorithm finds a shortest path and operates in time proportional to the dimension of the cube. We also give efficient broadcast algorithms for the Mobius cubes. We show that the Mobius cubes contain ring networks and other networks. We report results of simulation studies on the dynamic message-passing performance of the hypercube, the Twisted Cube of P.A.J. Hilbers et al. (1987), and the Mobius cubes. Our results are in agreement with S. Abraham (1990), showing that the Twisted Cube has worse dynamic performance than the hypercube, but our results show that the 1-Mobius cube has dynamic performance superior to that of the hypercube. This contradicts current literature, which implies that twisted cube variants will have worse dynamic performance.
TL;DR: In this article, a Modified Hypercube (MH) topology is described, which adds communication links between the most distant nodes of a classic hypercube.
Abstract: A hypercube system that has been modified by adding communication links between the most distant nodes of a classic hypercube topology is described herein. This improved topology is termed a Modified Hypercube topology. It contains extra links, each of which connects a node to the node that requires the greatest number of nodal hops over the shortest path. Stated another way, the node reached by traversing the greatest number of nodes along the shortest path from an originating node is that node's most distant processor node. If Hamming distance is used in the system, an extra link is added between two nodes having the greatest Hamming distance. Such a system makes a technological trade-off: it reduces the diameter of a classic hypercube at the cost of incrementally increasing the number of I/O ports at each node. This trade-off has been recognized in the industry as advantageous, since a great gain in performance is achieved in exchange for an incremental impact on the hardware. The performance advantages of the present invention grow as the number of nodes in the hypercube grows and the maximum distance between nodes increases.
TL;DR: It is shown that the free list policy is optimal in a static environment, as are the other policies, and it also gives better subcube recognition ability compared to the previous schemes in a dynamic environment.
Abstract: An efficient processor allocation policy is presented for hypercube computers. The allocation policy is called free list since it maintains a list of free subcubes available in the system. An incoming request of dimension k ($2^k$ nodes) is allocated by finding a free subcube of dimension k or by decomposing an available subcube of dimension greater than k. This free list policy uses a top-down allocation rule in contrast to the bottom-up approach used by the previous bit-map allocation algorithms. This allocation scheme is compared to the buddy, Gray code (GC), and modified buddy allocation policies reported for the hypercubes. It is shown that the free list policy is optimal in a static environment, as are the other policies, and it also gives better subcube recognition ability compared to the previous schemes in a dynamic environment. The performance of this policy, in terms of parameters such as average delay, system utilization, and time complexity, is compared to the other schemes to demonstrate its effectiveness. The extension of the algorithm for parallel implementation, noncubic allocation, and inclusion/exclusion allocation is also given.
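The top-down allocation rule can be sketched as follows. This is a minimal illustration rather than the paper's algorithm: subcubes are written as strings over `0`, `1`, and `*`, the split always fixes the leftmost `*`, and deallocation (with the merging it requires) is left out. All of these are assumptions made here, as is the class name.

```python
class FreeListAllocator:
    """Free-list subcube allocation for an n-cube (allocation only)."""

    def __init__(self, n):
        self.n = n
        # free[j] holds the currently free subcubes of dimension j.
        self.free = [[] for _ in range(n + 1)]
        self.free[n].append("*" * n)  # initially the whole cube is free

    def allocate(self, k):
        """Return a free subcube of dimension k, or None if none exists."""
        # Prefer an exact fit; otherwise take the smallest larger subcube.
        for j in range(k, self.n + 1):
            if self.free[j]:
                cube = self.free[j].pop()
                break
        else:
            return None
        # Top-down decomposition: halve until the dimension is k,
        # returning the unused half to the free list at each step.
        while cube.count("*") > k:
            i = cube.index("*")
            unused_half = cube[:i] + "1" + cube[i + 1:]
            self.free[unused_half.count("*")].append(unused_half)
            cube = cube[:i] + "0" + cube[i + 1:]
        return cube


if __name__ == "__main__":
    allocator = FreeListAllocator(3)
    print(allocator.allocate(1))  # a 2-node subcube
    print(allocator.allocate(2))  # a 4-node subcube
```

Keeping one list per dimension makes the exact-fit lookup a constant-time check per dimension, at the cost of extra bookkeeping when subcubes are released.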
TL;DR: A graph-based solution to the mapping problem using the simulated annealing optimization heuristic is developed and implemented using the hypercube as a host architecture, and results for several image graphs are presented.
Abstract: A graph-based solution to the mapping problem using the simulated annealing optimization heuristic is developed. An automated two-phase mapping strategy is formulated: process annealing assigns parallel processes to processing nodes, and connection annealing schedules traffic connections on network data links so that interprocess communication conflicts are minimized. To evaluate the quality of generated mappings, cost functions suitable for simulated annealing that accurately quantify communications overhead are derived. Communication efficiency is formulated to measure the quality of assignments when the optimal mapping is unknown. The mapping scheme is implemented using the hypercube as a host architecture, and results for several image graphs are presented.
TL;DR: The main thrust is to explore the match between the algorithms, their implementation, and the machine architectures, and to present various considerations together with the results.
Abstract: Both the very dishonest Newton (VDHN) and the successive over relaxed (SOR) Newton algorithms have been implemented on the iPSC/2 and Alliant FX/8 computers for power system dynamic simulation using complex generator and nonlinear load models. The main thrust is to explore the match between the algorithms, their implementation, and the machine architectures. For example, the less parallel but sequentially faster VDHN runs faster on the hypercube (iPSC/2) whereas the more parallel SOR-Newton requires data sharing more often because of the extra iterations and does better on the Alliant. The implementation on the hypercube requires significant manual programming to schedule the processors and their communication whereas the compiler in the Alliant recognizes parallel steps but only if the software is properly coded. The authors present these various considerations together with the results.
TL;DR: A general theory for modeling and designing fault-tolerant multiprocessor systems in a systematic and efficient manner is presented and the resulting designs are shown to be far superior to those proposed in previous work.
TL;DR: A high-efficiency medium-grained parallel spectral element method for numerical solution of the Stokes problem in general domains and the performance of this algorithm-architecture coupling is evaluated in a technical and economic framework that reflects the true advantages of parallel solution of partial differential equations.
TL;DR: This paper aims to demonstrate the efforts towards in-situ applicability of EMMARM, which aims to provide real-time information about the response of the immune system to natural disasters.
Abstract: Mathematics Department and 07960 Laboratory for Computer Science Massachusetts Institute of Technology
TL;DR: In this article, the authors use the isoefficiency metric to analyze the scalability of parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph, and find that the classic trade-offs of hardware cost vs. scalability and memory vs. time are represented here.
TL;DR: A class of algorithms for finding the global minimum of a continuous-variable function defined on a hypercube, based on both diffusion processes and simulated annealing, are presented, and it is shown that “learning” in these networks can be achieved by a set of three interconnected diffusion machines.
Abstract: The first purpose of this paper is to present a class of algorithms for finding the global minimum of a continuous-variable function defined on a hypercube. These algorithms, based on both diffusion processes and simulated annealing, are implementable as analog integrated circuits. Such circuits can be viewed as generalizations of neural networks of the Hopfield type, and are called "diffusion machines." Our second objective is to show that "learning" in these networks can be achieved by a set of three interconnected diffusion machines: one that learns, one to model the desired behavior, and one to compute the weight changes.
TL;DR: It is proved that the average delay $T$ per packet satisfies $T \le dp/(1-\rho)$, thus showing that an average delay of $O(d)$ is attainable for any fixed load factor $\rho < 1$; the analysis is based on a stochastic comparison with a product-form network.
Abstract: We analyze the following problem: Each node of the d-dimensional hypercube independently generates packets according to a Poisson process with rate $\lambda$. Each of the packets is to be sent to a randomly chosen destination; each of the nodes at Hamming distance $k$ from a packet's origin is assigned an a priori probability $p^k (1-p)^{d-k}$. Packets are routed under a simple greedy scheme: each of them is forced to cross the hypercube dimensions required in increasing index-order, with possible queueing at the hypercube nodes. Assuming unit packet length and no other communications taking place, we show that this scheme is stable (in steady-state) if $\rho < 1$, where $\rho \stackrel{\text{def}}{=} \lambda p$ is the load factor of the network; this is seen to be the broadest possible range for stability. Furthermore, we prove that the average delay $T$ per packet satisfies $T \le dp/(1-\rho)$, thus showing that an average delay of $O(d)$ is attainable for any fixed $\rho < 1$. We also establish similar results in the context of the butterfly network. Our analysis is based on a stochastic comparison with a product-form network.
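The greedy scheme and the destination distribution can be sketched as below. The sketch covers only path selection and destination sampling; queueing, Poisson arrivals, and delay are not modeled. It reads the destination distribution as each of the d address bits differing from the origin independently with probability p, and the function names are illustrative.

```python
import random


def greedy_route(src, dst, d):
    """Path from src to dst that corrects differing address bits
    in increasing dimension index order."""
    path = [src]
    node = src
    for i in range(d):
        if ((node ^ dst) >> i) & 1:  # dimension i still needs crossing
            node ^= 1 << i
            path.append(node)
    return path


def random_destination(src, d, p, rng):
    """Sample a destination: each bit flips independently with
    probability p, so a node at Hamming distance k has probability
    p**k * (1 - p)**(d - k)."""
    dst = src
    for i in range(d):
        if rng.random() < p:
            dst ^= 1 << i
    return dst


if __name__ == "__main__":
    rng = random.Random(0)
    src = 0b0110
    dst = random_destination(src, 4, 0.5, rng)
    print([format(v, "04b") for v in greedy_route(src, dst, 4)])
```

The number of hops equals the Hamming distance between origin and destination, so the expected path length under this distribution is dp.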
TL;DR: It may be concluded that a fine-grain scheduling scheme is not appropriate for parallel LU factorization using an iPSC hypercube parallel processing computer, and the parallel LU factorization implementation using factorization path scheduling was found to perform significantly better than levelwise scheduling.
Abstract: Two new coarse-grain scheduling schemes, the levelwise and factorization path scheduling schemes, are examined. These schemes differ significantly from fine-grain scheduling schemes which have been proposed in the past. If a fine-grain scheduling scheme at the floating-point-operation level is an appropriate scheduling method for the iPSC hypercube parallel processing computer, then the levelwise scheduling scheme presented should have gain comparable to that obtained using the factorization path scheduling scheme. Since this is not the case, it may be concluded that a fine-grain scheduling scheme is not appropriate for parallel LU factorization using an iPSC hypercube. Furthermore, the parallel LU factorization implementation using factorization path scheduling was found to perform significantly better than levelwise scheduling. The maximum speedup of 2.08 was obtained by using four processors on the 494 bus system. The efficiency at maximum speedup was 52.1%.
TL;DR: The algorithm is adaptive, and it is shown that this is necessary to achieve the logarithmic speedup; the Borodin-Hopcroft lower bound on oblivious routing is generalized by proving that any randomized oblivious algorithm on a polylogarithmic degree network requires at least $\Omega(\log^2 N/\log\log N)$ bit steps with high probability for almost all permutations.
Abstract: In this paper we describe an $O(\log N)$-bit-step randomized algorithm for bit-serial message routing on a hypercube. The result is asymptotically optimal, and improves upon the best previously known algorithms by a logarithmic factor. The result also solves the problem of on-line circuit switching in an $O(1)$-dilated hypercube (i.e., the problem of establishing edge-disjoint paths between the nodes of the dilated hypercube for any one-to-one mapping). Our algorithm is adaptive and we show that this is necessary to achieve the logarithmic speedup. We generalize the Borodin-Hopcroft lower bound on oblivious routing by proving that any randomized oblivious algorithm on a polylogarithmic degree network requires at least $\Omega(\log^2 N/\log\log N)$ bit steps with high probability for almost all permutations.
TL;DR: An optimal algorithm is described for performing the communication defined by exchanging the bits of the node address with those of the local address, a pattern that typically arises in both matrix transposition and bit reversal for the fast Fourier transform.
TL;DR: The third-generation C* compiler for hypercube multicomputers incorporates new optimization and utilizes an improved set of communication primitives, and it allows the programmer to specify a custom mapping of data to the distributed memories of the hypercube.
Abstract: We describe our third-generation C* compiler for hypercube multicomputers. This compiler generates code suitable for execution on both the nCUBE 3200 and the Intel iPSC/2. The compiler incorporates new optimization and utilizes an improved set of communication primitives. It supports a variety of standard domain decomposition primitives, and it also allows the programmer to specify a custom mapping of data to the distributed memories of the hypercube. The performance of this compiler on benchmark programs demonstrates that high efficiency can be achieved executing SIMD code on multicomputer architectures.
TL;DR: In this paper, it was shown that in dimension greater than four, the minimal area hypersurface separating the faces of a hypercube is the cone over the edges of the hypercube, even if the area separating opposite faces is given zero weight.
Abstract: It is shown that in dimension greater than four, the minimal area hypersurface separating the faces of a hypercube is the cone over the edges of the hypercube. This contrasts with the cases of two and three dimensions, where the cone is not minimal. For example, a soap film on a cubical frame has a small rounded square in the center. In dimensions over 6, the cone is minimal even if the area separating opposite faces is given zero weight. The proof uses the maximal flow problem that is dual to the minimal surface problem.
TL;DR: This paper presents a general strategy for finding efficient permutation routes in parallel networks and investigates the use of this algorithm for routing multiple permutations and extends its applicability to a wide class of graphs, including several families of Cayley graphs.
Abstract: In this paper we present a general strategy for finding efficient permutation routes in parallel networks. Among the popular parallel networks to which the strategy applies are mesh networks, hypercube networks, hypercube-derivative networks, ring networks, and star networks. The routes produced are generally congestion-free and take a number of routing steps that is within a small constant factor of the diameter of the network. Our basic strategy is derived from an algorithm that finds (in polynomial time) efficient permutation routes for a product network, $G \times H$, given efficient permutation routes for $G$ and $H$. We investigate the use of this algorithm for routing multiple permutations and extend its applicability to a wide class of graphs, including several families of Cayley graphs. Finally, we show that our approach can be used to find efficient permutation routes among the remaining live nodes in faulty networks.
TL;DR: Three forms of the matrix sign function are implemented and tested on a distributed memory hypercube multiprocessor and performance results indicate that the method is an excellent means of solving large-scale problems on a parallel computer.
Abstract: The matrix sign function is the basis of a parallel algorithm for solving the generalized algebraic Riccati equation. Three forms of the algorithm were implemented and tested on a distributed memory hypercube multiprocessor. Performance results indicate that the method is an excellent means of solving large-scale problems on a parallel computer.
TL;DR: In this article, a unified multiphase algorithm for complete exchange on a hypercube of dimension d and block size m is described, which is applicable to all circuit-switched hypercubes that use the common e-cube routing strategy.
Abstract: On a distributed memory parallel computer, the complete exchange (all-to-all personalized) communication pattern requires each of $n$ processors to send a different block of data to each of the remaining $n - 1$ processors. This pattern is at the heart of many important algorithms, most notably the matrix transpose. For a circuit switched hypercube of dimension $d$ ($n = 2^d$), two algorithms for achieving complete exchange are known. These are (1) the Standard Exchange approach that employs $d$ transmissions of size $2^{d-1}$ blocks each and is useful for small block sizes, and (2) the Optimal Circuit Switched algorithm that employs $2^d - 1$ transmissions of 1 block each and is best for large block sizes. A unified multiphase algorithm is described that includes these two algorithms as special cases. The complete exchange on a hypercube of dimension $d$ and block size $m$ is achieved by carrying out $k$ partial exchanges on subcubes of dimension $d_i$, where $\sum_{i=1}^{k} d_i = d$, with effective block size $m_i = m 2^{d - d_i}$. When $k = d$ and all $d_i = 1$, this corresponds to algorithm (1) above. For the case of $k = 1$ and $d_i = d$, this becomes the circuit switched algorithm (2). Changing the subcube dimensions $d_i$ varies the effective block size and permits a compromise between the data permutation and block transmission overhead of (1) and the startup overhead of (2). For a hypercube of dimension $d$, the number of possible combinations of subcubes is $p(d)$, the number of partitions of the integer $d$. This is an exponential but very slowly growing function and it is feasible over these partitions to discover the best combination for a given message size. The approach was analyzed for, and implemented on, the Intel iPSC-860 circuit switched hypercube. Measurements show good agreement with predictions and demonstrate that the multiphase approach can substantially improve performance for block sizes in the 0 to 160 byte range. 
This range, which corresponds to 0 to 40 floating point numbers per processor, is commonly encountered in practical numeric applications. The multiphase technique is applicable to all circuit-switched hypercubes that use the common e-cube routing strategy.
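The search space of subcube combinations can be enumerated with a short sketch. It lists the partitions of d and, for each phase, the effective block size given in the abstract. The per-phase transmission count of 2^(d_i) - 1 is an assumption made here: it reproduces the two special cases quoted above but is not stated in the abstract for general phases. Function names are illustrative, and no timing model is included.

```python
def partitions(d, largest=None):
    """Yield the partitions of the integer d as non-increasing tuples."""
    if largest is None:
        largest = d
    if d == 0:
        yield ()
        return
    for first in range(min(d, largest), 0, -1):
        for rest in partitions(d - first, first):
            yield (first,) + rest


def phase_plan(d, m, dims):
    """For subcube dimensions `dims` summing to d, return one
    (transmissions, effective_block_size) pair per phase.
    effective_block_size = m * 2**(d - d_i)   (from the abstract)
    transmissions        = 2**d_i - 1          (assumed, see lead-in)"""
    assert sum(dims) == d
    return [(2 ** di - 1, m * 2 ** (d - di)) for di in dims]


if __name__ == "__main__":
    d, m = 4, 1
    for dims in partitions(d):
        print(dims, phase_plan(d, m, dims))
```

Because the number of partitions grows slowly (7 for d = 5, 42 for d = 10), trying every combination against a measured cost model is inexpensive.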
TL;DR: This work examines design alternatives for ordered radix-2 DIF (decimation-in-frequency) FFT algorithms on massively parallel hypercube multiprocessors such as the Connection Machine and combines the order and computational phases of the FFT and also uses sequence to processor maps that reduce communication.
TL;DR: This work gives an efficient algorithm to find the minimum-cost way to evaluate an expression, for several different data parallel architectures, and applies to any architecture in which the metric describing the cost of moving an array has a property the authors call "robustness".
TL;DR: A broadcasting algorithm that disseminates information throughout the whole network in time $a \log n$ with probability exceeding $1 - bn^{-c}$, with positive constants $a$, $b$, $c$ depending on $p$, provided that p
Abstract: The problem of broadcasting information in an n-node hypercube in which links fail independently with fixed probability 0 >