TL;DR: Results of applying optimal and heuristic procedures on a set of benchmark circuits indicate that heuristic methods give fast and near minimal solutions.
Abstract: A report is presented on procedures investigated to determine flip-flops to be scanned in partial-scan designs for sequential circuits. The main idea pursued is to derive a minimal feedback vertex set of the so-called S-graphs. Results of applying optimal and heuristic procedures on a set of benchmark circuits indicate that heuristic methods give fast and near minimal solutions.
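The feedback-vertex-set idea in this abstract can be illustrated with a small greedy heuristic. This is only a sketch under stated assumptions, not the procedure evaluated in the report: the function name `greedy_feedback_vertex_set`, the dictionary graph encoding, and the "largest in-degree times out-degree" selection rule are illustrative choices, and self-loops are treated as cycles that must be broken.

```python
def greedy_feedback_vertex_set(graph):
    """Return a set of vertices whose removal leaves `graph` acyclic.

    `graph` maps each vertex to an iterable of its successors
    (hypothetical encoding of an S-graph: one vertex per flip-flop).
    """
    # Build mutable successor and predecessor sets for every vertex.
    succ = {v: set(ws) for v, ws in graph.items()}
    for ws in list(succ.values()):
        for w in ws:
            succ.setdefault(w, set())
    pred = {v: set() for v in succ}
    for v, ws in succ.items():
        for w in ws:
            pred[w].add(v)

    def remove(v):
        # Delete v and every edge that touches it.
        for w in succ.pop(v):
            if w != v:
                pred[w].discard(v)
        for u in pred.pop(v):
            if u != v:
                succ[u].discard(v)

    chosen = set()
    while succ:
        # A vertex with no predecessors or no successors lies on no cycle.
        trivial = [v for v in succ if not pred[v] or not succ[v]]
        if trivial:
            for v in trivial:
                remove(v)
            continue
        # Otherwise pick a self-loop vertex if one exists, else the vertex
        # with the largest in-degree * out-degree product (assumed heuristic).
        v = max(succ, key=lambda x: (x in succ[x], len(pred[x]) * len(succ[x])))
        chosen.add(v)
        remove(v)
    return chosen
```

The result is always a valid feedback vertex set, because a vertex is discarded without being chosen only when it cannot lie on a cycle of the remaining graph. The set is not guaranteed to be minimal.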
TL;DR: The design of a prototype Fortran D compiler for the iPSC/860, a MIMD distributed-memory machine is presented and issues addressed include data decomposition analysis, guard introduction, communications generation and optimization, program transformations, and storage assignment.
Abstract: Because of the complexity and variety of parallel architectures, an efficient machine-independent parallel programming model is needed to make parallel computing truly usable for scientific programmers. We believe that Fortran D, a version of Fortran enhanced with data decomposition specifications, can provide such a programming model. This paper presents the design of a prototype Fortran D compiler for the iPSC/860, a MIMD distributed-memory machine. Issues addressed include data decomposition analysis, guard introduction, communications generation and optimization, program transformations, and storage assignment. A test suite of scientific programs will be used to evaluate the effectiveness of both the compiler technology and programming model for the Fortran D compiler.
TL;DR: This paper presents the basic features of Vienna Fortran along with a set of examples illustrating the use of these features and presents the advantages of a shared memory programming paradigm while explicitly controlling the placement of data.
Abstract: Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
TL;DR: Distributed algorithms are developed for the quickest path problem in an asynchronous communication network to find paths in N to transmit a given amount of data such that the transmission time is minimized.
Abstract: Let N = (V, A, C, L) be a network with node set V, arc set A, positive arc capacity function C, and nonnegative arc lead time function L. The quickest path problem is to find paths in N to transmit a given amount of data such that the transmission time is minimized. In this paper, distributed algorithms are developed for the quickest path problem in an asynchronous communication network. For the one-source quickest path problem, we present three algorithms that require O(rn2) messages and O(2) time, O(n) messages and O(rn) time, and O1+elog w) messages and O(rn1+elog w) time for any e, 0
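To make the problem statement concrete, here is a sequential sketch, not one of the distributed algorithms of the paper. It assumes the usual definition of transmission time, which the abstract does not spell out: sending sigma units along a path takes the sum of the arc lead times plus sigma divided by the smallest arc capacity on the path. The function names and the `(tail, head, capacity, lead_time)` arc encoding are hypothetical.

```python
import heapq


def shortest_lead_time(arcs, source, target, min_capacity):
    """Dijkstra on lead times, using only arcs with capacity >= min_capacity."""
    adj = {}
    for u, v, cap, lead in arcs:
        if cap >= min_capacity:
            adj.setdefault(u, []).append((v, lead))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, lead in adj.get(u, []):
            nd = d + lead
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")


def quickest_path_time(arcs, source, target, sigma):
    """Minimum of (total lead time + sigma / bottleneck capacity) over paths."""
    best = float("inf")
    # Try each distinct capacity as the assumed bottleneck of the path.
    for c in sorted({cap for _, _, cap, _ in arcs}):
        lead = shortest_lead_time(arcs, source, target, c)
        best = min(best, lead + sigma / c)
    return best
```

For example, with arcs `[("s", "a", 10, 1), ("a", "t", 10, 1), ("s", "t", 1, 0)]`, a small message is quickest on the direct low-capacity arc, while a large one is quickest on the two-arc high-capacity route. The quickest path therefore depends on the amount of data.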
TL;DR: This paper describes a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively, based on LAPACK sequential codes and using a panel-wrapped mapping of matrices to nodes.
Abstract: In this paper, we describe a parallel implementation for the reduction of general and symmetric matrices to Hessenberg and tridiagonal form, respectively. The methods are based on LAPACK sequential codes and use a panel-wrapped mapping of matrices to nodes. Results from experiments on the Intel Touchstone Delta are given.
TL;DR: An implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine is described, which is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors.
Abstract: We describe an implementation of the Cooley-Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make effective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 floating-point processors. The peak rate of FFT computations local to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision. The same FFT routine is used to perform both one- and multi-dimensional FFT without any explicit data rearrangement. The peak performance for a one-dimensional FFT on data distributed over all processors is 5.4 Gflops/s in 32-bit precision and 3.2 Gflops/s in 64-bit precision. The peak performance for square, two-dimensional transforms, is 3.1 Gflops/s in 32-bit precision, and for cubic, three dimensional transforms, the peak is 2.0 Gflops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is P/2N + log2 N for an FFT on P complex points uniformly distributed among N processors. To achieve this level of storage efficiency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
TL;DR: A parallel algorithm is proposed for solving multiextremal multidimensional global optimization problems by applying Peano-type space-filling curves, and conditions that guarantee considerable speedup with respect to the sequential version of the algorithm are established.
Abstract: A parallel algorithm for solving multiextremal multidimensional global optimization problems is proposed. The algorithm is based on reducing multidimensional problems to the one-dimensional ones by applying Peano-type space-filling curves. A new parallel scheme to construct such curves is presented. For reduced optimization problems a parallel global optimization method is constructed. Sufficient conditions of global convergence are investigated. Conditions, which guarantee considerable speedup with respect to the sequential version of the algorithm, are established. Numerical experiments executed on ALLIANT FX/80 are also presented.
TL;DR: This paper outlines two methods which it is believed will play an important role in any distributed memory compiler able to handle sparse and unstructured problems and describes a viable mechanism for tracking and reusing copies of off-processor data.
Abstract: Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods.
TL;DR: Results of various benchmarks and a description of new architectural features are presented and the first completed Cray Y-MP C90 supercomputer was delivered to the customer in February 1992.
Abstract: On November 19, 1991 at Albuquerque's Supercomputing '91, Cray Research introduced its new top of the line, the Cray Y-MP C90 supercomputer. With 16 CPUs, 2 Gbytes (256 Mwords) of central memory, and a new dual-vector pipeline architecture, the system offers peak performance of 16 Gflops, and unequalled sustained ‘real world’ performance. The first completed system was on-line at Supercomputing '91, and delivered to the customer in February 1992. In this paper, results of various benchmarks and a description of new architectural features are presented.
TL;DR: An efficient parallelization of the Generalized Feedback Shift Register (GFSR) algorithm for generating pseudorandom numbers is presented and works on any parallel computer where the number of processors is a power of two and requires the same amount of memory per processor as required by the sequential GFSR algorithm.
Abstract: Running huge simulational computations on a system of parallel processors requires the generation of uniform random sequences on each processor. Various techniques useful for the generation of parallel random sequences are analyzed for their suitability to parallel architectures. An efficient parallelization of the Generalized Feedback Shift Register (GFSR) algorithm for generating pseudorandom numbers is presented. The algorithm works on any parallel computer where the number of processors is a power of two and requires the same amount of memory per processor as required by the sequential GFSR algorithm.
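For orientation, the sequential GFSR recurrence that the paper parallelizes is x[n] = x[n-p] XOR x[n-q]. The sketch below shows only that sequential recurrence; it does not reproduce the paper's parallel scheme. The function name `gfsr`, the way seed words are supplied, and the tiny lags used in the example are illustrative assumptions. Practical generators use much larger lags and carefully chosen seed words.

```python
from collections import deque


def gfsr(seed_words, q, count):
    """Yield `count` words of x[n] = x[n-p] XOR x[n-q], where p = len(seed_words).

    The state is the last p words, which is the per-generator memory cost.
    """
    p = len(seed_words)
    if not 0 < q < p:
        raise ValueError("q must satisfy 0 < q < p")
    # state[0] holds x[n-p]; state[p - q] holds x[n-q].
    state = deque(seed_words, maxlen=p)
    for _ in range(count):
        new = state[0] ^ state[p - q]
        state.append(new)  # maxlen=p silently drops the oldest word
        yield new


# Toy usage with lags p=5, q=2 (illustrative only).
print(list(gfsr([1, 2, 3, 4, 5], q=2, count=3)))  # [5, 7, 6]
```

The state of p words is the memory that the abstract says each processor also needs in the parallel version.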
TL;DR: The new algorithms are compared to the traditional solution paths offered by Eispack, namely tridiagonalization of the band matrix followed by the tridiagonal QR algorithm.
Abstract: Divide and conquer algorithms are formulated for the solution of the eigenvalue problem for symmetric band matrices. The new algorithms are compared to the traditional solution paths offered by Eispack, namely tridiagonalization of the band matrix followed by the tridiagonal QR algorithm.
TL;DR: An asymptotically optimal broadcasting algorithm is given that improves the preceding results; it uses, in the wraparound mesh, the construction of two edge-disjoint spanning trees of minimum depth rooted at a given node.
Abstract: In this paper we give an algorithm to broadcast a message in a wraparound mesh distributed-memory parallel architecture with parallel monodirectional links. This algorithm uses a general strategy based on the diffusion of the message in edge-disjoint spanning trees. We first present in this setting the results of Saad and Schultz and the improvements obtained by Simmen. We then give an asymptotically optimal broadcasting algorithm improving the preceding results. It uses in the wraparound mesh the constructions of two edge-disjoint spanning trees rooted at a given node and of minimum depth.
TL;DR: It is found that communication trees are highly efficient means for sending, receiving and gathering the computed data, especially for large processor numbers.
Abstract: The basic principles of a typical sequential Molecular Dynamics (MD) program suitable for the study of solvated biomolecules are described, the inherent parallelism of MD is analysed and strategies for parallelisation are developed. Due to separate treatment of computation and communication, a high level of portability is achieved and both tasks can be optimized independently. It is found that communication trees are highly efficient means for sending, receiving and gathering the computed data, especially for large processor numbers. A current implementation on a transputer system is presented. Due to the tight memory budget, slight modifications are necessary. Nevertheless, we get excellent performance with an average degree of parallelization of 82%.
TL;DR: An improved algorithm for universal k-selection in hypercubes shows a maximum speedup of O(log k) over the known result for the same problem in the case kp = O(n).
Abstract: This paper presents an improved algorithm for universal k-selection in hypercubes. The algorithm has a worst-case time complexity of O(n/p log p log (kp)/n) for selecting k smallest numbers from n given numbers in a hypercube of p processors (p⩽n). This result shows a maximum speedup of O(log k) over the known result for the same problem in the case kp = O(n).
TL;DR: Experiments show that the new parallel algorithm proposed for solving the 0/1 knapsack problem has a better performance than Chen et al.'s algorithm.
Abstract: A parallel algorithm based on a technique called delayed dominance is proposed for solving the 0/1 knapsack problem. This parallel algorithm is a modification of Chen, Chern and Jang's algorithm. Experiments show that the new algorithm has a better performance than Chen et al.'s algorithm.
TL;DR: This paper presents several cost-optimal parallel algorithms, under different computation models, to find the most vital edge in a weighted graph.
Abstract: Given a weighted graph G, the weight of a spanning tree T, denoted by w(T), is defined as the total weight of all edges in T. A spanning tree T in G is called a minimum spanning tree if w(T)⩽w(T′) for all spanning trees T′ in G. Let w(G) denote the weight of the minimum spanning tree of G if G is connected; otherwise, w(G) = ∞. An edge e is called a most vital edge in G if w(G−e) ⩾ w(G−e′) for every edge e′ of G where G−e′ denotes the partial graph obtained by removing e′ from G. In this paper, we present several cost-optimal parallel algorithms, under different computation models, to find the most vital edge in a weighted graph.
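The definition of a most vital edge can be checked directly with a brute-force sequential baseline: remove each edge in turn and recompute the minimum spanning tree weight. This is not one of the cost-optimal parallel algorithms of the paper. The function names, the `(weight, u, v)` edge encoding, and the choice of Kruskal's algorithm with union-find are illustrative assumptions.

```python
def mst_weight(n, edges):
    """Weight of a minimum spanning tree on vertices 0..n-1, or inf if disconnected.

    `edges` is a list of (weight, u, v) tuples (hypothetical encoding).
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, used = 0, 0
    for w, u, v in sorted(edges):  # Kruskal: scan edges by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used += 1
    return total if used == n - 1 else float("inf")


def most_vital_edge(n, edges):
    """Return (edge, w(G - edge)) for an edge e maximizing w(G - e)."""
    best_edge, best_weight = None, float("-inf")
    for i, e in enumerate(edges):
        w = mst_weight(n, edges[:i] + edges[i + 1:])
        if w > best_weight:
            best_edge, best_weight = e, w
    return best_edge, best_weight
```

On a triangle with edge weights 1, 2 and 3, removing the weight-1 edge raises the spanning tree weight from 3 to 5, so that edge is the most vital. Removing a bridge gives w(G−e) = ∞, matching the convention in the abstract.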
TL;DR: This paper shows a massively parallel architecture specifically designed to support the Boltzmann machine neural network, emphasizing simplicity and reliability together with a low implementation cost.
Abstract: A key task for neural network research is the development of neurocomputers able to speed up the learning algorithms to allow their application and test in real cases. This paper shows a massively parallel architecture specifically designed to support the Boltzmann machine (BM) neural network. The heart of this architecture is its simplicity and reliability together with a low implementation cost. Despite the impressive speedup obtained by accelerating the standard BM algorithm, the architecture does not use particular techniques to expose parallelism in the simulated annealing task, such as the change of state of multiple neurons. Features of the architecture include: (1) speed: the architecture allows a speedup of N (N is the number of neurons constituting the BM) with respect to a standard implementation on sequential machines; (2) low cost: the architecture requires the same amount of memory as a sequential application, and the only additional cost is due to the inclusion of an adder for each neuron; (3) WSI capabilities: the processor interconnection is limited to a single bus for any number of implemented processors, the architecture is scalable in terms of number of processors without any software or hardware modification, and the simplicity of the processors makes it possible to implement built-in self-test techniques; (4) high weight dynamics: the architecture performs computation by using 32-bit integer values, therefore offering a wide range of variability of weights.
TL;DR: An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP) and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer.
Abstract: An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP). This approach seeks to reduce the inter-processor data movements and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer. The resulting data structure is of particular relevance to iterative schemes for solving linear systems. Performance results for matrices taken from well-known Linear Programming (LP) test problems are presented and analysed.
TL;DR: The Touchstone Delta is found to have an asymptotic bandwidth of 6.7 MB/s which is 2.4 times faster than the iPSC/860, but only about a quarter of the advertised rate.
Abstract: The Touchstone Delta is found to have an asymptotic bandwidth of 6.7 MB/s which is 2.4 times faster than the iPSC/860, but only about a quarter of the advertised rate of 25 MB/s. The Delta's measured startup time of 61 μs is very little less than the iPSC value of 76 μs, however unlike the iPSC, it is independent (within the error of measurement) of the separation between nodes.
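The figures quoted here can be combined in a simple linear cost model, time = startup + bytes / asymptotic bandwidth. The sketch below assumes that model and takes 1 MB as 10**6 bytes; both are assumptions not stated in the abstract. The iPSC/860 bandwidth is derived from the quoted ratio (6.7 / 2.4), not measured, and the function name is hypothetical.

```python
def message_time_us(n_bytes, startup_us, bandwidth_mb_s):
    """Estimated transfer time in microseconds under a linear cost model.

    With 1 MB = 10**6 bytes, a bandwidth of 1 MB/s equals 1 byte per microsecond.
    """
    return startup_us + n_bytes / bandwidth_mb_s


DELTA = {"startup_us": 61.0, "bandwidth_mb_s": 6.7}       # figures from the abstract
IPSC = {"startup_us": 76.0, "bandwidth_mb_s": 6.7 / 2.4}  # bandwidth derived from the ratio

for size in (8, 1024, 65536):
    print(size, message_time_us(size, **DELTA), message_time_us(size, **IPSC))
```

Under this model the two machines differ little for short messages, where the startup term dominates. The Delta's advantage approaches the quoted factor of 2.4 only for long messages.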
TL;DR: A comparative analysis is completed on the methods for broadcast elimination by reindexing multistep algorithms at the algorithm representation level and decomposing the algorithm and pipelining and routing dataflow for each step.
Abstract: A comparative analysis is completed on the methods for broadcast elimination. Some authors use approaches to determine the best affine schedule. Another approach is by reindexing multistep algorithms at the algorithm representation level, decomposing the algorithm and pipelining and routing dataflow for each step. Transformations at the algorithm model level and some heuristic approaches are also considered.
TL;DR: Both the sequential and parallel execution time of the algorithm ALLEV (ALL Eigen Values) presented in this paper are considerably shorter than the execution times of the vectorized EISPACK-routine TQL1 which uses the QL method.
Abstract: A method for determining all eigenvalues of large real symmetric tridiagonal matrices on multiprocessor systems with vector facilities is presented. For finding the eigenvalues of a tridiagonal matrix, the method of the Sturm sequence is a standard method. The method uses bisection first to isolate all eigenvalues and then to extract the eigenvalues to a predefined accuracy. For extracting the eigenvalues, bisection is accelerated by a superlinearly convergent zero finder, the Pegasus method. The evaluation of the Sturm sequence is the central component for both isolation and extraction. Some new ideas are presented, such as a method for weighting the values of the characteristic polynomial to avoid under- or overflow, a method for combining the Pegasus method with preceding bisection steps and a vectorization and parallelization strategy over intervals. The method was implemented and the results were measured on a SUPRENUM multiprocessor system with 16 processors and on a CRAY Y-MP8/832 with 8 processors. On the latter machine, both the sequential and parallel execution time of our algorithm ALLEV (ALL Eigen Values) presented in this paper are considerably shorter than the execution times of the vectorized EISPACK-routine TQL1 which uses the QL method.
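The Sturm-sequence bisection that this abstract builds on can be sketched in a few lines. This is a plain sequential version under stated assumptions, not ALLEV: it omits the Pegasus acceleration, the weighting against under- and overflow, and the vectorization over intervals. The function names, the Gershgorin starting interval, and the handling of an exact zero pivot are illustrative choices.

```python
def count_eigenvalues_below(d, e, x):
    """Count eigenvalues smaller than x for the symmetric tridiagonal matrix
    with diagonal d and off-diagonal e (len(e) == len(d) - 1)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        coupling = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - coupling / q
        if q == 0.0:
            q = 1e-300  # nudge an exact zero so the next division is defined
        if q < 0.0:
            count += 1  # each negative pivot marks one eigenvalue below x
    return count


def kth_eigenvalue(d, e, k, tol=1e-12, max_steps=200):
    """Bisection for the k-th smallest eigenvalue (k counted from 0)."""
    n = len(d)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    for _ in range(max_steps):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if count_eigenvalues_below(d, e, mid) > k:
            hi = mid  # at least k + 1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For the 2-by-2 matrix with diagonal `[2, 2]` and off-diagonal `[-1]`, `kth_eigenvalue` converges to 1 for `k=0` and to 3 for `k=1`. Searches for different values of `k` are independent of one another, which is what makes parallelization over intervals natural.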
TL;DR: This study shows that efficient memory use, both in terms of shared memory and cache utilization, is the key to optimal performance when dealing with memory hierarchies such as in the TC2000.
Abstract: The implementation of an efficient hybrid parallel block LU decomposition procedure for dense systems on a BBN TC2000 parallel computer is discussed. The TC2000 is a MIMD architecture with distributed memory. The key characteristic of this architecture is a hierarchical memory structure (register, cache, local, shared). This study shows that efficient memory use, both in terms of shared memory and cache utilization, is the key to optimal performance when dealing with memory hierarchies such as in the TC2000. Although for a system of equations of fixed size, the Mflops per processor rate decreases as the number of processors increases, almost constant performance has been obtained when the number of equations is increased together with the number of processors used.
TL;DR: Recently developed parallel programming abstractions are illustrated in a complete example by programming the Jacobi iterative approximation computation to illustrate the way in which the new concepts can assist scaling and portability.
Abstract: Recently developed parallel programming abstractions are illustrated in a complete example by programming the Jacobi iterative approximation computation. The program, written in pseudocode, is designed to illustrate the way in which the new concepts can assist scaling and portability. The specific abstractions exhibited, collectively called the “phase abstractions,” are the data, code and port ensembles and the XYZ programming levels.
TL;DR: The efficiencies obtained by an implementation on a message-passing multiprocessor demonstrate the suitability of the time-parallel extrapolation method for this type of equation.
Abstract: We consider the problem of solving unsteady partial differential equations on an MIMD machine. Conventional parallel methods use a data partitioning type approach in which the solution grid at each time-step is divided amongst the available processors. The sequential nature of the time integration is, however, retained. The algorithm presented in this paper makes use of a time-parallel approach, whereby several processors may be employed to solve at several time-steps simultaneously. The time-parallel method enables the inherent parallelism of the extrapolation scheme to be efficiently exploited, allowing a significant increase both in accuracy and in the degree of parallelism. The efficiencies obtained by an implementation on a message-passing multiprocessor demonstrate the suitability of the time-parallel extrapolation method for this type of equation.