Open Access
Matrix Multiplication on Boolean Cubes using Generic Communication Primitives
Lennart Johnsson
- 01 Jan 1989
- pp 108-156
35
TL;DR: Generic primitives for matrix operations as defined by levels one, two and three of the BLAS are of great value in that they make user programs much simpler, and hide most of the architectural detail of importance for performance in the primitives.
Abstract: Generic primitives for matrix operations as defined by levels one, two and three of the BLAS are of great value in that they make user programs much simpler, and hide most of the architectural detail of importance for performance in the primitives. We describe generic shared memory primitives such as one-to-all and all-to-all broadcasting, and one-to-all and all-to-all personalized communication, and implementations thereof that are within a factor of two of the best known lower bounds. We describe algorithms for the multiplication of arbitrarily shaped matrices using these primitives. Of the three loops required for a standard matrix multiplication algorithm expressed in Fortran, all three can be parallelized. We show that if one loop is parallelized, then the processors should be aligned with the loop having the most elements. Depending on the initial matrix allocation, data permutations may be required to accomplish the processor/loop alignment. This permutation is included in our analysis. We show that in parallelizing two loops the optimum aspect ratio of the processing plane is equal to the ratio of the number of matrix elements in the two loops being parallelized.
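The loop-alignment rules stated in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names, the loop labels `i`, `j`, `k`, and the choice to measure aspect-ratio closeness on a log2 scale are all assumptions made for illustration. It also assumes that a Boolean cube of dimension `n_dims` has `2**n_dims` processors, so that any two-dimensional processing plane must have power-of-two sides.

```python
import math


def matmul_three_loops(a, b):
    """Standard three-loop product C = A * B on lists of lists."""
    p, q, r = len(a), len(b), len(b[0])
    c = [[0] * r for _ in range(p)]
    for i in range(p):          # loop over rows of A (extent p)
        for j in range(r):      # loop over columns of B (extent r)
            for k in range(q):  # loop over the inner dimension (extent q)
                c[i][j] += a[i][k] * b[k][j]
    return c


def loop_to_parallelize(p, q, r):
    """If only one loop is parallelized, pick the one with the most elements.

    Labels are hypothetical: 'i' has extent p, 'k' has extent q, 'j' has extent r.
    """
    extents = {"i": p, "k": q, "j": r}
    return max(extents, key=extents.get)


def processor_plane(n_dims, extent_a, extent_b):
    """Split 2**n_dims processors into a 2**s by 2**(n_dims - s) plane.

    The split s is chosen so that the plane's aspect ratio is as close as
    possible (on a log2 scale, an assumption of this sketch) to the ratio
    extent_a / extent_b of the two parallelized loops.
    """
    target = math.log2(extent_a / extent_b)
    best = min(range(n_dims + 1), key=lambda s: abs((2 * s - n_dims) - target))
    return 2 ** best, 2 ** (n_dims - best)
```

For example, under these assumptions a 6-dimensional cube (64 processors) assigned to loops of extent 1024 and 64 would be arranged as a 32 by 2 plane, matching the 16:1 ratio of the loop extents. The sketch ignores the data permutation cost that the abstract says is included in the paper's analysis.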
Citations
Efficient algorithms for all-to-all communications in multiport message-passing systems
TL;DR: This work presents efficient algorithms for two all-to-all communication operations in message-passing systems: index and concatenation, both of which are based on the communication start-up time and the communication bandwidth.
What have we learnt from using real parallel machines to solve real problems
Geoffrey C. Fox
- 03 Jan 1989
TL;DR: A space-time analogy is used to classify problems and shows how a division into synchronous, loosely synchronous and asynchronous problems is helpful and isolates the asynchronous class as that for which major uncertainties as to possible parallelism exist.
91
Optimal broadcast in all-port wormhole-routed hypercubes
Ching-Tien Ho,Ming-Yang Kao +1 more
TL;DR: An optimal algorithm that broadcasts on an n-dimensional hypercube in O(n / log_2(n+1)) routing steps with wormhole, e-cube routing and all-port communication is given.
67
Compiling parallel programs by optimizing performance
TL;DR: This paper describes how Crystal, a language based on familiar mathematical notation and lambda calculus, addresses the issues of programmability and performance for parallel supercomputers and illustrates the power of its approach with benchmarks of compiled parallel code from Crystal source.
67
Fast Gossiping by Short Messages
TL;DR: This paper considers the problem of gossiping in communication networks under the restriction that communicating nodes can exchange up to a fixed number p of packets at each round, and determines the optimal number of communication rounds to perform gossiping for several classes of graphs, including Hamiltonian graphs and complete k-ary trees.