TL;DR: This paper presents the design and implementation of a HERA (HEterogeneous Reconfigurable Architecture) machine that employs FPGAs to allow the simultaneous execution of a variety of parallel processing modes, including SIMD, MIMD, and MSIMD.
Abstract: The high price, long design and development cycles, programming difficulty, and high maintenance cost of supercomputers limit their range of potential applications. Recent advances in Field-Programmable Gate Arrays (FPGAs) have made feasible the development of high-performance and programmable parallel systems on a programmable chip (PSOPC). PSOPCs yield high performance at low cost for many parallel applications. We present in this paper the design and implementation of our HERA (HEterogeneous Reconfigurable Architecture) machine that employs FPGAs to allow the simultaneous execution of a variety of parallel processing modes, including SIMD (Single-Instruction, Multiple-Data), MIMD (Multiple-Instruction, Multiple-Data) and MSIMD (Multiple-SIMD). The processing element (PE) is centered on a single-precision IEEE 754 floating-point unit (FPU) and employs a 7-stage pipeline. To demonstrate the robustness and viability of our approach, we propose a data partitioning scheme and employ mixed-mode scheduling for Cannon’s matrix-matrix multiplication algorithm with matrices of arbitrary size and shape. Performance results on our 64-PE machine, which employs a dual-FPGA system, are better than the optimized performance on a dual-Xeon PC.
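For readers unfamiliar with the algorithm named in this abstract, the following is a minimal single-process sketch of the textbook form of Cannon's matrix multiplication on a q x q grid of processing elements. It is not the HERA implementation: it assumes square matrices whose size is divisible by q (the paper handles arbitrary sizes and shapes with its own partitioning scheme, which is not reproduced here), and the names `cannon`, `matmul`, and `block` are hypothetical helpers introduced for illustration.

```python
def matmul(X, Y):
    """Plain triple-loop product of two dense matrices given as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q grid (assumes n divisible by q)."""
    n = len(A)
    assert n % q == 0, "this sketch only handles n divisible by q"
    b = n // q  # block edge length

    def block(M, i, j):
        # Extract the b x b block at block-row i, block-column j.
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # Initial skew: PE (i, j) starts with A block (i, i+j) and B block (i+j, j).
    a = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Each PE multiplies its local blocks and accumulates into its C block.
        for i in range(q):
            for j in range(q):
                p = matmul(a[i][j], bb[i][j])
                c[i][j] = [[c[i][j][r][s] + p[r][s] for s in range(b)]
                           for r in range(b)]
        # Circular shift: A blocks move one step left, B blocks one step up.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble the distributed C blocks into one n x n matrix.
    return [[c[r // b][s // b][r % b][s % b] for s in range(n)] for r in range(n)]
```

Calling `cannon(A, B, 2)` on two 4 x 4 matrices should return the same result as `matmul(A, B)`. In a real parallel machine each grid cell would be a separate PE and the two shifts would be nearest-neighbor communication steps.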
TL;DR: A new aggregation pattern (the stripe-continuous aggregation pattern), which fully considers the striping mechanism and lock protocol of the Lustre file system, is proposed to improve the collective I/O performance of Cannon's program.
Abstract: Matrix multiplication is one of the most important operations in linear algebra, widely used in many fields of science and engineering. Cannon's algorithm is a classical distributed algorithm for matrix multiplication on two-dimensional meshes. Generally, MPI-IO is used for its I/O requirements. However, it has been well documented that MPI-IO performs poorly in a Lustre file system environment. As the scale of the matrix multiplication increases, this problem becomes more serious and turns into a key factor limiting the performance of the program. To improve the collective I/O performance of Cannon's program, we propose a new aggregation pattern (the stripe-continuous aggregation pattern), which fully considers the striping mechanism and lock protocol of the Lustre file system. Theoretical analysis and experimental results show that, compared with other patterns, this pattern makes full use of the capacity of the Lustre file system and effectively improves the I/O performance of Cannon's program.
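The abstract does not spell out the stripe-continuous pattern, so the sketch below does not attempt to reproduce it. It only illustrates the general idea of stripe-aware aggregation, assuming the usual round-robin (RAID-0 style) striping in which consecutive stripes of a file go to consecutive storage targets. The function names `ost_index` and `stripe_aligned_domains` are hypothetical and are not part of any MPI-IO or Lustre API.

```python
def ost_index(offset, stripe_size, stripe_count):
    """Storage target that holds the byte at `offset`, assuming round-robin striping."""
    return (offset // stripe_size) % stripe_count


def stripe_aligned_domains(file_size, stripe_size, n_aggregators):
    """Split [0, file_size) into contiguous per-aggregator byte ranges.

    Every internal boundary falls on a stripe boundary, so no stripe is
    shared by two aggregators (hypothetical helper for illustration).
    """
    n_stripes = -(-file_size // stripe_size)  # ceiling division
    base, extra = divmod(n_stripes, n_aggregators)
    domains = []
    start = 0
    for agg in range(n_aggregators):
        # The first `extra` aggregators take one additional stripe each.
        count = base + (1 if agg < extra else 0)
        end = min(start + count * stripe_size, file_size)
        domains.append((start, end))
        start = end
    return domains
```

With a 10-byte file, 4-byte stripes, and two aggregators, the first aggregator would own bytes 0 to 8 (two whole stripes) and the second bytes 8 to 10. Keeping each stripe with a single writer is one plausible way to reduce lock contention between aggregators; whether the paper's pattern works exactly this way is an assumption.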
TL;DR: The NavP methodology is based on the principle of self-migrating computations and is truly incremental, in that each step represents a functioning program and every intermediate program is an improvement over its predecessor.
Abstract: We show how a series of transformations can be applied to incrementally parallelize sequential programs. Our navigational programming (NavP) methodology is based on the principle of self-migrating computations and is truly incremental, in that each step represents a functioning program and every intermediate program is an improvement over its predecessor. The transformations are mechanical and straightforward to apply. We illustrate our methodology in the context of matrix multiplication. Our final stage is similar to the classical Gentleman's algorithm. The NavP methodology is conducive to new ways of thinking that lead to ease of programming and high performance.
TL;DR: The case in which the generalized Cannon's algorithm makes it possible to reduce communications in matrix multiplication is discussed, and two strategies are proposed to solve the problem of multiplying two large square matrices.
Abstract: In this paper we discuss the case in which, using the generalized Cannon's algorithm, it is possible to reduce communications in matrix multiplication. We then apply this reduction of communications to the multiplication of large matrices, in particular rectangular matrices. Two strategies are proposed to solve the problem of multiplying two large square matrices. For the case of small matrices, some methods are proposed to make use of all available processors.
TL;DR: Heuristics to improve the efficiency of Cannon's algorithm are described, and the method is used to derive a new presentation for the Lyons sporadic group.
Abstract: This article is devoted to describing heuristics to improve the efficiency of Cannon's algorithm. As an application, the method is used to derive a new presentation for the Lyons sporadic group.