About: Coppersmith–Winograd algorithm is a research topic. Over the lifetime, 60 publications have been published within this topic receiving 1949 citations.
TL;DR: In this paper, the author gives an algorithm which computes the coefficients of the product of two square matrices A and B of order n with fewer than $4.7 \cdot n^{\log 7}$ arithmetical operations (all logarithms in this paper are for base 2).
Abstract: 1. Below we will give an algorithm which computes the coefficients of the product of two square matrices A and B of order n from the coefficients of A and B with less than $4.7 \cdot n^{\log 7}$ arithmetical operations (all logarithms in this paper are for base 2, thus $\log 7 \approx 2.8$; the usual method requires approximately $2n^3$ arithmetical operations). The algorithm induces algorithms for inverting a matrix of order n, solving a system of n linear equations in n unknowns, computing a determinant of order n, etc., all requiring less than $\mathrm{const} \cdot n^{\log 7}$ arithmetical operations. This fact should be compared with the result of KLYUYEV and KOKOVKIN-SHCHERBAK [1] that Gaussian elimination for solving a system of linear equations is optimal if one restricts oneself to operations upon rows and columns as a whole. We also note that WINOGRAD [2] modifies the usual algorithms for matrix multiplication and inversion and for solving systems of linear equations, trading roughly half of the multiplications for additions and subtractions. It is a pleasure to thank D. BRILLINGER for inspiring discussions about the present subject and ST. COOK and B. PARLETT for encouraging me to write this paper. 2. We define algorithms $\alpha_{m,k}$ which multiply matrices of order $m 2^k$, by induction on k: $\alpha_{m,0}$ is the usual algorithm for matrix multiplication (requiring $m^3$ multiplications and $m^2(m-1)$ additions). $\alpha_{m,k}$ already being known, define $\alpha_{m,k+1}$ as follows: If A, B are matrices of order $m 2^{k+1}$ to be multiplied, write
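The abstract above is cut off before the block formulas, so the following is only a minimal sketch of the inductive scheme it describes, written in the commonly cited form of Strassen's seven block products (the notation may differ from the paper's). The function names `classical`, `split`, and `strassen`, and the `base` parameter, are illustrative assumptions, not names taken from the paper. Recursion continues while the order is even and larger than `base`, which covers orders of the form $m 2^k$.

```python
def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def classical(A, B):
    # The "usual" algorithm: n^3 multiplications for order n.
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def split(X):
    # Cut a matrix of even order into four blocks of half the order.
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B, base=1):
    # Hypothetical helper: falls back to the usual algorithm once the
    # order is odd or no larger than `base` (the role of m in the text).
    n = len(A)
    if n <= base or n % 2:
        return classical(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive block products instead of eight.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), base)
    M2 = strassen(mat_add(A21, A22), B11, base)
    M3 = strassen(A11, mat_sub(B12, B22), base)
    M4 = strassen(A22, mat_sub(B21, B11), base)
    M5 = strassen(mat_add(A11, A12), B22, base)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), base)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), base)
    # Recombine the products into the four blocks of C = A * B.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

For example, `strassen(A, B)` on two 4×4 integer matrices should agree entry by entry with `classical(A, B)`. In practice a larger `base` is usually chosen, since the extra additions outweigh the saved multiplication for small blocks.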
TL;DR: In this article, it is recalled that Strassen's recursion gives $\omega \le \log_2 7 < 2.8074$, improving on the trivial exponent 3, and it is examined how cubing and raising to the fourth power Coppersmith and Winograd's "complicated" algorithm affect the value of $\omega$, if at all.
Abstract: The evaluation of the product of two matrices can be very computationally expensive. The multiplication of two n×n matrices, using the “default” algorithm, can take $O(n^3)$ field operations in the underlying field k. It is therefore desirable to find algorithms to reduce the “cost” of multiplying two matrices together. If multiplication of two n×n matrices can be obtained in $O(n^\alpha)$ operations, the infimum of such α is called the exponent of matrix multiplication and is denoted by ω. A bound ω < 3 was found in 1968 by Strassen. He found that multiplication of two 2×2 matrices could be obtained in 7 multiplications in the underlying field k, as opposed to the 8 required to do the same multiplication previously. Using recursion, we are able to show that $\omega \le \log_2 7 < 2.8074$, which is better than the value of 3 we had previously. In chapter 1, we look at various techniques that have been found for reducing ω. These include Pan’s Trilinear Aggregation, Bini’s Border Rank and Schönhage’s Asymptotic Sum Inequality. In chapter 2, we look in detail at the current best estimate of ω found by Coppersmith and Winograd. We also propose a different method of evaluating the “value” of trilinear forms. Chapters 3 and 4 build on the work of Coppersmith and Winograd and examine how cubing and raising to the fourth power of Coppersmith and Winograd’s “complicated” algorithm affect the value of ω, if at all. Finally, in chapter 5, we look at the Group-Theoretic context proposed by Cohn and Umans, and see how we can derive some of Coppersmith and Winograd’s values using this method, as well as showing how working in this context can perhaps be more conducive to showing ω = 2.
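The recursion step mentioned in this abstract can be spelled out as a short worked derivation. This is a sketch under the usual assumptions, not a quotation from the thesis: $n = 2^k$, $M(n)$ counts scalar multiplications, $T(n)$ counts all arithmetic operations, and the figure of 18 block additions per level is the count for the commonly cited form of Strassen's 2×2 scheme.

```latex
% Multiplications: each level replaces one product of order n by 7 of order n/2.
M(2^k) = 7\,M(2^{k-1}) = \dots = 7^k\,M(1) = 7^k = (2^k)^{\log_2 7} = n^{\log_2 7}.

% All operations: 18 additions of (n/2) x (n/2) blocks per level (assumed count).
T(n) = 7\,T(n/2) + 18\,(n/2)^2
\;\Longrightarrow\;
T(n) = O\!\left(n^{\log_2 7}\right),
\quad\text{since } 7 > 2^2 .

% Hence the exponent bound:
\omega \le \log_2 7 \approx 2.8074 .
```

The additions do not change the exponent because the number of subproblems (7) grows faster per level than the quadratic cost of combining them (a factor of 4).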
TL;DR: In this article, a new parallel algorithm based on Strassen's fast matrix multiplication algorithm is presented, which is communication-optimal and exhibits perfect strong scaling within the maximum possible range.
Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
TL;DR: Memory-independent communication cost lower bounds are obtained for classical and Strassen-based distributed-memory matrix multiplication algorithms; they imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two known parallel algorithms.
Abstract: A parallel algorithm has perfect strong scaling if its running time on $P$ processors is linear in $1/P$, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
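The definition of perfect strong scaling at the start of this abstract (running time linear in $1/P$) can be checked against measured timings with a small helper. This is a minimal sketch: the function name `strong_scaling_efficiency` and the timing values are hypothetical and are not taken from either paper.

```python
def strong_scaling_efficiency(timings):
    """Map each processor count P to T(P0)*P0 / (T(P)*P), P0 = smallest P.

    `timings` maps processor count -> run time for one fixed problem size.
    An efficiency of 1.0 means the run time is exactly proportional to 1/P
    relative to the baseline, i.e. perfect strong scaling over that range.
    """
    p0 = min(timings)
    t0 = timings[p0]
    return {p: (t0 * p0) / (t * p) for p, t in sorted(timings.items())}

# Hypothetical timings in seconds (illustrative only, not measured data).
hypothetical = {64: 100.0, 128: 50.0, 256: 25.0, 512: 16.0}
eff = strong_scaling_efficiency(hypothetical)
```

In this made-up data the efficiency stays at 1.0 up to 256 processors and drops below 1.0 at 512, which is the kind of cutoff the lower bounds in the abstract describe: beyond some processor count, communication stops scaling.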
TL;DR: The results of numerical tests clearly indicate that the accelerated matrix multiplication routines implementing two or three disjoint product-based algorithms are comparable in computational time with an implementation of the Winograd algorithm and clearly outperform it with respect to working space and (especially) numerical stability.