TL;DR: In particular, when $N = P^{1+1/k}$ and $k$ is a constant, Cubesort sorts on the above parallel computers in $O\left(\frac{N \log N}{P}\right)$ time, thus obtaining an optimal processor-time product for comparison sorting.
TL;DR: It is shown that no monotonically decreasing Shellsort increment sequence will yield an optimal-size sorting network. A sorting algorithm called Cubesort is also presented; it is the fastest known sorting algorithm for a variety of parallel computers over a wide range of parameters.
Abstract: A fundamental operation in parallel computation is sorting. Sorting is important not only because it is required by many algorithms, but also because it can be used to implement irregular, pointer-based communication. We study two algorithms for sorting in massively parallel computers. First, we examine Shellsort. Shellsort is a sorting algorithm that is based on a sequence of parameters called increments. Shellsort can be used to create a parallel sorting device known as a sorting network. Researchers have suggested that if the correct increment sequence is used, an optimal size sorting network can be obtained. All published increment sequences have been monotonically decreasing. We show that no monotonically decreasing increment sequence will yield an optimal size sorting network. Second, we present a sorting algorithm called Cubesort. Cubesort is the fastest known sorting algorithm for a variety of parallel computers over a wide range of parameters.
We also present a paradigm for developing parallel algorithms that have efficient communication. The paradigm, called the data reduction paradigm, consists of using a divide-and-conquer strategy. Both the division and combination phases of the divide-and-conquer algorithm may require irregular, pointer-based communication between processors. However, the problem is divided so as to limit the amount of data that must be communicated. As a result the communication can be performed efficiently. We present data reduction algorithms for the image component labeling problem, the closest pair problem and four versions of the parallel prefix problem.
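To make the role of the increments in Shellsort concrete, here is a minimal sequential sketch in Python. It is an illustration only, not the sorting-network construction studied in the abstract: the function name `shellsort` and its `increments` parameter are hypothetical, and the inner loop is an ordinary data-dependent insertion sort rather than a fixed sequence of compare-exchange operations. The only requirement imposed on the sequence is that its final increment is 1, so non-monotonic sequences are accepted as well.

```python
def shellsort(items, increments):
    """Return a sorted copy of items, h-sorting once per increment.

    increments is any sequence of positive integers whose last entry is 1;
    it does not have to be monotonically decreasing.
    """
    if not increments or increments[-1] != 1:
        raise ValueError("the final increment must be 1")
    a = list(items)
    for h in increments:
        # h-sort: insertion sort on each of the h interleaved subsequences
        # a[r], a[r + h], a[r + 2h], ... for r = 0 .. h - 1.
        for i in range(h, len(a)):
            x = a[i]
            j = i
            while j >= h and a[j - h] > x:
                a[j] = a[j - h]
                j -= h
            a[j] = x
    return a


if __name__ == "__main__":
    data = [5, 2, 9, 1, 5, 6, 0]
    print(shellsort(data, [4, 2, 1]))  # monotonically decreasing increments
    print(shellsort(data, [2, 3, 1]))  # a non-monotonic sequence also sorts
```

The final pass with increment 1 is a plain insertion sort, which is what guarantees correctness; the earlier increments only affect how much work that last pass has left to do.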
TL;DR: A new algorithm is proposed, called cubesort, that sorts $N = P^{1+1/k}$ items in $O(k P^{1/k} \log P)$ time using a $P$ processor shuffle-exchange, which provides an asymptotically optimal speed-up over sequential sorting.
Abstract: This paper studies the problem of sorting $N$ items on a $P$ processor parallel machine, where $N \ge P$. The central result of the paper is a new algorithm, called cubesort, that sorts $N = P^{1+1/k}$ items in $O(k P^{1/k} \log P)$ time using a $P$ processor shuffle-exchange. Thus for any positive constant $k$, cubesort provides an asymptotically optimal speed-up over sequential sorting. Cubesort also sorts $N = P \log P$ items using a $P$ processor shuffle-exchange in $O(\log^3 P / \log\log P)$ time. Both of these results are faster than any previously published algorithms for the given problems. Cubesort also provides asymptotically optimal sorting algorithms for a wide range of parallel computers, including the cube-connected cycles and the hypercube. An important extension of the central result is an algorithm that simulates a single step of a Priority-CRCW PRAM with $N$ processors and $N$ words of memory on a $P$ processor shuffle-exchange machine in $O(k P^{1/k} \log P)$ time, where $N = P^{1+1/k}$.
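The claim of an asymptotically optimal speed-up can be checked with a short calculation. The following is a sketch, assuming the usual $\Theta(N \log N)$ cost of sequential comparison sorting and treating $k$ as a constant; the symbol $T$ for the parallel running time is introduced here for illustration only.

```latex
\begin{align*}
N = P^{1+1/k} \;&\Longrightarrow\; \log P = \frac{k}{k+1}\,\log N, \\
T &= O\!\left(k\,P^{1/k}\log P\right), \\
P \cdot T &= O\!\left(k\,P^{1+1/k}\log P\right)
           = O\!\left(\frac{k^{2}}{k+1}\,N\log N\right)
           = O(N \log N) \quad \text{for constant } k.
\end{align*}
```

Under these assumptions the processor-time product matches the sequential comparison-sorting bound up to a constant factor, which is equivalent to a running time of $O\left(\frac{N \log N}{P}\right)$.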