TL;DR: This paper persuaded Intel to modify its commercial C/C++ compiler to include pedigrees, and built a library implementation of a deterministic parallel random-number generator called DotMix that compresses the pedigree and then "RC6-mixes" the result.
Abstract: Existing concurrency platforms for dynamic multithreading do not provide repeatable parallel random-number generators. This paper proposes that a mechanism called pedigrees be built into the runtime system to enable efficient deterministic parallel random-number generation. Experiments with the open-source MIT Cilk runtime system show that the overhead for maintaining pedigrees is negligible. Specifically, on a suite of 10 benchmarks, the relative overhead of Cilk with pedigrees to the original Cilk has a geometric mean of less than 1%.We persuaded Intel to modify its commercial C/C++ compiler, which provides the Cilk Plus concurrency platform, to include pedigrees, and we built a library implementation of a deterministic parallel random-number generator called DotMix that compresses the pedigree and then "RC6-mixes" the result. The statistical quality of DotMix is comparable to that of the popular Mersenne twister, but somewhat slower than a nondeterministic parallel version of this efficient and high-quality serial random-number generator. The cost of calling DotMix depends on the "spawn depth" of the invocation. For a naive Fibonacci calculation with n=40 that calls DotMix in every node of the computation, this "price of determinism" is a factor of 2.65 in running time, but for more realistic applications with less intense use of random numbers -- such as a maximal-independent-set algorithm, a practical samplesort program, and a Monte Carlo discrete-hedging application from QuantLib -- the observed "price" was less than 5%. Moreover, even if overheads were several times greater, applications using DotMix should be amply fast for debugging purposes, which is a major reason for desiring repeatability.
TL;DR: In many of the remaining cases, the new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm, confirming the claims made about the robust performance of the algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.
Abstract: We present sorting algorithms that represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. A part of the speed advantage is due to the feature to work in-place. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our comparison-based algorithm, In-place Superscalar Samplesort (IPS$^4$o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best in-place parallel comparison-based competitor by almost a factor of three. IPS$^4$o also outperforms the best comparison-based competitors in the in-place or not in-place, parallel or sequential settings. IPS$^4$o even outperforms the best integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new in-place radix sorter turns out to be the best algorithm. Claims to have the, in some sense, "best" sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on extensive experiments involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.
TL;DR: This paper presents a performance study for many of today's popular parallel sorting algorithms on a large scale MIMD system and develops a theoretical model to predict the performance in terms of communication and computation times.
Abstract: This paper presents a performance study for many of today's popular parallel sorting algorithms. It is the first to present a comparative study on a large scale MIMD system. The machine, a Parsytec GCel, contains 1024 processors connected as a two-dimensional grid. To justify the experimental results, we develop a theoretical model to predict the performance in terms of communication and computation times. We get a very close relation between the experiments and the theoretical model as long as the edge congestion caused by the algorithms is predicted precisely. We compare: Bitonicsort, Shearsort, Gridsort, Samplesort, and Radixsort. Experiments were performed using random instances according to a well known benchmark problem. Results show that for the machine we used, Bitonicsort performs best for smaller numbers of keys per processor ( >
TL;DR: This work proposes a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA, and uses a novel parallel hardware partitioner that is only limited in data set size by available FPGAs hardware resources.
Abstract: Sorting is a fundamental operation in many applications such as databases, search, and social networks. Although FPGAs have been shown effective at sorting data sizes that fit on chip, systems that sort larger data sets by shuffling data on and off chip are typically bottlenecked by costly merge operations or data transfer time. We propose a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA. Samplesort works by randomly sampling to determine how to partition data into approximately equal-sized non-overlapping "buckets," sorting each bucket, and concatenating the results. Although samplesort can partition a large problem into smaller ones that fit in the FPGA's on-chip memory, partitioning in software is slow. Our system uses a novel parallel hardware partitioner that is only limited in data set size by available FPGA hardware resources. After partitioning, each bucket is sorted using parallel sorting hardware. The CPU is responsible for sampling data, cleaning up any potential problems caused by variation in bucket size, and providing scalability by performing an initial coarse-grained partitioning when the input set is larger than the FPGA can sort. We prototype our design using Amazon Web Services FPGA instances, which pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments demonstrate a 17.1x speedup over GNU parallel sort when sorting 2^23 key-value records and a speedup of 4.2x when sorting 2^30 records.
TL;DR: A new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers, is presented, able to almost completely hide the computation (sorting) behind the IO latency.
Abstract: In this paper, we present a new out-of-core sort algorithm, designed for problems that are too large to fit into the aggregate RAM available on modern supercomputers. We analyze the performance including the cost of IO and demonstrate the fastest (to the best of our knowledge) reported throughput using the canonical sortBenchmark on a general-purpose, production HPC resource running Lustre. By clever use of available storage and a formulation of asynchronous data transfer mechanisms, we are able to almost completely hide the computation (sorting) behind the IO latency. This latency hiding enables us to achieve comparable execution times, including the additional temporary IO required, between a large sort problem (5TB) run as a single, in-RAM sort and our out-of-core approach using 1/10th the amount of RAM. In our largest run, sorting 100TB of records using 1792 hosts, we achieved an end-to-end throughput of 1.24TB/min using our general-purpose sorter, improving on the current Daytona record holder by 65%.