About: qsort is a research topic. Over its lifetime, 23 publications have appeared within this topic, receiving 470 citations. The topic is also known as: quick sort function.
TL;DR: This paper identifies two aspects of the current OpenMP standard that make an implementation on networks of workstations (NOWs) hard, suggests simple modifications to the standard that remedy the situation, and presents performance results of a prototype implementation of an OpenMP subset on a NOW.
Abstract: We describe an implementation of a sizable subset of OpenMP on networks of workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary drawbacks compared to MPI, namely lack of portability to environments other than hardware shared memory machines. In order to support OpenMP execution on NOWs, our compiler targets a software distributed shared memory system (DSM) which provides multi-threaded execution and memory consistency. This paper presents two contributions. First, we identify two aspects of the current OpenMP standard that make an implementation on NOWs hard, and suggest simple modifications to the standard that remedy the situation. These problems reflect differences in memory architecture between software and hardware shared memory and the high cost of synchronization on NOWs. Second, we present performance results of a prototype implementation of an OpenMP subset on a NOW, and compare them with hand-coded software DSM and MPI results for the same applications on the same platform. We use five applications (ASCI Sweep3d, NAS 3D-FFT, SPLASH-2 Water, QSORT, and TSP) exhibiting various styles of parallelization, including pipelined execution, data parallelism, coarse-grained parallelism, and task queues. The measurements show little difference between OpenMP and hand-coded software DSM, but both are still lagging behind MPI. Further work will concentrate on compiler optimization to reduce these differences.
TL;DR: The general method works against any implementation of quicksort – even a randomizing one – that satisfies certain very mild and realistic assumptions.
TL;DR: Detailed performance evaluations are presented for six ACM algorithms; quickersort requires the fewest comparisons to sort random arrays, and qsort requires many more comparisons than its author claims.
Abstract: Detailed performance evaluations are presented for six ACM algorithms: quicksort (No. 64), Shellsort (No. 201), stringsort (No. 207), “TREESORT3” (No. 245), quickersort (No. 271), and qsort (No. 402). Algorithms 271 and 402 are refinements of algorithm 64, and all three are discussed in some detail. The evidence given here demonstrates that qsort (No. 402) requires many more comparisons than its author claims. Of all these algorithms, quickersort requires the fewest comparisons to sort random arrays.
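Comparison counts of the kind reported in this evaluation can be measured by routing every key comparison through a counting helper. The following is a minimal sketch, assuming a plain random-pivot quicksort with Lomuto partitioning; it is not ACM Algorithm 64, 271, or 402, and the names `quicksort_count` and `less` are illustrative only.

```python
import random


def quicksort_count(items, rng):
    """Sort a copy of `items`; return (sorted_list, number_of_comparisons).

    Illustrative random-pivot quicksort (Lomuto partition), not any of
    the ACM algorithms discussed in the abstract.
    """
    a = list(items)
    comparisons = 0

    def less(x, y):
        # Every key comparison goes through here so it can be counted.
        nonlocal comparisons
        comparisons += 1
        return x < y

    def sort(lo, hi):
        # Sorts a[lo..hi] inclusive. Recurses on the smaller side and
        # loops on the larger one to keep the recursion depth small.
        while lo < hi:
            p = rng.randint(lo, hi)          # random pivot index
            a[p], a[hi] = a[hi], a[p]
            pivot = a[hi]
            store = lo
            for i in range(lo, hi):
                if less(a[i], pivot):
                    a[i], a[store] = a[store], a[i]
                    store += 1
            a[store], a[hi] = a[hi], a[store]
            if store - lo < hi - store:
                sort(lo, store - 1)
                lo = store + 1
            else:
                sort(store + 1, hi)
                hi = store - 1

    sort(0, len(a) - 1)
    return a, comparisons


if __name__ == "__main__":
    data_rng = random.Random(0)
    for n in (100, 1000, 10000):
        data = [data_rng.random() for _ in range(n)]
        _, count = quicksort_count(data, random.Random(1))
        print(n, count)
```

Running the same harness over several algorithms and input sizes gives directly comparable counts, independent of machine speed.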
TL;DR: This paper discusses an implementation of Rajasekaran's (l,m)-mergesort algorithm (LMM) for sorting on parallel disks; LMM is asymptotically optimal for large problems and has the additional advantage of a low constant in its I/O complexity.
Abstract: This paper discusses our implementation of Rajasekaran's (l,m)-mergesort algorithm (LMM) for sorting on parallel disks. LMM is asymptotically optimal for large problems and has the additional advantage of a low constant in its I/O complexity. Our implementation is written in C using the ViC* I/O API for parallel disk systems. We compare the performance of LMM to that of the C library function qsort on a DEC Alpha server. qsort makes a good benchmark because it is fast and performs comparatively well under demand paging. Since qsort fails when the swap disk fills up, we can only compare these algorithms on a limited range of inputs. Still, on most out-of-core problems, our implementation of LMM runs between 1.5 and 1.9 times faster than qsort, with the gap widening with increasing problem size.
TL;DR: This paper proposes an efficient hybrid sorting method that takes advantage of the wide vector registers and high-bandwidth memory of modern AVX-512-based multi-core and many-core processors, and shows that the vectorized kernels extend to processing units with a varying number of vector lanes.
Abstract: Sorting kernels are a fundamental part of numerous applications. The performance of sorting implementations is usually limited by a variety of factors such as computing power, memory bandwidth, and branch mispredictions. In this paper we propose an efficient hybrid sorting method which takes advantage of wide vector registers and the high bandwidth memory of modern AVX-512-based multi-core and many-core processors. Our approach employs a combination of vectorized bitonic sorting and load-balanced multi-threaded merging. Thread-level and data-level parallelism are used to exploit both compute power and memory bandwidth. Our single-threaded implementation is ~30x faster than qsort in the C standard library and ~10x faster than C++'s std::sort. Compared with the Intel Performance Primitives (IPP) library, which is one of the most efficient CPU-based radix sort implementations, we obtain a speedup of 1.3 to 2.6. Furthermore, we achieve a peak performance of sorting 1.14 billion floats per second on a Xeon Phi 7210 processor. Moreover, we show the extensibility of our vectorized kernels to processing units with a varying number of vector lanes.
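Bitonic sorting, named in the abstract above, is a sorting network: the sequence of compare-exchange steps is fixed by the input length and does not depend on the data, which is what makes it amenable to vector instructions. The following is a scalar sketch of the network only, assuming a power-of-two input length; it is not the paper's AVX-512 kernel, and `bitonic_sort` is a hypothetical name.

```python
def bitonic_sort(values):
    """Return a sorted copy of `values` using a bitonic sorting network.

    Scalar illustration only; assumes len(values) is a power of two
    (lengths 0 and 1 are returned unchanged).
    """
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")

    k = 2                      # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2             # distance between compared elements
        while j > 0:
            # All compare-exchanges in this pass touch disjoint pairs,
            # so a vectorized version could perform them in parallel.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


if __name__ == "__main__":
    print(bitonic_sort([7, 3, 5, 1, 8, 2, 6, 4]))
```

Because the index pattern of each pass is known ahead of time, the inner loop contains no data-dependent branching on which elements to visit, only on whether to swap.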