TL;DR: This paper describes a simple solution to this dilemma: limit the depth of partitioning, and for subproblems that exceed the limit switch to another algorithm with a better worst‐case bound.
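The depth-limiting idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary above names neither the fallback algorithm nor the limit, so the heapsort fallback, the `depth_factor * floor(log2 n)` limit, and all function names here are assumptions.

```python
import heapq
import math


def _heapsort_slice(a, lo, hi):
    """Sort a[lo:hi] with a binary heap (O(n log n) worst case).

    Assumption: heapsort is the fallback. This version copies the slice
    for brevity instead of sorting in place.
    """
    heap = a[lo:hi]
    heapq.heapify(heap)
    for i in range(lo, hi):
        a[i] = heapq.heappop(heap)


def _partition(a, lo, hi):
    """Lomuto partition of a[lo:hi] around its middle element.

    Returns the final index of the pivot.
    """
    mid = (lo + hi) // 2
    a[mid], a[hi - 1] = a[hi - 1], a[mid]
    pivot = a[hi - 1]
    store = lo
    for i in range(lo, hi - 1):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi - 1] = a[hi - 1], a[store]
    return store


def introsort(a, depth_factor=2):
    """Quicksort with a partitioning depth limit; sorts `a` in place."""

    def sort(lo, hi, depth):
        while hi - lo > 1:
            if depth == 0:
                # Depth limit exceeded: switch to the fallback algorithm.
                _heapsort_slice(a, lo, hi)
                return
            depth -= 1
            p = _partition(a, lo, hi)
            sort(lo, p, depth)   # recurse on the left part
            lo = p + 1           # loop on the right part

    if len(a) > 1:
        # Hypothetical limit: depth_factor * floor(log2 n).
        sort(0, len(a), depth_factor * int(math.log2(len(a))))
    return a
```

An input of all-equal keys drives this particular partition scheme to its worst case, so it is a convenient way to exercise the fallback path: `introsort([5] * 300)` hits the depth limit after a few levels and finishes in the heap-based routine.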
TL;DR: The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions, which facilitates optimizations like loop unrolling and software pipelining.
Abstract: Sample sort, a generalization of quicksort that partitions the input into many pieces, is known as the best practical comparison-based sorting algorithm for distributed-memory parallel computers. We show that sample sort is also useful on a single processor. The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions. This transformation facilitates optimizations like loop unrolling and software pipelining. The final implementation, albeit cache-efficient, is limited by a linear number of memory accesses rather than the \(\mathcal{O}\!\left(n\log n\right)\) comparisons. On an Itanium 2 machine, we obtain a speedup of up to 2 over std::sort from the GCC STL library, which is known as one of the fastest available quicksort implementations.
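To make the decoupling of comparisons from branching concrete, here is a hedged sketch of bucket classification in which each comparison result is used as a number in index arithmetic instead of selecting a branch. Python exposes no predicated instructions, so this only illustrates the data flow. The implicit search-tree layout and the names `build_tree`, `classify`, and `distribute` are assumptions, since the abstract does not describe the data structure.

```python
def build_tree(splitters):
    """Store sorted splitters as an implicit binary search tree.

    The tree is 1-based: the children of node j are 2j and 2j + 1.
    Assumes len(splitters) + 1 (the bucket count k) is a power of two.
    """
    k = len(splitters) + 1
    if k & (k - 1):
        raise ValueError("number of buckets must be a power of two")
    tree = [None] * k  # index 0 is unused

    def fill(node, lo, hi):
        # Place the median of splitters[lo:hi] at `node`, then recurse.
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, len(splitters))
    return tree


def classify(x, tree):
    """Return the bucket index of x without an if/else on the comparison."""
    k = len(tree)
    j = 1
    for _ in range(k.bit_length() - 1):  # log2(k) tree levels
        # The comparison yields 0 or 1 and is added to the index directly.
        j = 2 * j + (x > tree[j])
    return j - k


def distribute(data, splitters):
    """Place every element of `data` into its bucket."""
    tree = build_tree(splitters)
    buckets = [[] for _ in range(len(tree))]
    for x in data:
        buckets[classify(x, tree)].append(x)
    return buckets
```

With splitters `[10, 20, 30]`, an element equal to a splitter falls into the bucket to that splitter's left, because the test is a strict `>`. The inner loop has a fixed trip count and no data-dependent control flow, which is the property that would make unrolling and software pipelining straightforward in a compiled implementation.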
TL;DR: A network hub and Asynchronous Transfer Mode (ATM) translator system for use in a Local Area Network (LAN)-based communications system is disclosed. It includes a host controller that serves as the LAN hub and interfaces with a translator card.
Abstract: A network hub and Asynchronous Transfer Mode (ATM) translator system (5) for use in a Local Area Network (LAN)-based communications system is disclosed. The network hub and ATM translator system (5) includes a host controller (10) that serves as the LAN hub and interfaces with a translator card (15). The translator card includes a segmentation and reassembly (SAR) device (12) in connection with SONET receive/transmit circuitry (20) that communicates with a transceiver (22) to transmit and receive ATM packet cells over a communications facility (FO). The translator card (15) also includes a scheduler (14) that includes a heap sort state machine (36) which maintains a sorted list of entries, in a heap fashion, in on-chip parameter memory (44) and off-chip parameter memory (18). The entries include, for each ATM channel, a channel identifier and a timestamp that indicates the time at which the next cell for the channel will be due for transmission. A due comparator (40) compares the timestamp of the root value in the heap (i.e., the channel with the next due cell) to a global time generated by a reference timer (38), and indicates to a source behavior processor (24) in the scheduler (14) that a cell is due for transmission. The scheduler then issues a transmit credit for the cell, and communicates this event to the SAR device (12) to effect the transmission as appropriate.
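The heap-based scheduling described in the abstract can be modeled in software as a min-heap keyed on the due timestamp, with a comparison of the root against a global time. This is only a rough sketch of the idea, not the disclosed hardware: the class `CellScheduler`, the helper `run`, the integer tick timer, and the fixed per-channel rescheduling interval are all hypothetical.

```python
import heapq


class CellScheduler:
    """Min-heap of (due_time, channel_id) entries; the root is due next."""

    def __init__(self):
        self._heap = []

    def schedule(self, channel_id, due_time):
        """Record when the next cell of `channel_id` is due."""
        heapq.heappush(self._heap, (due_time, channel_id))

    def pop_due(self, now):
        """Compare the root timestamp with the global time `now`.

        Returns the channel whose cell is due and removes its entry,
        or returns None if the earliest entry is still in the future.
        """
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None


def run(scheduler, intervals, until):
    """Advance a tick counter and collect (time, channel) transmit credits.

    Assumption: after a credit is issued, the channel's next cell is due
    `intervals[channel]` ticks later; intervals must be positive.
    """
    credits = []
    for now in range(until):
        while (channel := scheduler.pop_due(now)) is not None:
            credits.append((now, channel))
            scheduler.schedule(channel, now + intervals[channel])
    return credits
```

For example, scheduling channels 1 and 2 at time 0 with intervals of 2 and 3 ticks and calling `run(scheduler, {1: 2, 2: 3}, 5)` issues credits for channel 1 at times 0, 2, and 4 and for channel 2 at times 0 and 3. Only the root is ever inspected, so deciding whether any cell is due costs one comparison regardless of the number of channels.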