TL;DR: This work proposes PipeDream-2BW, a system that performs memory-efficient pipeline parallelism, a hybrid form of parallelism that combines data and model parallelism with input pipelining, able to accelerate the training of large language models with up to 2.5 billion parameters by up to 6.9x compared to optimized baselines.
Abstract: Many state-of-the-art ML results have been obtained by scaling up the number of parameters in existing models However, parameters and activations for such large models often do not fit in the memory of a single accelerator device; this means that it is necessary to distribute training of large models over multiple accelerators In this work, we propose PipeDream-2BW, a system that supports memory-efficient pipeline parallelism PipeDream-2BW uses a novel pipelining and weight gradient coalescing strategy, combined with the double buffering of weights, to ensure high throughput, low memory footprint, and weight update semantics similar to data parallelism In addition, PipeDream-2BW automatically partitions the model over the available hardware resources, while respecting hardware constraints such as memory capacities of accelerators and interconnect topologies PipeDream-2BW can accelerate the training of large GPT and BERT language models by up to 20$\times$ with similar final model accuracy
TL;DR: In this paper, the authors propose a double buffering subsystem, where a dual port memory is partitioned in software so that the top half of the memory is allocated to one processor, and the bottom half to the other.
Abstract: A novel double buffering subsystem, wherein a dual port memory is partitioned in software so that the top half of the memory is allocated to one processor, and the bottom half to the other. (This allocation is switched when both processors set respective flag bits indicating that they are ready to switch.) On accesses to this memory, additional bits tag the access as "physical," "logical," or "preview." A physical access is interpreted as a literal address within the full memory, and the double buffering is ignored. A logical access is supplemented by an additional address bit, determined by the double buffering switch state. A preview access is used for read access only, and goes to the opposite bank of memory from that which would be accessed in a logical access. This double-buffer architecture is advantageously used, in a multiprocessor system, at the interface between a numeric processor and a cache bus. The preview access can help to avoid data flow inefficiencies at synchronization points in pipelined algorithms.
TL;DR: This paper proposes a new aging scheme for Bloom filters and proves theoretically that the proposed scheme outperforms double buffering, the current state of the art, and performs experiments on real Internet traces to verify the effectiveness.
Abstract: A Bloom filter is a simple but powerful data structure that can check membership to a static set. As Bloom filters become more popular for network applications, a membership query for a dynamic set is also required. Some network applications require high-speed processing of packets. For this purpose, Bloom filters should reside in a fast and small memory, SRAM. In this case, due to the limited memory size, stale data in the Bloom filter should be deleted to make space for new data. Namely the Bloom filter needs aging like LRU caching. In this paper, we propose a new aging scheme for Bloom filters. The proposed scheme utilizes the memory space more efficiently than double buffering, the current state of the art. We prove theoretically that the proposed scheme outperforms double buffering. We also perform experiments on real Internet traces to verify the effectiveness of the proposed scheme.
TL;DR: In this article, a method and apparatus for improved double buffering within a computing system is presented, where a series of data blocks are received from a central processing unit at a rate independent of a processing rate of a recipient engine.
Abstract: A method and apparatus for improved double buffering within a computing system begins when a series of data blocks are received from a central processing unit at a rate independent of a processing rate of a recipient engine. For example, a video graphics circuit receives a series of data blocks representing video frames from the central processing unit at a rate independent of the refresh rate of the display. As the data blocks are received, the video graphics circuit queues commands of the data blocks. Typically, the commands include processing commands and a processing rate synchronize command. To process the data blocks, the co-processor pulls commands from the queued list and processes them to produce recipient data. As the co-processor is producing the recipient data, it is utilizing a first buffer. The co-processor continues to process the commands and storing the results into the first buffer until the processing rate synchronize command is detected. At this point, the co-processor pauses processing of the commands. At the beginning of the next cycle of the processing rate, the recipient data is provided from the first buffer to the recipient engine and the co-processor resumes processing of commands, which relate to another data block. As the co-processor is processing the commands of the second data block, it is utilizing a second buffer to store the processed data, i.e., the second recipient data.
TL;DR: It is found that, in most cases, interleaved layout does not improve performance, but that the new reading strategy consistently performs better than double buffering and forecasting.
Abstract: External mergesort is normally implemented so that each run is stored continuously on disk and blocks of data are read exactly in the order they are needed during merging. We investigate two ideas for improving the performance of external mergesort: interleaved layout and a new reading strategy. Interleaved layout places blocks from different runs in consecutive disk addresses. This is done in the hope that interleaving will reduce seek overhead during merging. The new reading strategy precomputes the order in which data blocks are to be read according to where they are located on disk and when they are needed for merging. Extra buffer space makes it possible to read blocks in an order that reduces seek overhead, instead of reading them exactly in the order they are needed for merging. A detailed simulation model was used to compare the two layout strategies and three reading strategies. The effects of using multiple work disks were also investigated. We found that, in most cases, interleaved layout does not improve performance, but that the new reading strategy consistently performs better than double buffering and forecasting.