TL;DR: In this article, the authors propose zero-copy I/O sending and receiving optimizations for virtualized computer system environments running one or more virtual machines that obviate the extra host operating system (O/S) copying steps required for sending and receiving packets of data over a network connection.
Abstract: Techniques for virtualized computer system environments running one or more virtual machines that obviate the extra host operating system (O/S) copying steps required for sending and receiving packets of data over a network connection, thus eliminating major performance problems in virtualized environments. Such techniques include methods for emulating network I/O hardware device acceleration-assist technology providing zero-copy I/O sending and receiving optimizations. Implementation of these techniques requires a host O/S to perform actions including, but not limited to: checking the address translations (ensuring availability and data residency in physical memory); checking whether the destination of a network packet is local (another virtual machine within the computing system) or across an external network; and, if local, checking whether the sending VM, the receiving VM process, or both support emulated hardware acceleration-assist on the same physical system. In particular, this allows a further optimization: the packet data checksumming operations may be omitted when sending packets between virtual machines in the same physical system.
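The host-side checks listed in this abstract can be read as a small decision procedure. The following is a minimal sketch of that reading, not the patented method: the names `Path` and `plan_send` are hypothetical, and the policy of requiring acceleration-assist on both VMs before taking the local zero-copy path is an assumption, since the abstract only says that the host checks the sender, the receiver, or both.

```python
from enum import Enum


class Path(Enum):
    """Hypothetical send paths; the names are illustrative only."""
    FAULT_IN_FIRST = "fault_in_first"          # pages must be made resident first
    ZERO_COPY_LOCAL = "zero_copy_local"        # VM-to-VM on the same physical host
    ZERO_COPY_EXTERNAL = "zero_copy_external"  # out through the physical NIC
    COPY_FALLBACK = "copy_fallback"            # ordinary path with host O/S copies


def plan_send(pages_resident, dst_is_local, sender_accel, receiver_accel):
    """Return (path, compute_checksum) for one outgoing packet.

    Assumed policy: the local zero-copy path needs acceleration-assist
    on both VMs; the external zero-copy path needs it on the sender.
    """
    # Step 1: the address translations must be valid and the data resident.
    if not pages_resident:
        return Path.FAULT_IN_FIRST, True
    # Step 2: is the destination another VM on this physical system?
    if dst_is_local:
        # Step 3: do both ends support the emulated acceleration-assist?
        if sender_accel and receiver_accel:
            # The data never crosses a wire, so checksumming may be omitted.
            return Path.ZERO_COPY_LOCAL, False
        return Path.COPY_FALLBACK, True
    if sender_accel:
        return Path.ZERO_COPY_EXTERNAL, True
    return Path.COPY_FALLBACK, True
```

Only the local zero-copy branch returns `False` for the checksum flag, which mirrors the abstract's statement that checksumming may be skipped between VMs on the same physical system.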
TL;DR: This paper designs an implementation of the MPI message passing interface using a zero copy message transfer primitive supported by a lower communication layer to realize a high performance communication library.
Abstract: This paper designs an implementation of the MPI message passing interface using a zero copy message transfer primitive supported by a lower communication layer to realize a high performance communication library. The zero copy message transfer primitive requires a memory area pinned down to physical memory, which is a limited resource under a paging memory system. Uncontrolled allocation of pinned-down memory by multiple simultaneous send and receive requests can cause deadlock. To avoid this deadlock, we have introduced: i) separate control of the send and receive pin-down memory areas, to ensure that at least one send and one receive may be processed concurrently, and ii) delayed queues to handle postponed message passing operations whose memory could not be pinned down.
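The two deadlock-avoidance measures in this abstract can be sketched as a small bookkeeping class. This is an illustrative model under stated assumptions, not the paper's implementation: `PinDownPool`, its byte budgets, and the first-in-first-out retry policy are all hypothetical, and no real memory pinning takes place.

```python
from collections import deque


class PinDownPool:
    """Toy model of separate send/receive pin-down budgets with delayed queues."""

    def __init__(self, send_budget, recv_budget):
        # Separate budgets: sends can never starve receives, and vice versa.
        self.free = {"send": send_budget, "recv": recv_budget}
        # One delayed queue per direction for operations that could not pin.
        self.delayed = {"send": deque(), "recv": deque()}

    def request(self, kind, nbytes, op_id):
        """Try to pin nbytes for an operation; postpone it if that fails.

        Note: a request larger than the whole budget would wait forever in
        this sketch; a real system would have to split or reject it.
        """
        queue = self.delayed[kind]
        # Keep FIFO order: do not overtake operations that are already waiting.
        if not queue and nbytes <= self.free[kind]:
            self.free[kind] -= nbytes
            return True
        queue.append((nbytes, op_id))
        return False

    def release(self, kind, nbytes):
        """Unpin nbytes and start any delayed operations that now fit."""
        self.free[kind] += nbytes
        started = []
        queue = self.delayed[kind]
        while queue and queue[0][0] <= self.free[kind]:
            size, op_id = queue.popleft()
            self.free[kind] -= size
            started.append(op_id)
        return started
```

In this model, exhausting the send budget leaves the receive budget untouched, so a matching receive can still make progress and eventually free the sender.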
TL;DR: The implementation demonstrates the superior tolerance of host-assisted data-transfer operations to CPU intensive tasks due to minimum host involvement in the approach as compared to the traditional host-based approach and supports a very high degree of overlap of computation and communication.
Abstract: Summary form only given. Remote memory access (RMA) is an increasingly important communication model due to its excellent potential for overlapping communication and computation and achieving high performance on modern networks with RDMA hardware such as InfiniBand. RMA plays a vital role in supporting the emerging global address space programming models. We describe how RMA can be implemented efficiently over InfiniBand. The capabilities not offered directly by the InfiniBand verb layer can be implemented efficiently using the novel host-assisted approach while achieving zero-copy communication and supporting an excellent overlap of computation with communication. For contiguous data we are able to achieve a small message latency of 6 µs and a peak bandwidth of 830 MB/s for 'put' and a small message latency of 12 µs and a peak bandwidth of 765 MB/s for 'get'. These numbers are almost as good as the performance of the native VAPI layer. For noncontiguous data, the host-assisted approach can deliver bandwidth close to that for contiguous data. We also demonstrate the superior tolerance of host-assisted data-transfer operations to CPU intensive tasks due to minimum host involvement in our approach as compared to the traditional host-based approach. Our implementation also supports a very high degree of overlap of computation and communication: for large message sizes, 99% overlap was achieved for contiguous data and up to 95% for noncontiguous data. The NAS MG and matrix multiplication benchmarks were used to validate the effectiveness of our approach, and demonstrated excellent overall performance.
TL;DR: In this paper, a system and method are presented for transferring information in message packet format directly from memory locations within one node of a data processing system to memory locations in one or more receiving nodes.
Abstract: A system and method are provided which permit the transfer of information in message packet format directly from memory locations within one node of a data processing system to memory locations in one or more receiving nodes. This function is provided via communications adapters connected to each node and to a switched network, to which are also attached nodes possessing likewise configured communications adapters. The adapters are capable of responding to a wide variety of message packet transfer protocols and modalities, all of which effect a direct memory to memory transfer without the need to copy data into intermediate buffers.
TL;DR: The major architectural aspects of the SDP protocol, the ZCopy implementation, and a preliminary performance evaluation are presented, showing substantial benefits of ZCopy when multiple connections are running in parallel on the same host.
Abstract: Sockets direct protocol (SDP) is a byte-stream transport protocol implementing the TCP SOCK_STREAM semantics utilizing the transport offloading capabilities of the InfiniBand fabric. Under the hood, SDP supports a zero-copy (ZCopy) operation mode, using the InfiniBand RDMA capability to transfer data directly between application buffers. Alternatively, in buffer copy (BCopy) mode, data is copied to and from transport buffers. In the initial open-source SDP implementation, ZCopy mode was restricted to asynchronous I/O operations. We added prototype ZCopy support for the synchronous send()/recv() socket calls. This paper presents the major architectural aspects of the SDP protocol, the ZCopy implementation, and a preliminary performance evaluation. We show substantial benefits of ZCopy when multiple connections are running in parallel on the same host. For example, when 8 connections are simultaneously active, enabling ZCopy raises bandwidth from 500 MB/s to 700 MB/s, while CPU utilization decreases by a factor of 8.
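The difference between the BCopy and ZCopy modes can be illustrated with an in-process analogy. The sketch below is not SDP or RDMA code: `bcopy_send` and `zcopy_send` are hypothetical names, and Python's `bytes` and `memoryview` merely stand in for a transport buffer and a direct reference to the application buffer.

```python
def bcopy_send(app_buf: bytearray) -> bytes:
    """Buffer-copy analogy: the payload is copied into a separate buffer."""
    return bytes(app_buf)


def zcopy_send(app_buf: bytearray) -> memoryview:
    """Zero-copy analogy: the transport refers to the application buffer itself."""
    return memoryview(app_buf)


if __name__ == "__main__":
    buf = bytearray(b"hello")
    copied = bcopy_send(buf)
    view = zcopy_send(buf)
    # The application reuses its buffer before the "transfer" has finished.
    buf[0] = ord("j")
    print(copied)       # the copy still holds the original payload
    print(bytes(view))  # the zero-copy view sees the modified payload
```

The analogy shows the cost of avoiding the copy: with a zero-copy send, the application buffer must stay untouched until the transfer completes, which is one plausible reason a synchronous send() in ZCopy mode has to wait for completion before returning.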