TL;DR: Several novel ESX Server mechanisms and policies for managing memory are introduced, including a ballooning technique that reclaims the pages considered least valuable by the operating system running in a virtual machine, and an idle memory tax that achieves efficient memory utilization.
Abstract: VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems. This paper introduces several novel ESX Server mechanisms and policies for managing memory. A ballooning technique reclaims the pages considered least valuable by the operating system running in a virtual machine. An idle memory tax achieves efficient memory utilization while maintaining performance isolation guarantees. Content-based page sharing and hot I/O page remapping exploit transparent page remapping to eliminate redundancy and reduce copying overheads. These techniques are combined to efficiently support virtual machine workloads that overcommit memory.
TL;DR: It is found that the hardware support for Virtual Machine Monitors for x86 fails to provide an unambiguous performance advantage for two primary reasons: first, it offers no support for MMU virtualization; second, it fails to co-exist with existing software techniques for MM U virtualization.
Abstract: Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization. Virtual Machine Monitors for x86, such as VMware ® Workstation and Virtual PC, have instead used binary translation of the guest kernel code. However, both Intel and AMD have now introduced architectural extensions to support classical virtualization.We compare an existing software VMM with a new VMM designed for the emerging hardware support. Surprisingly, the hardware VMM often suffers lower performance than the pure software VMM. To determine why, we study architecture-level events such as page table updates, context switches and I/O, and find their costs vastly different among native, software VMM and hardware VMM execution.We find that the hardware support fails to provide an unambiguous performance advantage for two primary reasons: first, it offers no support for MMU virtualization; second, it fails to co-exist with existing software techniques for MMU virtualization. We look ahead to emerging techniques for addressing this MMU virtualization problem in the context of hardware-assisted virtualization.
TL;DR: A tiny hypervisor that ensures code integrity for commodity OS kernels, SecVisor ensures that only user-approved code can execute in kernel mode over the entire system lifetime, which protects the kernel against code injection attacks, such as kernel rootkits.
Abstract: We propose SecVisor, a tiny hypervisor that ensures code integrity for commodity OS kernels. In particular, SecVisor ensures that only user-approved code can execute in kernel mode over the entire system lifetime. This protects the kernel against code injection attacks, such as kernel rootkits. SecVisor can achieve this propertyeven against an attacker who controls everything but the CPU, the memory controller, and system memory chips. Further, SecVisor can even defend against attackers with knowledge of zero-day kernel exploits.Our goal is to make SecVisor amenable to formal verificationand manual audit, thereby making it possible to rule out known classes of vulnerabilities. To this end, SecVisor offers small code size and small external interface. We rely on memory virtualization to build SecVisor and implement two versions, one using software memory virtualization and the other using CPU-supported memory virtualization. The code sizes of the runtime portions of these versions are 1739 and 1112 lines, respectively. The size of the external interface for both versions of SecVisor is 2 hypercalls. It is easy to port OS kernels to SecVisor. We port the Linux kernel version 2.6.20 by adding 12 lines and deleting 81 lines, out of a total of approximately 4.3 million lines of code in the kernel.
TL;DR: An in-depth examination of the 2D page table walk overhead and options for decreasing it is presented, which includes using the AMD Opteron processor's page walk cache to exploit the strong reuse of page entry references.
Abstract: Nested paging is a hardware solution for alleviating the software memory management overhead imposed by system virtualization. Nested paging complements existing page walk hardware to form a two-dimensional (2D) page walk, which reduces the need for hypervisor intervention in guest page table management. However, the extra dimension also increases the maximum number of architecturally-required page table references.This paper presents an in-depth examination of the 2D page table walk overhead and options for decreasing it. These options include using the AMD Opteron processor's page walk cache to exploit the strong reuse of page entry references. For a mix of server and SPEC benchmarks, the presented results show a 15%-38% improvement in guest performance by extending the existing page walk cache to also store the nested dimension of the 2D page walk. Caching nested page table translations and skipping multiple page entry references produce an additional 3%-7% improvement.Much of the remaining 2D page walk overhead is due to low-locality nested page entry references, which result in additional memory hierarchy misses. By using large pages, the hypervisor can eliminate many of these long-latency accesses and further improve the guest performance by 3%-22%.
TL;DR: In this article, a computer system may be configured to dynamically select a memory virtualization and corresponding virtual-to-physical address translation technique during execution of an application and to dynamically employ the selected technique in place of a current technique without re-initializing the application.
Abstract: A computer system may be configured to dynamically select a memory virtualization and corresponding virtual-to-physical address translation technique during execution of an application and to dynamically employ the selected technique in place of a current technique without re-initializing the application. The computer system may be configured to determine that a current address translation technique incurs a high overhead for the application's current workload and may be configured to select a different technique dependent on various performance criteria and/or a user policy. Dynamically employing the selected technique may include reorganizing a memory, reorganizing a translation table, allocating a different block of memory to the application, changing a page or segment size, or moving to or from a page-based, segment-based, or function-based address translation technique. A selected translation technique may be dynamically employed for the application independent of a translation technique employed for a different application.