TL;DR: Store sets accurately predict memory dependencies in superscalar machines with large instruction windows, and allow near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies.
Abstract: For maximum performance, an out-of-order processor must issue load instructions as early as possible, while avoiding memory-order violations with prior store instructions that write to the same memory location. One approach is to use memory dependence prediction to identify the stores upon which a load depends, and communicate that information to the instruction scheduler. We designate the set of stores upon which each load has depended as the load's "store set". The processor can discover and use a load's store set to accurately predict the earliest time the load can safely execute. We show that store sets accurately predict memory dependencies in the context of large instruction window, superscalar machines, and allow for near-optimal performance compared to an instruction scheduler with perfect knowledge of memory dependencies. In addition, we explore the implementation aspects of store sets, and describe a low cost implementation that achieves nearly optimal performance.
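The idea of a store set can be illustrated with a minimal Python sketch. It assumes an idealized, unbounded table keyed by instruction address (PC); the class and method names are hypothetical, and this is not the low-cost hardware implementation the abstract refers to.

```python
class StoreSetPredictor:
    """Idealized store-set table: one unbounded set of store PCs per load PC.

    Hypothetical model for illustration only, not a hardware design.
    """

    def __init__(self):
        self.store_sets = {}  # load PC -> set of store PCs it has depended on

    def record_violation(self, load_pc, store_pc):
        # A memory-order violation reveals a dependence: remember the store.
        self.store_sets.setdefault(load_pc, set()).add(store_pc)

    def stores_to_wait_for(self, load_pc, inflight_store_pcs):
        # The load is predicted safe to issue once these stores have executed.
        store_set = self.store_sets.get(load_pc, set())
        return [pc for pc in inflight_store_pcs if pc in store_set]


predictor = StoreSetPredictor()
first_try = predictor.stores_to_wait_for(0x40, [0x10, 0x20])   # nothing learned yet
predictor.record_violation(load_pc=0x40, store_pc=0x20)
second_try = predictor.stores_to_wait_for(0x40, [0x10, 0x20])  # now waits on 0x20
```

In this sketch the load at 0x40 issues freely the first time and, after one recorded violation, waits only for the store at 0x20 rather than for every prior store.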
TL;DR: In this article, out-of-order execution of load and store instructions is dynamically controlled by detecting store violation conditions and avoiding the penalty of a pipeline recovery process. A store barrier cache dynamically predicts whether a violation is likely to occur and, if so, restricts the issue of instructions to the load/store unit until the store instruction has been executed.
Abstract: An apparatus dynamically controls the out-of-order execution of load/store instructions by detecting a store violation condition and avoiding the penalty of a pipeline recovery process. The apparatus permits load and store instructions to issue and execute out of order and incorporates a unique store barrier cache, which is used to dynamically predict whether or not a store violation condition is likely to occur and, if so, to restrict the issue of instructions to the load/store unit until the store instruction has been executed and it is once again safe to proceed with out-of-order execution. The method implemented by the apparatus delivers performance within one percent of what is theoretically possible with a priori knowledge of load and store addresses.
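The prediction step described above might be sketched as follows in Python. The sketch assumes the store barrier cache is keyed by the store's instruction address and has unlimited capacity; both are assumptions made for illustration, and all names are hypothetical.

```python
class StoreBarrierCache:
    """Remembers stores that previously caused a store violation.

    Assumption: entries are keyed by the store's instruction address (PC).
    """

    def __init__(self):
        self.barrier_store_pcs = set()

    def record_violation(self, store_pc):
        self.barrier_store_pcs.add(store_pc)

    def predicts_violation(self, store_pc):
        return store_pc in self.barrier_store_pcs


def may_issue_to_lsu(cache, unexecuted_older_store_pcs):
    # Issue to the load/store unit is restricted while any older store that
    # is predicted to cause a violation has not yet executed.
    return not any(cache.predicts_violation(pc)
                   for pc in unexecuted_older_store_pcs)


barrier_cache = StoreBarrierCache()
before = may_issue_to_lsu(barrier_cache, [0x100])         # no history: issue freely
barrier_cache.record_violation(0x100)
while_pending = may_issue_to_lsu(barrier_cache, [0x100])  # barrier store pending
after_execute = may_issue_to_lsu(barrier_cache, [])       # store has executed
```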
TL;DR: In this article, the request to the cache store is operated in parallel with the request for data information from the main memory store, and a successful retrieval from the cache store aborts the retrieval from the main memory store.
Abstract: A cache store located in the processor provides a fast access look-aside store to blocks of data information previously fetched from the main memory store. The request to the cache store is operated in parallel to the request for data information from the main memory store. A successful retrieval from the cache store aborts the retrieval from the main memory store. Block loading of the cache store is performed autonomously from the processor operations. The cache store is cleared on cycles such as interrupts which require the processor to shift program execution. The store-aside configuration of the processor overlooks the backing store cycle on a store operand cycle and the cache store checking operations are performed next causing the cycles to be performed simultaneously.
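A rough software model of the look-aside read path is sketched below in Python. It is sequential, so the parallel cache and main memory requests are only modeled by counting how many memory requests a cache hit would have aborted; block loading happens on a miss here rather than autonomously. The class and attribute names are hypothetical.

```python
class LookAsideCache:
    """Sequential model of a look-aside cache store (illustrative only)."""

    def __init__(self, main_memory):
        self.main_memory = main_memory    # address -> value
        self.blocks = {}                  # data previously fetched from memory
        self.aborted_memory_requests = 0

    def read(self, address):
        # In hardware both requests start together; a cache hit aborts the
        # main memory request. Here we only record that the abort happened.
        if address in self.blocks:
            self.aborted_memory_requests += 1
            return self.blocks[address]
        value = self.main_memory[address]
        self.blocks[address] = value      # simplification: load on miss
        return value

    def clear(self):
        # Modeled after clearing on interrupts that shift program execution.
        self.blocks.clear()


cache = LookAsideCache({0x1000: 7})
miss_value = cache.read(0x1000)   # served by main memory
hit_value = cache.read(0x1000)    # served by the cache; memory request aborted
cache.clear()
```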
TL;DR: In this paper, the authors present a system that selectively monitors store instructions to support transactional execution of a process, where changes made during the transactional execution are not committed to the architectural state of a processor until the transactional execution successfully completes.
Abstract: One embodiment of the present invention provides a system that selectively monitors store instructions to support transactional execution of a process, wherein changes made during the transactional execution are not committed to the architectural state of a processor until the transactional execution successfully completes. Upon encountering a store instruction during transactional execution of a block of instructions, the system determines whether the store instruction is a monitored store instruction or an unmonitored store instruction. If the store instruction is a monitored store instruction, the system performs the store operation, and store-marks a cache line associated with the store instruction to facilitate subsequent detection of an interfering data access to the cache line from another process. If the store instruction is an unmonitored store instruction, the system performs the store operation without store-marking the cache line.
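The distinction between monitored and unmonitored stores can be sketched in Python as follows. The buffer-and-commit structure and all names are assumptions made for illustration; the abstract does not specify how speculative state is held.

```python
class TransactionalStoreBuffer:
    """Holds stores made during transactional execution (hypothetical model)."""

    def __init__(self):
        self.speculative = {}      # cache line -> value, not yet committed
        self.store_marked = set()  # lines written by monitored stores

    def store(self, cache_line, value, monitored):
        self.speculative[cache_line] = value
        if monitored:
            # Only monitored stores store-mark the line for conflict detection.
            self.store_marked.add(cache_line)

    def interferes(self, cache_line):
        # True if another process's access to this line would be detected.
        return cache_line in self.store_marked

    def commit(self, architectural_state):
        # Changes become architecturally visible only on successful completion.
        architectural_state.update(self.speculative)
        self.speculative.clear()
        self.store_marked.clear()


state = {}
txn = TransactionalStoreBuffer()
txn.store(0xA0, 1, monitored=True)
txn.store(0xB0, 2, monitored=False)
conflict_on_a = txn.interferes(0xA0)
conflict_on_b = txn.interferes(0xB0)
visible_before_commit = dict(state)
txn.commit(state)
```

Both stores are performed, but only the access to line 0xA0 would be flagged as interfering, since the unmonitored store to 0xB0 left no store-mark.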
TL;DR: In this article, an improved method and apparatus for performing program code conversion are provided, in particular for generating improved intermediate representations for use in program code conversion. A partial dead code elimination optimization technique identifies partially dead register definitions within a block of program code being translated.
Abstract: An improved method and apparatus for performing program code conversion are provided, in particular for generating improved intermediate representations for use in program code conversion. During program code conversion, a partial dead code elimination optimization technique is implemented to identify partially dead register definitions within a block of program code being translated. The partial dead code elimination is an optimization to the intermediate representation in the form of code motion for blocks of program code ending in non-computed branches or computed jumps, where target code for all dead child nodes of a partially dead register definition is prevented from being generated and target code for partially dead child nodes of a partially dead register definition is delayed from being generated until after target code is generated for all fully live child nodes for the partially dead register definition.
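What makes a register definition "partially dead" can be illustrated with a small Python sketch. It assumes liveness information is already available as a set of live-in registers per successor of the block; the function name and data layout are hypothetical, and the sketch covers only the classification step, not the code motion or target code generation.

```python
def classify_definition(register, live_in_by_successor):
    """Classify a register definition made in a block that ends in a branch.

    live_in_by_successor maps each successor label to the set of registers
    live on entry to that successor (assumed to be computed elsewhere).
    """
    if not live_in_by_successor:
        return "fully live"  # no successor information: conservatively keep
    live_flags = [register in live_in
                  for live_in in live_in_by_successor.values()]
    if all(live_flags):
        return "fully live"      # needed on every path
    if any(live_flags):
        return "partially dead"  # needed on some paths only
    return "dead"                # needed on no path


successors = {"taken": {"r1", "r2"}, "fall_through": {"r2"}}
r1_status = classify_definition("r1", successors)
r2_status = classify_definition("r2", successors)
r3_status = classify_definition("r3", successors)
```

Here the definition of r1 is needed only when the branch is taken, so code for it could be delayed to that path instead of being generated before the branch.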