TL;DR: In this paper, a hierarchical priority system is used to determine whether a branch test condition associated with a branch instruction is true, and independently, the target address for each branch instruction and a fall-through instruction address are determined.
Abstract: In a parallel data processing system having a plurality of separately operating arithmetic processing units, a method and apparatus allows a plurality of branch instructions to be operated upon in a single machine cycle. The branch instructions have associated therewith a hierarchical priority system and the method and apparatus determine which branch, if any, should be taken. In particular, the method and apparatus simultaneously determine, during the parallel execution of the branch instructions, whether any branch test condition associated with a branch instruction is true, and independently, the target address for each branch instruction and a fall-through instruction address if a branch instruction is not taken.
TL;DR: In this article, the results of an instruction trace mechanism are used to globally restructure the instructions in an executable program, using an actual execution profile (or instruction address trace) for a selected workload, to improve utilization of the existing hardware architecture.
Abstract: A system and method are provided that allows the results of an instruction trace mechanism to globally restructure the instructions The process reorders the instructions in an executable program, using an actual execution profile (or instruction address trace) for a selected workload, to improve utilization of the existing hardware architecture The reordering of instructions is implemented at a global level (ie, independent of procedure or other structural boundaries which maximizes speedup) running on various hardware platforms and adds the ability to preserve correctness and debuggability for reordered executables An unconditional branch instruction is added at the memory locations where reordered instructions previously were stored When a dynamic branch occurs, the program will attempt to access the instruction at the original address and the unconditional branch directs the program to the reordered location of the instruction and program integrity is maintained
TL;DR: In this paper, a run-time optimization strategy uses a trace picker to identify traces of program code in a native code pool, and a translator to translate the traces into a code cache where the traces are executed natively.
Abstract: A run-time optimization strategy uses a trace picker to identify traces of program code in a native code pool, and a translator to translate the traces into a code cache where the traces are executed natively. Static branch prediction hints are encoded in branch instruction in the translated traces. A program module implementing the present invention is initialized with an empty code cache and a pool of instruction in a native code pool. The trace picker analyzes the instructions in the native code pool and identifies traces of instructions that tend to be executed as a group. When a trace is identified, basic blocks lying along the trace path are translated into a code cache, with static branch predictions encoded into the branch instructions of the basic blocks based on branching behavior observed when the trace is identified. Control then passes to the basic blocks in the code cache, and the basic blocks in the code cache are executed natively using the static branch prediction hints encoded into the branch instructions.
TL;DR: In this paper, a branch prediction and verification method for superscalar pipelined computers is presented, and two sets of prefetch buffers are provided to allow branch prediction when multiple clock decoding is required by a multi-clock instruction.
Abstract: An apparatus and method for improving the performance of superscalar pipelined computers using branch prediction and verification that the predicted branch is correct. A predicted branch may be resolved in one of two distinct pipeline stages, and a method is provided for handling branches that are resolved in either of the pipeline stages. A branch verification method is provided that verifies that the architecturally correct instructions are in the decode and execution stages. Furthermore, two sets of prefetch buffers are provided to allow branch prediction when multiple clock decoding is required by a multi-clock instruction.
TL;DR: In this article, a trace recording hardware is used to reconstruct a program trace from an external debug host computer using an abbreviated real-time program trace with reference to a program listing.
Abstract: A digital microprocessor having a processor core is provided with trace recording hardware capable of receiving, analyzing and temporarily storing data indicative of program instructions (i.e., instruction types) executed by the processor core and of their respective addresses. The trace recording hardware outputs an abbreviated real-time program trace, containing minimum data necessary to reconstruct a full program trace, via a JTAG port to an external debug host computer where a user may reconstruct the full program trace with reference to a program listing. The abbreviation scheme used by the trace recording hardware is preferably achieved by comparing instruction types received from the processor core to at least one pre-defined instruction type, and abbreviating or discarding the corresponding address information as a function of the particular instruction type. The trace recording hardware may be set into one of two modes by the user. In the first mode, the trace recording hardware stalls the processor core when it reaches its maximum storage capacity for instruction type and/or address data until storage becomes available. In the second mode, when the trace recording hardware becomes full it discards data received from the processor core and stores an overflow indicator. The program trace may be initiated and stopped by the user or by signals internal to the digital microprocessor.