TL;DR: This work introduces a new approach to decimal floating point which not only provides the strict results which are necessary for commercial applications but also meets the constraints and requirements of the IEEE 854 standard.
Abstract: Decimal arithmetic is the norm in human calculations, and human centric applications must use a decimal floating point arithmetic to achieve the same results. Initial benchmarks indicate that some applications spend 50% to 90% of their time in decimal processing, because software decimal arithmetic suffers a 100/spl times/ to 1000/spl times/ performance penalty over hardware. The need for decimal floating point in hardware is urgent. Existing designs, however, either fail to conform to modern standards or are incompatible with the established rules of decimal arithmetic. We introduce a new approach to decimal floating point which not only provides the strict results which are necessary for commercial applications but also meets the constraints and requirements of the IEEE 854 standard. A hardware implementation of this arithmetic is in development, and it is expected that this will significantly accelerate a wide variety of applications.
TL;DR: Two novel designs for fixed-point decimal multiplication are presented that utilize decimal carry-save addition to reduce the critical path delay and can be extended to support decimal floating-point multiplication.
Abstract: Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. We present two novel designs for fixed-point decimal multiplication that utilize decimal carry-save addition to reduce the critical path delay. First, a multiplier that stores a reduced number of multiplicand multiples and uses decimal carry-save addition in the iterative portion of the design is presented. Then, a second multiplier design is proposed with several notable improvements including fast generation of multiplicand multiples that do not need to be stored, the use of decimal (4:2) compressors, and a simplified decimal carry-propagate addition to produce the final product. When multiplying two n-digit operands to produce a 2n-digit product, the improved multiplier design has a worst-case latency of n+4 cycles and an initiation interval of n+1 cycles. Three data-dependent optimizations, which help reduce the multipliers' average latency, are also described. The multipliers presented can be extended to support decimal floating-point multiplication.
TL;DR: Two novel architectures for parallel decimal multipliers are introduced based on a new algorithm for decimal carry-save multioperand addition that uses a novel BCD-4221 recoding for decimal digits and three schemes for fast and efficient generation of partial products in parallel are presented.
Abstract: This paper introduces two novel architectures for parallel decimal multipliers. Our multipliers are based on a new algorithm for decimal carry-save multioperand addition that uses a novel BCD-4221 recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. We also present three schemes for fast and efficient generation of partial products in parallel. The recoding of the BCD-8421 multiplier operand into minimally redundant signed-digit radix-10, radix-4 and radix-5 representations using new recoders reduces the complexity of partial product generation. In addition, SD radix-4 and radix-5 recodings allow the reuse of a conventional parallel binary radix-4 multiplier to perform combined binary/decimal multiplications. Evaluation results show that the proposed architectures have interesting area-delay figures compared to conventional Booth radix-4 and radix-8 parallel binary multipliers and other representative alternatives for decimal multiplication.
TL;DR: The IBM POWER6 processor--on which a substantial amount of area has been devoted to increasing performance of both scientific and commercial workloads--is the first commercial hardware implementation of the IEEE 754R Binary Floating-point Arithmetic Standard.
Abstract: The IBM POWER6™ microprocessor core includes two accelerators for increasing performance of specific workloads. The vector multimedia extension (VMX) provides a vector acceleration of graphic and scientific workloads. It provides single instructions that work on multiple data elements. The instructions separate a 128-bit vector into different components that are operated on concurrently. The decimal floating-point unit (DFU) provides acceleration of commercial workloads, more specifically, financial transactions. It provides a new number system that performs implicit rounding to decimal radix points, a feature essential to monetary transactions. The IBM POWER™ processor instruction set is substantially expanded with the addition of these two accelerators. The VMX architecture contains 176 instructions, while the DFU architecture adds 54 instructions to the base architecture. The IEEE 754R Binary Floating-Point Arithmetic Standard defines decimal floating-point formats, and the POWER6 processor--on which a substantial amount of area has been devoted to increasing performance of both scientific and commercial workloads--is the first commercial hardware implementation of this format.
TL;DR: The direct production of decimal sums offers a significant improvement in addition over methods requiring decimal correction, and these techniques are illustrated in the eight-digit adder which appears in the System/360 Model 195.
Abstract: Parallel decimal arithmetic capability is becoming increasingly attractive with new applications of computers in a multi-programming environment. The direct production of decimal sums offers a significant improvement in addition over methods requiring decimal correction. These techniques are illustrated in the eight-digit adder which appears in the System/360 Model 195.