About: Half-precision floating-point format is a research topic. Over the lifetime, 106 publications have been published within this topic receiving 2885 citations. The topic is also known as: binary16 & float.
TL;DR: A 40-bit floating-point format (a sign bit, 7 exponent bits, and four 8-bit fraction blocks) is proposed for machines with byte-wide data streams. It targets accuracy greater than the IEEE 32-bit format and allows floating-point commands to be executed in a fixed small number of cycles, thus advancing the capabilities of doing floating-point arithmetic on a SIMD machine.
Abstract: A floating-point system and method use a format that includes a sign bit, an exponent part having a plurality of bits, and a fraction part having a plurality of multi-bit blocks. Floating-point operation is based on block shifts of the fraction part, with each shift of one block associated with an increment or decrement of the exponent part by one count. The illustrated format is suited to accuracy greater than that of the IEEE 32-bit floating-point format and is intended for machines having byte-wide (8-bit) data streams. The preferred format consists of a sign bit, 7 exponent bits, and 4 fraction bytes of eight bits each, for a total of 40 bits. This format and implementation allow floating-point commands to be executed in a fixed small number of cycles, thus advancing the capabilities of doing floating-point arithmetic on a SIMD machine. The floating-point implementation is adaptable to multiprocessor parallel array processor computing systems and to parallel array processing with a simplified architecture adaptable to chip implementation. The array provided is an N-dimensional array of byte-wide processing units, each coupled with an adequate segment of byte-wide memory and control logic. A partitionable section of the array containing several processing units is contained on a silicon chip arranged with eight elements of the processing array, each preferably consisting of a combined processing element with a local memory for processing bit-parallel bytes of information in a clock cycle. A processor system (or subsystem) comprises an array of pickets, a communication network, an I/O system, and a SIMD controller consisting of a microprocessor, a canned-routine processor, and a microcontroller that runs the array.
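The block-shift idea in this abstract can be illustrated with a small software model. The sketch below is not the patented implementation. It assumes that one exponent count corresponds to one 8-bit block (a radix of 256), which is what "each shift of one block" paired with "one count" suggests, and it assumes an exponent bias of 64, which the abstract does not state. The names `decode`, `normalize`, and `EXP_BIAS` are hypothetical.

```python
from fractions import Fraction

EXP_BIAS = 64        # assumption: the abstract does not give a bias
FRACTION_BYTES = 4   # from the abstract: four 8-bit fraction blocks


def decode(sign, exponent, fraction_bytes):
    """Return the exact value of a (sign, 7-bit exponent, 4 fraction bytes) triple."""
    assert sign in (0, 1) and 0 <= exponent < 128
    assert len(fraction_bytes) == FRACTION_BYTES
    frac = int.from_bytes(bytes(fraction_bytes), "big")
    # The fraction is read as 0.b0 b1 b2 b3 in base 256 (an assumption of this sketch).
    value = Fraction(frac, 256 ** FRACTION_BYTES) * Fraction(256) ** (exponent - EXP_BIAS)
    return -value if sign else value


def normalize(exponent, fraction_bytes):
    """Shift the fraction left one whole byte at a time until the leading byte is nonzero."""
    blocks = list(fraction_bytes)
    if not any(blocks):
        return exponent, blocks      # zero has no normalized form
    while blocks[0] == 0 and exponent > 0:
        blocks = blocks[1:] + [0]    # one block shift ...
        exponent -= 1                # ... paired with one exponent count
    return exponent, blocks
```

Because normalization moves whole bytes, it needs at most three shifts for a four-byte fraction, which is consistent with the abstract's claim of a fixed small number of cycles on byte-wide hardware.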
TL;DR: This work explores FPGA implementations of addition and multiplication for IEEE single-precision floating-point numbers. Prototypes implemented on Altera FLEX8000s reach peak rates of 7 MFlops for 32-bit addition and 2.3 MFlops for 32-bit multiplication.
Abstract: Floating-point operations are hard to implement on FPGAs because of the complexity of their algorithms. On the other hand, many scientific problems require floating-point arithmetic with high levels of accuracy in their calculations. Therefore, we have explored FPGA implementations of addition and multiplication for IEEE single-precision floating-point numbers. Customizations were performed where possible in order to save chip area or get the most out of our prototype board. The implementations trade off area and speed for accuracy. The adder is a bit-parallel adder, and the multiplier is a digit-serial multiplier. Prototypes have been implemented on Altera FLEX8000s, and peak rates of 7 MFlops for 32-bit addition and 2.3 MFlops for 32-bit multiplication have been obtained.
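To make the term "digit-serial multiplier" concrete, here is a minimal software sketch of the general technique applied to the 24-bit significands of two single-precision values. It is not the paper's hardware design. The digit width of 4 bits is an assumption (the abstract does not give one), and `significand` and `digit_serial_multiply` are hypothetical helper names.

```python
import struct

DIGIT_BITS = 4  # assumption: the abstract does not state the digit width


def significand(x):
    """Return the 24-bit significand (hidden one included) of a normal float32 value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    exp = (bits >> 23) & 0xFF
    assert 0 < exp < 255, "sketch handles normal numbers only"
    return (bits & 0x7FFFFF) | (1 << 23)


def digit_serial_multiply(a, b, width=24):
    """Multiply unsigned integers a and b, consuming DIGIT_BITS of b per step."""
    product = 0
    steps = 0
    mask = (1 << DIGIT_BITS) - 1
    for shift in range(0, width, DIGIT_BITS):
        digit = (b >> shift) & mask        # next digit of the multiplier
        product += (a * digit) << shift    # accumulate one partial product
        steps += 1
    return product, steps
```

Each step needs only a 24-by-4-bit partial product, so the same small piece of logic can be reused over several steps. That is the usual way a digit-serial design gives up speed to save area.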
TL;DR: This work presents a library of fully parameterized hardware modules for format control, arithmetic operations, and conversion to and from any fixed-point format. The format converters enable hybrid implementations that combine fixed-point and floating-point calculations.
Abstract: We present a parameterized floating-point library for use with reconfigurable hardware. Our format is both general and flexible. All IEEE formats are a subset of our format, as are all previously published floating-point formats for reconfigurable hardware. We have developed a library of fully parameterized hardware modules for format control, arithmetic operations and conversion to and from any fixed-point format. The format converters allow for hybrid implementations that combine both fixed and floating-point calculations. This permits the designer to choose between the increased range of floating-point and the increased precision of fixed-point within the same application. We illustrate the use of this library with a hybrid implementation of the K-means clustering algorithm applied to multispectral satellite images.
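A software model can show what "parameterized" means here: the exponent and mantissa widths are arguments rather than constants, so half precision and single precision become two settings of the same routine. The sketch below is an illustration only, not the library's hardware modules. It assumes an IEEE-style bias and hidden leading one, covers normal numbers only, and uses the hypothetical names `decode` and `to_fixed`.

```python
from fractions import Fraction


def decode(bits, exp_bits, man_bits):
    """Decode a normal number from a sign/exponent/mantissa word of the given widths."""
    bias = (1 << (exp_bits - 1)) - 1                  # assumption: IEEE-style bias
    sign = bits >> (exp_bits + man_bits)
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    assert 0 < exp < (1 << exp_bits) - 1, "sketch covers normal numbers only"
    value = (1 + Fraction(man, 1 << man_bits)) * Fraction(2) ** (exp - bias)
    return -value if sign else value


def to_fixed(value, frac_bits):
    """Convert to a signed fixed-point integer with frac_bits fractional bits.

    Rounds to nearest, with ties going to the even integer.
    """
    return round(value * (1 << frac_bits))


half = decode(0x3E00, exp_bits=5, man_bits=10)          # half-precision word
single = decode(0x3FC00000, exp_bits=8, man_bits=23)    # single-precision word
```

Both calls decode the same value from two different widths, and `to_fixed` stands in for the kind of format converter that lets a design move between floating-point range and fixed-point precision.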
TL;DR: In this article, a technique for encoding multiple floating point formats into a double precision floating point number by padding single word floating point numbers with zeros to form a 64-bit double word was proposed.
Abstract: A technique for encoding multiple floating point formats into a double precision floating point number by padding single word floating point numbers with zeros to form a 64-bit double word in a way that allows a single precision arithmetic logic unit to be built on top of a double precision arithmetic logic unit. The formatting circuitry of the invention requires only small differences in the hardware for single and double precision operations so as to simplify the arithmetic logic unit and the multiplier of the floating point processing units. The encoding technique of the invention includes right justifying the exponent and mantissa of the floating point number in a "common format" such that rounding of the mantissa need only occur in one place, thereby greatly simplifying the rounding procedure. The technique of the invention also removes multiplexers from critical speed paths in the floating point processing units when it is desired to accommodate multiple data formats.