
### 1. Definition of Constants

- `b`: base (radix), 2 or 10
- `p`: precision
- `emax`: maximum value of the exponent
- `emin`: minimum value of the exponent

For all floating-point formats, `emin = 1 - emax`.

### 2. Representation of floating-point format

A floating-point format is identified by its radix and the number of bits in its encoding (e.g. binary64).

A finite floating-point number is represented as:

$(-1)^{s} \times b^{e} \times m$

- $s$ is 0 or 1
- $e$ is an integer with $emin \le e \le emax$
- $m$ is a number represented by a digit string of the form $d_0 . d_1 d_2 \dots d_{p-1}$, where each $d_i$ is an integer digit with $0 \le d_i < b$ (therefore $0 \le m < b$)

In addition, every format contains:

- Two infinities, +∞ and −∞.
- Two NaNs (Not a Number), qNaN (quiet) and sNaN (signaling)
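As a rough illustration of the formula above, the sketch below evaluates $(-1)^{s} \times b^{e} \times m$ from a sign, an exponent, and a list of digits. The name `fp_value` and its parameters are made up for this example; exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction

def fp_value(s, e, digits, b):
    """Evaluate (-1)^s * b^e * m, where m = d0.d1d2...d(p-1) in base b."""
    # Each digit d_i contributes d_i * b^(-i) to the significand m.
    m = sum(Fraction(d, b ** i) for i, d in enumerate(digits))
    return (-1) ** s * Fraction(b) ** e * m

# 1.001 (binary) * 2^3
print(fp_value(0, 3, [1, 0, 0, 1], 2))   # 9
# -(1.25 (decimal) * 10^-1)
print(fp_value(1, -1, [1, 2, 5], 10))    # -1/8
```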

### 3. Binary Floating Number

The binary floating-point format is the most familiar representation; it was already standardized in IEEE 754-1985.

A binary floating-point number is encoded in three fields:

- S: **sign**, 1 bit
- E: **exponent**, w bits, E = e + bias
- T: **tail** (trailing significand), t bits, t = p − 1, $T = d_1 d_2 \dots d_{p-1}$ (each $d_i$ is a binary digit)

Param | binary16 | binary32 | binary64 | binary128 | binary{k} |
---|---|---|---|---|---|
k | 16 | 32 | 64 | 128 | k, a multiple of 32 |
p | 11 | 24 | 53 | 113 | k − round(4·log2(k)) + 13 |
emax | 15 | 127 | 1023 | 16383 | 2^(k−p−1) − 1 |
bias | 15 | 127 | 1023 | 16383 | emax |
sign bit | 1 | 1 | 1 | 1 | 1 |
w | 5 | 8 | 11 | 15 | round(4·log2(k)) − 13 |
t | 10 | 23 | 52 | 112 | k − w − 1 |
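The general formulas in the last column can be checked with a few lines of Python. This is only a sketch: `binary_params` is a name invented here, and the formulas are meant for k ≥ 128 (they also happen to reproduce binary64, but not binary16 or binary32, whose parameters are fixed individually).

```python
import math

def binary_params(k):
    """Parameters of binary{k} from the general formulas (k >= 128, k a multiple of 32)."""
    w = round(4 * math.log2(k)) - 13   # exponent field width
    t = k - w - 1                      # trailing significand width
    p = t + 1                          # precision: t stored bits plus one implicit bit
    emax = 2 ** (k - p - 1) - 1
    return {"k": k, "p": p, "emax": emax, "bias": emax, "w": w, "t": t}

print(binary_params(128))
print(binary_params(256))
```

Note that `p = t + 1` is the same as the table's `k − round(4·log2(k)) + 13`, because `t = k − w − 1` and `w = round(4·log2(k)) − 13`.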

### Example of binary32

#### 1). Normalized number

E is an 8-bit unsigned integer in the range 0 to 255, so E − bias lies in the range −127 to 128.

For a normalized number, the bits $E_0 \dots E_{w-1}$ are **not all 0** (E = 0, E − bias = −127) and **not all 1** (E = 255, E − bias = 128). This leaves $1 \le E \le 254$, so the exponent E − bias ranges from −126 to 127.

The value of the floating-point number is then $(-1)^{S} \times 2^{E-bias} \times m$, with

$m = 1 + 2^{-t}\,T = 1 + \sum_{i=1}^{t} 2^{-i} d_i$

so m is greater than or equal to 1 and less than 2, like the leading part of a number in binary scientific notation.

S, E, and T are stored in binary.

Example:

float (binary32): 9.0

-> Convert to binary: $1001.0_2 = 1.001_2 \times 2^{3}$

-> Representation: $(-1)^{0} \times 2^{130-127} \times 1.001_2$ (bias: 127, so E = 3 + 127 = 130)

-> Layout:

S(+) E(130) T(001 followed by 20 zeros)

0 10000010 00100000000000000000000
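The layout of 9.0 can be inspected with Python's standard `struct` module. This is a small sketch; the helper name `binary32_fields` is made up for the example.

```python
import struct

def binary32_fields(x):
    """Split a float into the S, E, T fields of its binary32 encoding."""
    # Pack as a big-endian binary32, then reinterpret the 4 bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31            # 1 sign bit
    e = (bits >> 23) & 0xFF   # 8 exponent bits (biased)
    t = bits & 0x7FFFFF       # 23 trailing significand bits
    return s, e, t

s, e, t = binary32_fields(9.0)
print(f"{s:01b} {e:08b} {t:023b}")   # 0 10000010 00100000000000000000000
```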

**Tip.**

Why does the **bias** exist?

E is stored as an unsigned integer, so its value lies in the range 0 to 255. The actual exponent, however, can be negative. To keep the range roughly symmetric, the middle value (127) is added to the exponent when it is stored and subtracted again when it is used. E − bias therefore spans −127 to 128, of which −127 (E = 0) and 128 (E = 255) are reserved for subnormal numbers and special values.

#### 2). Subnormal Number

When $E_0 \dots E_{w-1}$ are all 0 (E = 0), the exponent is fixed at emin = −126 (NOT E − bias = −127; this is the one exception to the rule), and the implicit leading bit of m changes from 1 to 0.

Because $0 \le T < 2^{t}$ gives $1 \le m < 2$, a normalized number cannot represent 0 or any magnitude below $b^{emin}$. Therefore, when the absolute value of a number is less than $b^{emin}$, it is represented in the subnormal format.

The value of the floating-point number is then $(-1)^{S} \times 2^{emin-t} \times T = (-1)^{S} \times 2^{emin} \times \sum_{i=1}^{t} 2^{-i} d_i$

(0 is represented as `S 000...000`.)
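The normal and subnormal formulas can be put side by side in a short decoder. This is a sketch for binary32 only; `binary32_value` and the constant names are invented here, and exact fractions are used so that the result can be compared with the formulas directly.

```python
import struct
from fractions import Fraction

BIAS, T_BITS, EMIN = 127, 23, -126   # binary32 parameters

def binary32_value(bits):
    """Exact value of a finite binary32 bit pattern, computed from S, E, T."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    t = bits & 0x7FFFFF
    if e == 0xFF:
        raise ValueError("infinity or NaN")
    if e == 0:
        # Subnormal (or zero): 2^(emin - t) * T, no implicit leading 1.
        mag = Fraction(2) ** (EMIN - T_BITS) * t
    else:
        # Normal: 2^(E - bias) * (1 + 2^-t * T).
        mag = Fraction(2) ** (e - BIAS) * (1 + Fraction(t, 2 ** T_BITS))
    return -mag if s else mag

smallest = binary32_value(0x00000001)
print(smallest == Fraction(1, 2 ** 149))                                   # True
print(float(smallest) == struct.unpack(">f", struct.pack(">I", 1))[0])     # True
```

The smallest positive subnormal is $2^{emin-t} = 2^{-149}$, well below the smallest normal number $2^{-126}$.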

#### 3). Special Values

`+∞`: S = 0, every bit of E is 1, T = 0

`-∞`: S = 1, every bit of E is 1, T = 0

`NaN`: every bit of E is 1 and T ≠ 0

The difference between a quiet NaN and a signaling NaN is a flag bit in the significand field.

Quiet NaNs do not raise any additional exceptions (FPUs do not raise hardware exceptions) and propagate through most operations. The exceptions are operations that cannot simply pass a NaN to the output intact, such as format conversions or some comparison operations.

A signaling NaN is the opposite: using it as an operand raises an exception.
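A minimal classifier for binary32 bit patterns might look like the sketch below. The function name is hypothetical, and the position of the quiet/signaling flag is an assumption: the text only says it is a bit of the significand field, and the code uses the convention recommended by IEEE 754-2008 (most significant bit of T set means quiet), which some older hardware does not follow.

```python
def classify_binary32(bits):
    """Classify a binary32 bit pattern as finite, infinity, or NaN."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    t = bits & 0x7FFFFF
    if e != 0xFF:
        return "finite"
    if t == 0:
        return "-inf" if s else "+inf"
    # Assumption: the top bit of T is the quiet flag (1 = quiet, 0 = signaling).
    return "qNaN" if t >> 22 else "sNaN"

print(classify_binary32(0x7F800000))   # +inf
print(classify_binary32(0x7FC00000))   # qNaN
```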

### 4. Decimal Floating-Point Format Number

#### A brief look

The decimal floating-point format has two encoding methods: DPD (Densely Packed Decimal) and BID (Binary Integer Decimal).

#### 4.0 Value of Decimal Floating-Point Number

$(-1)^{S} \times T \times 10^{E - bias}$

The values of T and E are calculated from the combination field and their respective continuation fields.

#### 4.1 Organization

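- S: sign, 1 bit
- Comb: combination field, 5 bits
- E: exponent continuation, w bits, E = e + bias
- T: tail (trailing significand), t bits; in DPD they hold the remaining p − 1 digits $d_1 d_2 \dots d_{p-1}$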

Param | decimal32 | decimal64 | decimal128 | decimal{k} |
---|---|---|---|---|
k | 32 | 64 | 128 | k, a multiple of 32 |
p | 7 | 16 | 34 | 9k/32 − 2 |
emax | 96 | 384 | 6144 | 3 × 2^(k/16+3) |
bias | 101 | 398 | 6176 | emax + p − 2 |
sign bit | 1 | 1 | 1 | 1 |
w | 6 | 8 | 12 | k/16 + 4 |
t | 20 | 50 | 110 | 15k/16 − 10 |
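The formulas in the decimal{k} column can be evaluated directly. The following sketch (with the made-up name `decimal_params`) assumes k is a multiple of 32, so all divisions are exact.

```python
def decimal_params(k):
    """Parameters of decimal{k} from the general formulas (k a multiple of 32)."""
    p = 9 * k // 32 - 2              # precision in decimal digits
    emax = 3 * 2 ** (k // 16 + 3)
    bias = emax + p - 2
    w = k // 16 + 4                  # exponent continuation width
    t = 15 * k // 16 - 10            # trailing significand width
    return {"k": k, "p": p, "emax": emax, "bias": bias, "w": w, "t": t}

for k in (32, 64, 128):
    print(decimal_params(k))
```

As a consistency check, 1 sign bit + 5 Comb bits + w + t always adds up to k.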

#### 4.2 Significand -- Difference between BID & DPD

BID and DPD are both used to encode each part in binary for storage. The difference lies in the significand: BID takes the significant digits of the scientific notation as one integer and stores it directly in binary, while DPD uses a mapping table in which every 3 decimal digits correspond to 10 binary digits.

Some hardware, such as IBM POWER, supports decimal arithmetic directly. In that case the format of this standard is used as-is to store and compute numbers; otherwise DPD encoding is needed to convert the digits to binary for storage.

#### 4.3 Comb

In both DPD and BID cases, the most significant 4 bits of the significand (which actually only have 10 possible values [0~9]) are combined with the most significant 2 bits of the exponent (3 possible values) to use 30 of the 32 possible values of a 5-bit field. The remaining combinations encode infinities and NaNs.

Combination field | Exponent MSBs | Significand MSBs | Other |
---|---|---|---|
00mmm | 00 | 0mmm | — |
01mmm | 01 | 0mmm | — |
10mmm | 10 | 0mmm | — |
1100m | 00 | 100m | — |
1101m | 01 | 100m | — |
1110m | 10 | 100m | — |
11110 | — | — | ±Infinity |
11111 | — | — | NaN. Sign bit ignored. First bit of exponent continuation field determines if NaN is signaling. |

The leading `0` and `100` shown under Significand MSBs are NOT stored in the significand field. They are implied by the Comb part and can be calculated from it.
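The table can be read as a small decoding procedure. The sketch below follows it row by row; `decode_comb` and its return format are made up for this example.

```python
def decode_comb(g):
    """Decode a 5-bit combination field.

    Returns ("finite", exponent_msbs, leading_significand_bits),
    ("inf",) or ("nan",).
    """
    if g >> 3 != 0b11:
        # 00mmm / 01mmm / 10mmm: exponent MSBs, then 0mmm.
        return ("finite", g >> 3, g & 0b111)
    if (g >> 1) & 0b11 != 0b11:
        # 1100m / 1101m / 1110m: exponent MSBs, then 100m (8 or 9).
        return ("finite", (g >> 1) & 0b11, 8 + (g & 1))
    # 11110 is infinity, 11111 is NaN.
    return ("inf",) if g & 1 == 0 else ("nan",)

print(decode_comb(0b01101))   # ('finite', 1, 5)
print(decode_comb(0b11011))   # ('finite', 1, 9)
print(decode_comb(0b11110))   # ('inf',)
```

Counting the finite results over all 32 inputs gives 30, matching the 3 × 10 combinations described above.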

##### 4.3.1 DPD

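The Comb part occupies 5 bits (shown here for decimal64), and **these 5 bits come from the exponent (E) and the significand (T).**

- $G_0 G_1 G_2 G_3 G_4$ with $G_0 G_1 \ne 11$

  $G_0 G_1$ are the two most significant bits of the exponent.

  $G_2 G_3 G_4$ are the three low bits of the most significant digit of the significand ($0 G_2 G_3 G_4$, a digit from 0 to 7).

- $1 1 G_2 G_3 G_4$ with $G_2 G_3 \ne 11$

  $G_2 G_3$ are the two most significant bits of the exponent.

  $G_4$: the most significant digit of the significand is $8_{(10)} + G_4$ (that is, $1 0 0 G_4$, a digit of 8 or 9).

- $1 1 1 1 G_4$

  Special value: infinity ($G_4 = 0$) or NaN ($G_4 = 1$).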

##### 4.3.2 BID (Binary Integer Decimal)

BID encodes the significand directly as a binary integer.

The positions where the exponent and the significand start are not fixed; they depend on the two most significant bits of the Comb.

To describe this, we index the bits of the whole encoding from $b_0$ (the sign bit) to $b_{k-1}$.

The rules:

- When $b_1 b_2 \ne 11_{(2)}$, the exponent consists of $b_1 b_2$ and the following w bits; the remaining bits are the significand.
- When $b_1 b_2 = 11_{(2)}$ and $b_3 b_4 \ne 11_{(2)}$, the exponent consists of $b_3 b_4$ and the following w bits; the remaining bits are the significand, prefixed by the implied bits 100.
- ∞ and NaN follow the same rules as in DPD.
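For decimal64 (w = 8, bias = 398), these rules can be sketched as follows. The function name `decode_bid64` is invented for this example, and the sketch does not handle non-canonical encodings whose coefficient exceeds 16 decimal digits.

```python
def decode_bid64(bits):
    """Split a finite decimal64 BID encoding into (sign, exponent, coefficient).

    The returned exponent is unbiased, so the value is
    (-1)**sign * coefficient * 10**exponent.
    """
    W, BIAS = 8, 398                       # decimal64 parameters
    exp_mask = 2 ** (W + 2) - 1            # the exponent has w + 2 bits
    sign = bits >> 63                      # b0
    if (bits >> 61) & 0b11 != 0b11:
        # b1b2 != 11: exponent is b1b2 plus the next w bits,
        # the remaining 53 bits are the coefficient.
        biased = (bits >> 53) & exp_mask
        coeff = bits & (2 ** 53 - 1)
    elif (bits >> 59) & 0b11 != 0b11:
        # b1b2 == 11, b3b4 != 11: exponent is b3b4 plus the next w bits,
        # the coefficient is the implied prefix 100 plus the remaining 51 bits.
        biased = (bits >> 51) & exp_mask
        coeff = (0b100 << 51) | (bits & (2 ** 51 - 1))
    else:
        raise ValueError("infinity or NaN")
    return sign, biased - BIAS, coeff

one = (398 << 53) | 1                      # coefficient 1, exponent 0
print(decode_bid64(one))                   # (0, 0, 1)
```

A coefficient of 2^53 or more does not fit into 53 bits, which is exactly when the second form with the implied `100` prefix is needed.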

#### 4.4 Exponent

For non-special values in BID, the exponent consists of two Comb bits (00/01/10) and the following w bits, giving $3 \times 2^{w}$ possible exponent values.

For DPD (decimal64), the Comb has 5 bits and the exponent continuation has w = 8 bits; the full exponent consists of the two Comb bits (00/01/10), which are not stored in the exponent field itself, followed by those w bits.

#### 4.5 Significand

In both cases (BID and DPD), the most significant part of the significand is not stored in T; it comes from the Comb part, and the remaining bits hold the rest of the significand.

##### 4.5.1 BID

T will be encoded in binary directly.

##### 4.5.2 DPD

BCD encoding uses four bits for each digit, which wastes a significant share of the binary states (10 used out of 16).

DPD therefore maps decimal digits to binary with a denser code than 8421 BCD. To save space, we look for a positive power of two such that the closest power of ten below it is as close to it as possible, i.e. their ratio is close to 1. On the other hand, the power of two should be as small as possible, so that the granularity stays small and the groups are easy to fit into formats with few bits.

We use 10 bits (0 to $2^{10} - 1 = 1023$) to represent 3 decimal digits (0 to 999).

(With BCD, 10 bits hold only 2 full digits, with 2 bits left over.)
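The search for a good group size can be reproduced numerically. The sketch below is only an illustration of the argument; `packing_ratio` is a made-up helper and the limit of 20 bits is an arbitrary choice for the example.

```python
def packing_ratio(n_bits):
    """Return (digits, ratio) for the largest power of ten 10^digits <= 2^n_bits.

    ratio is the share of the 2^n_bits binary states that is actually used.
    """
    digits = len(str(2 ** n_bits)) - 1
    return digits, 10 ** digits / 2 ** n_bits

for n in range(1, 21):
    d, r = packing_ratio(n)
    print(f"{n:2d} bits -> {d} digits, ratio {r:.3f}")

best = max(range(1, 21), key=lambda n: packing_ratio(n)[1])
print(best)   # 10
```

With 10 bits, 1000 of the 1024 states are used (about 97.7%), compared with 10 of 16 (62.5%) for a single BCD digit.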

How are digits represented with the DPD table?

In the table, the DPD-encoded values are on the left and the original three decimal digits are on the right. (The 10-bit encoded group is called a declet.)

For example, consider line 3 of the table.

The bits `a`, `b`, `c` and `g`, `h`, `i` (green background) and `f` (purple background) are the same bits on both sides. The only difference between the two sides is their relative positions.

The DPD-encoded sequence on the left, `abcghf101i` (10 bits), corresponds to three digits that are each encoded in binary (BCD code): `0abc` (d2), `100f` (d1), `0ghi` (d0).

How is this written as a formula?

$[0abc]_{(2)} \times 100_{(10)} + [100f]_{(2)} \times 10_{(10)} + [0ghi]_{(2)} \times 1_{(10)}$

or, in terms of bit positions within the declet ($b_9$ is the leftmost bit, $b_0$ the rightmost):

$[0 b_9 b_8 b_7]_{(2)} \times 100_{(10)} + [1 0 0 b_4]_{(2)} \times 10_{(10)} + [0 b_6 b_5 b_0]_{(2)} \times 1_{(10)}$
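A declet decoder can be sketched in Python as below. Only line 3 is spelled out in the text; the other seven rows are filled in from the standard DPD table as an assumption, and the names `decode_declet`, `small`, and `large` are made up for this example.

```python
def decode_declet(v):
    """Decode a 10-bit DPD declet into a number from 0 to 999."""
    b = [(v >> i) & 1 for i in range(10)]   # b[9] is the leftmost bit

    def small(x, y, z):                     # digit 0xyz (0 to 7)
        return 4 * x + 2 * y + z

    def large(x):                           # digit 100x (8 or 9)
        return 8 + x

    if b[3] == 0:                           # abc def 0 ghi
        d2, d1, d0 = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), small(b[2], b[1], b[0])
    elif (b[2], b[1]) == (0, 0):            # abc def 100 i
        d2, d1, d0 = small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), large(b[0])
    elif (b[2], b[1]) == (0, 1):            # abc ghf 101 i  (line 3)
        d2, d1, d0 = small(b[9], b[8], b[7]), large(b[4]), small(b[6], b[5], b[0])
    elif (b[2], b[1]) == (1, 0):            # ghc def 110 i
        d2, d1, d0 = large(b[7]), small(b[6], b[5], b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 0):            # ghc 00f 111 i
        d2, d1, d0 = large(b[7]), large(b[4]), small(b[9], b[8], b[0])
    elif (b[6], b[5]) == (0, 1):            # dec 01f 111 i
        d2, d1, d0 = large(b[7]), small(b[9], b[8], b[4]), large(b[0])
    elif (b[6], b[5]) == (1, 0):            # abc 10f 111 i
        d2, d1, d0 = small(b[9], b[8], b[7]), large(b[4]), large(b[0])
    else:                                   # xxc 11f 111 i
        d2, d1, d0 = large(b[7]), large(b[4]), large(b[0])
    return 100 * d2 + 10 * d1 + d0

# Line 3 with abc = 101, f = 1, ghi = 110: digits 5, 9, 6.
print(decode_declet(0b1011111010))   # 596
```

Decoding all 1024 bit patterns yields every value from 0 to 999; the 24 surplus patterns are redundant encodings in the last row.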

#### 4.6 Model

##### BID

```
s 00eeeeeeee (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 01eeeeeeee (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 10eeeeeeee (0)ttt tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
```

```
s 1100eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1101eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
s 1110eeeeeeee (100)t tttttttttt tttttttttt tttttttttt tttttttttt tttttttttt
```

**Special Value.**

```
s 11110 xx...x ±infinity
s 11111 0x...x qNaN
s 11111 1x...x sNaN
```

##### DPD

```
s 00 TTT (00)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 01 TTT (01)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 10 TTT (10)eeeeeeee (0TTT)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
```

```
s 1100 T (00)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1101 T (01)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
s 1110 T (10)eeeeeeee (100T)[tttttttttt][tttttttttt][tttttttttt][tttttttttt][tttttttttt]
```
