## TL;DR

For precise calculations, favor arbitrary-precision decimals or equivalents like BigDecimal over floating-point numbers.

Additionally, avoid unnecessary rounding. When required, limit rounding to the final step to maintain as much accuracy as possible.

## Table of contents

- Prologue
- First things first
- Bits are not enough
- Bits and integers
- Bits and other real numbers
- Fixed-point representation
- Floating-point representation
- Issues and standards
- Floating-point data types
- Floating-point issues
- Decimals to the rescue
- Beware of rounding
- Decimals in other technologies
- Wrapping Up
- References

## 📜 Prologue

Oh yes, **floating-point** numbers.

They frequently appear in technical content, full of scientific notations and complex explanations.

It's almost certain that every programmer has already faced the notion that working with floating-point numbers can be **perilous**, resulting in **imprecise** arithmetic outcomes, among other issues.

However, comprehending all the underlying reasons behind this *crucial topic in computer science* can be challenging for many.

In today's post, we will delve into the **problems that floating-point numbers address** and explore the **involved caveats**.

So, grab a refreshing bottle of water, and let's embark on yet another journey into the realm of *floating-point* numbers.

## 👍🏼 First things first

Computers can only understand **machine language**.

*Machine language* is a sequence of **bits** that encode data and instructions for the CPU. Each bit is either 0 or 1, which is why this representation is called the **base-2 (binary) numeral system**.

```
01001001 01001000 11001011 01000001 01001000 10001000
01011001 01001000 01000001 01101001 01001000 01001001
11000001 10001000 01001001 11001010 10001000 01001000
11001001 01001000 11001001 01001000 01001000 01001001
```

Programming directly in machine language is highly *error-prone* and often inefficient in many scenarios. To address this, **assembly languages** were introduced over the years, serving as a bridge between CPU architecture specifics and a higher-level set of instructions.

**Assembly languages** are translated into machine code through a dedicated program called "assembler." Each CPU architecture typically has its own assembler associated with it.

This allows programmers to work with a more manageable and human-readable instruction set that is then translated into machine code specific to the target architecture.

```
section .data
number1 dd 10.0 ; Define the first number as a 32-bit float
number2 dd 20.0 ; Define the second number as a 32-bit float
section .text
global _start
_start:
; Load the first number into xmm0 register
movss xmm0, dword [number1]
; Load the second number into xmm1 register
movss xmm1, dword [number2]
.....
.....
```

Advancements in the field of computer engineering have paved the way for the development of increasingly high-level programming languages, which compilers and interpreters translate into machine code instructions.

Over the course of the following decades, languages like **C, Java, and Python**, among others, emerged, enabling individuals with limited knowledge of computer internals to write programs for computers.

This significant accomplishment has had a profound impact on the industry, as computers became more compact and faster, empowering modern software engineering practices to deliver substantial value to businesses worldwide.

## 🔵 Bits are not enough

As mentioned earlier, computers solely comprehend **binary bits**.

Nothing else in this world can be interpreted by computers.

*Only. Bits.*

💡 Actually, CPUs in electronic computers comprehend only the absence or presence of voltage, allowing us to represent information using 0 and 1 (off and on).

However, real-life scenarios present challenges where computer programs, which are created **by people for people**, need to represent a broader range of characters beyond just 0s and 1s. This includes letters, decimal numbers, hexadecimal numbers, special characters, punctuation marks, and even emojis like 😹.

Standard character sets such as **ASCII** and **Unicode** schemes solve the challenge of representing numbers, letters, special characters, emojis, and more within the binary system.

⚠️ Delving into the intricacies of character encoding is beyond the scope of this article. It will be covered in future posts.

*Here, our focus will be specifically on how computers work with numbers in memory*, particularly **integers**.

## 🔵 Bits and integers

Let's take the number *65* as an example. It is written in the **base 10** numeral system and is a real number.

Moreover, it is classified as an **integer**.

By performing conversions based on powers of 2, we can represent the integer 65 as `01000001` in an 8-bit binary format. This binary representation can be converted back and forth to the decimal value 65.
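As a minimal sketch of that conversion (written in Python purely for illustration; the variable names are my own), the built-in `format` and `int` functions go from decimal to an 8-bit string and back, and the manual sum shows which powers of 2 are involved:

```python
# Convert 65 to an 8-bit binary string and back again.
value = 65
bits = format(value, "08b")  # zero-padded to 8 binary digits
restored = int(bits, 2)      # parse the string as a base-2 number

# Manual expansion: add the power of 2 for every bit that is set.
expanded = sum(2 ** i for i, bit in enumerate(reversed(bits)) if bit == "1")

print(bits)      # 01000001
print(restored)  # 65
print(expanded)  # 65, i.e. 2^6 + 2^0
```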

From a *mathematical perspective*, 65 fits within a single byte (8 bits). Using powers of 2, we know that a single byte can accommodate 256 distinct values:

```
2^8 = 256
```

Naively speaking, one might assume that a single byte can represent integers ranging from 0 to 255.

However, integers must represent both *negative and positive* numbers. How should we evenly distribute those integers in a single byte?

We should employ a technique called **two's complement**.

### 👉 Two's complement

To evenly distribute negative and positive integers within 8 bits, we can use a technique called **two's complement**. In this technique:

- the leftmost bit serves as the **sign bit**, indicating whether the number is positive (0) or negative (1)
- to obtain the negative of a number, all the bits are *flipped or inverted*
- we then **add 1** to the resulting value

This way, a single byte represents integers ranging from -128 to 127.

```
2^8 = 256
-128, -127, -126...126, 127
```
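The flip-and-add-one procedure can be sketched in a few lines of Python. The helper names `to_twos_complement` and `from_twos_complement` are hypothetical, chosen for this illustration; masking with `2^bits - 1` is a shortcut that produces the same bit pattern as inverting the bits and adding 1.

```python
BITS = 8  # one byte, as in the example above

def to_twos_complement(value: int, bits: int = BITS) -> str:
    """Encode a signed integer as a two's complement bit string."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    # Masking keeps the lowest `bits` bits; for negative values this
    # equals "flip every bit of the magnitude, then add 1".
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def from_twos_complement(pattern: str) -> int:
    """Decode a two's complement bit string back into a signed integer."""
    raw = int(pattern, 2)
    if pattern[0] == "1":         # sign bit set: the value is negative
        raw -= 1 << len(pattern)  # subtract 2^bits
    return raw

print(to_twos_complement(65))            # 01000001
print(to_twos_complement(-65))           # 10111111
print(to_twos_complement(-128))          # 10000000
print(from_twos_complement("10111111"))  # -65
```

Asking for `to_twos_complement(128)` raises `OverflowError`, which mirrors the fact that 127 is the largest value a signed byte can hold.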

### 👉 Using two bytes

By employing the two's complement technique, we can also represent a range of integers using two bytes (16 bits). Utilizing the concept of powers of 2, we can observe that two bytes can accommodate a total of 65536 different values:

```
2^16 = 65536
```

Considering negative numbers, the range extends from -32768 to 32767, inclusive.
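To double-check those bounds outside the database, here is a small Python sketch. It relies on the standard `struct` module, whose `h` format code packs a signed 16-bit integer; the variable names are illustrative only.

```python
import struct

BITS = 16
low = -(2 ** (BITS - 1))     # smallest signed 16-bit value
high = 2 ** (BITS - 1) - 1   # largest signed 16-bit value
print(low, high)             # -32768 32767

# ">h" means big-endian, signed, two bytes.
packed = struct.pack(">h", high)
print(packed.hex())          # 7fff

# One step past the upper bound no longer fits in two bytes.
try:
    struct.pack(">h", high + 1)
except struct.error as exc:
    print("out of range:", exc)
```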

Now, let's explore some examples using **PostgreSQL**. If you prefer to work with containers, setting up a quick `psql` terminal is straightforward. You can achieve it by running the following commands:

```
$ docker run --rm -d \
--name postgres \
-e POSTGRES_HOST_AUTH_METHOD=trust \
postgres
```

Then, access the `psql` terminal with the following command:

```
$ docker exec -it postgres psql -U postgres
```

In PostgreSQL, the data type that represents a two-byte integer is called **int2** or **smallint**:

```
SELECT 65::int2;
int2
------
65
```

To check the data type, we can use the function `pg_typeof`:

```
SELECT pg_typeof(65::int2);
pg_typeof
-----------
smallint
```

As **smallint** uses two bytes, it can only accommodate the range we mentioned earlier in terms of bits and integers:

```
SELECT 32767::int2;
int2
-------
32767
SELECT -32767::int2;
int2
-------
-32767
```

However, if we attempt to exceed the range:

```
SELECT 32768::int2;
ERROR: smallint out of range
```

Pretty neat, *isn't it*?

In addition to **smallint**, PostgreSQL offers a variety of other integer data types:

| Data Type | Description | Range of Integers |
|---|---|---|
| smallint | Two-byte integer | -32,768 to 32,767 |
| integer | Four-byte integer | -2,147,483,648 to 2,147,483,647 |
| bigint | Eight-byte integer | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |

However, we all know that the world is not only integers. Integers are a subset of a broader set of numbers, called **real numbers**.

## 🔵 Bits and other real numbers

**Real numbers** can include integers, fractions, and decimals, both rational and irrational.

For instance, *3.14159* approximates the real **number π** (pi), which is an irrational number. It is a *non-repeating and non-terminating decimal*: the digits of π extend infinitely without ever settling into a repeating pattern.

```
3.14159265358979323846....
```

Suppose we have two bytes (16 bits), which can represent 65536 integers ranging from -32768 to 32767.

When it comes to representing other real numbers, such as decimals, we can use a technique called **fixed-point**.

## 🔵 Fixed-point representation

In fixed-point representation, we split the provided 16 bits into three sections:

### 👉 Sign bit

The first bit (leftmost) represents the sign, being 1 for negative and 0 for positive.

### 👉 Decimal part

The next 7 bits represent the decimal (fractional) part, which advances in steps of `2^-7` (`0.0078125`) and can reach a maximum of `0.992188` in our simulation:

```
2^-7 + 2^-6 + ... + 2^-1 =
0.992188
```

### 👉 Integer part

The remaining 8 bits represent the integer part, which can go up to `127` using two's complement:

```
two_complement(
2^6 + 2^5 + ... + 2^0 =
127
)
```

Considering that the integer part, using 8 bits with two's complement, ranges from -128 to 127, we can conclude that, with *fixed-point representation*, decimals can range from **-128.992188 to 127.992188**.
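A rough way to experiment with this idea is to treat a fixed-point value as an integer count of `2^-7` steps. The Python sketch below is a simplification of the layout described above: it uses an ordinary signed integer instead of an explicit sign bit, and the names `to_fixed` and `from_fixed` are hypothetical.

```python
FRACTION_BITS = 7
SCALE = 1 << FRACTION_BITS  # 128 steps per whole unit

def to_fixed(value: float) -> int:
    """Store a number as an integer count of 1/128 steps."""
    return round(value * SCALE)

def from_fixed(raw: int) -> float:
    """Recover the real value from its step count."""
    return raw / SCALE

print(1 / SCALE)                          # 0.0078125, the smallest step
print(sum(2 ** -i for i in range(1, 8)))  # 0.9921875, the largest fraction
print(from_fixed(to_fixed(3.5)))          # 3.5, exactly representable
print(from_fixed(to_fixed(0.1)))          # 0.1015625, the nearest step to 0.1
```

The last line hints at the inefficiency mentioned next: with only 7 fractional bits, a value as ordinary as 0.1 lands on the closest multiple of 1/128.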

However, this technique may not always be the most efficient. Let's explore another technique for representing decimals.

Yes, we are talking about the widely used **floating-point** representation.

## 🔵 Floating-point representation

Taking 16 bits still as an example, in floating-point representation we also split the 16 bits into three groups:

### 👉 Sign bit

The first bit (leftmost) is used to represent whether the number is negative (1) or positive (0).

### 👉 Exponent part

This crucial component is what makes the point *float*: it scales the mantissa up or down by a power of 2. It is assigned the next X bits.

For our simulation, let's allocate 7 bits for the exponent part, while utilizing the first exponent bit for the exponent sign.

As a result, the remaining 6 bits give the exponent a range from -63 to 63, accommodating both negative and positive values:

```
2^5 + 2^4 + ... + 2^0 =
63
```

*This part defines the range of magnitudes the format can reach, while the mantissa determines how finely values within that range can be resolved.*

### 👉 Mantissa

The **Mantissa** part, also known as the *significand*, takes the remaining 8 bits, allowing for a range of up to 255.

*As we are not representing the integer part in this simulation, there is no need to apply two's complement to the mantissa.*

🔑 **Now the key part**

To calculate the maximum positive floating-point number, we multiply the largest mantissa by 2 raised to the largest exponent.

In this case, the maximum positive value would be obtained by **multiplying 255 by 2^63**, resulting in an exceedingly large number like **2351959869397967831040.0**.

Conversely, the minimum positive number can be represented as 1 multiplied by **2^-63**, or **0.00000000000000000010842021724855044340074528008699**.

Please note that this simulation is a simplified representation with limited precision and may not reflect the accuracy of ideal or standardized floating-point formats.
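The two extremes of this toy format can be reproduced with exact arithmetic. The following Python sketch assumes the layout used in the simulation (8 mantissa bits, 6 exponent magnitude bits plus an exponent sign); `toy_value` is a hypothetical helper, and `Fraction` is used so that no rounding sneaks into the check.

```python
from fractions import Fraction

MANTISSA_BITS = 8            # unsigned mantissa: 0..255
EXPONENT_MAGNITUDE_BITS = 6  # 7 exponent bits minus 1 sign bit: -63..63

MAX_MANTISSA = 2 ** MANTISSA_BITS - 1
MAX_EXPONENT = 2 ** EXPONENT_MAGNITUDE_BITS - 1

def toy_value(sign: int, mantissa: int, exponent: int) -> Fraction:
    """Exact value of (-1)^sign * mantissa * 2^exponent."""
    return (-1) ** sign * mantissa * Fraction(2) ** exponent

largest = toy_value(0, MAX_MANTISSA, MAX_EXPONENT)
smallest = toy_value(0, 1, -MAX_EXPONENT)

print(largest)   # 2351959869397967831040
print(smallest)  # 1/9223372036854775808
```

The same 16 bits that covered roughly ±128 in fixed-point now span more than forty orders of magnitude, at the cost of only 8 bits of resolution at any given scale.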

## 🔵 Issues and standards

Indeed, **selecting an appropriate number of bits for the exponent and mantissa parts** in floating-point representation is crucial: the exponent bits set the range of representable magnitudes, and the mantissa bits set how much rounding and truncation occurs when handling fractional numbers.

Standards like **IEEE 754** were established precisely to address these concerns and provide a consistent framework for floating-point representation. The IEEE 754 standard defines the number of bits allocated to the exponent, mantissa, and sign in both single precision (32 bits) and double precision (64 bits) formats.

These standards determine the precise representation of the various components of a floating-point number, the *rules for arithmetic operations*, and how to handle exceptional cases.

### 👉 Single precision (4 bytes)

**Single precision** numbers are represented using 32 bits of memory.

They include:

- 1 bit for the sign of the number
- 8 bits for the exponent
- 23 bits for the mantissa

According to the IEEE 754 standard, single precision provides about **6 to 9 significant decimal digits of precision**.

### 👉 Double precision (8 bytes)

**Double precision** numbers are represented using 64 bits of memory.

They include:

- 1 bit for the sign of the number
- 11 bits for the exponent
- 52 bits for the mantissa

According to the IEEE 754 standard, double precision provides about **15 to 17 significant decimal digits of precision**.

*Usually, double-precision fits better when high precision is mandatory, but it consumes more memory.*
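To see the three single-precision fields for a concrete value, the Python sketch below packs a number as a 32-bit float with the standard `struct` module and slices the resulting bit string. The function name `float32_fields` is my own; the value `0.15625` is chosen because it is exactly representable in binary.

```python
import struct

def float32_fields(value: float) -> tuple[str, str, str]:
    """Split a value stored as IEEE 754 single precision into its bit fields."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]  # sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(0.15625)
print(sign)      # 0
print(exponent)  # 01111100
print(mantissa)  # 01000000000000000000000
```

Here `0.15625` is `1.01` in binary times `2^-3`. The exponent field stores `-3 + 127 = 124` because single precision uses a bias of 127, and the leading `1` of the mantissa is implicit, so only `.01` is stored.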

## 🔵 Floating-point data types

Many programming languages and database systems adhere to the IEEE 754 standards, and PostgreSQL is no exception.

Let's see how PostgreSQL implements float data types in action.

The datatype **float4** conforms to the IEEE 754 single-precision standard, which allocates 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa:

```
SELECT 0.3::float4;
float4
--------
0.3
```

Likewise, the datatype **float8** conforms to the IEEE 754 double-precision standard, which allocates 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa:

```
SELECT 0.3::float8;
float8
--------
0.3
-- float without an explicit precision
SELECT 0.3::float;
float
--------
0.3
```

*The default float falls back to double-precision (float8).*

## ☣️ Floating-point issues in action

Let's dive into calculations with floating-point numbers and see the **potential issues** in action.

Take a straightforward sum of `0.1 + 0.2`:

```
SELECT 0.1::float + 0.2::float;
0.30000000000000004
```

This result shows how precision issues can arise in double-precision floating-point numbers during arithmetic operations. Even when following standards, we are not immune to these floating-point calculation challenges.
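The same behavior shows up in any language that uses IEEE 754 doubles. As an illustration in Python, converting the float `0.1` to a `Decimal` reveals the exact binary value that is actually stored, which explains where the stray digits come from:

```python
from decimal import Decimal

total = 0.1 + 0.2
print(total)         # 0.30000000000000004
print(total == 0.3)  # False

# Decimal(float) shows the exact value of the nearest double to 0.1.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```

Neither 0.1 nor 0.2 has a finite binary expansion, so each operand is already slightly off before the addition happens.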

However, there's an alternative strategy that involves a nifty trick using **integers**.

### 💡 A trick with integers

Instead of the float data type, we can work with **integers**. We incorporate a multiplier factor based on a *decimal scale* when storing values, and then divide by the same factor to restore the original decimal representation when retrieving the value.

This method enables precise decimal calculations by leveraging integers and scaling. The multiplier factor should be chosen based on the required decimal precision.

To demonstrate, let's use this trick to perform `0.1 + 0.2`:

```
SELECT (0.1 * 1000)::int + (0.2 * 1000)::int;
300
```

Here, each input is multiplied by `1000` and then converted to an integer. To retrieve the original value without losing precision, we divide by `1000`:

```
SELECT (300 / 1000::float);
0.3
```

*Yay!* 🚀
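Outside SQL, the same trick might look like the Python sketch below. It assumes a fixed scale of three decimal places and non-negative inputs given as strings, so the value never passes through a float; `to_scaled` and `to_text` are hypothetical helper names.

```python
SCALE_DIGITS = 3            # assumed scale: three decimal places
SCALE = 10 ** SCALE_DIGITS  # multiplier factor of 1000

def to_scaled(text: str) -> int:
    """Parse a non-negative decimal string such as '0.1' into thousandths."""
    whole, _, fraction = text.partition(".")
    if len(fraction) > SCALE_DIGITS:
        raise ValueError("more decimal places than the scale allows")
    # Right-pad the fraction with zeros so "1" becomes "100" thousandths.
    return int(whole) * SCALE + int(fraction.ljust(SCALE_DIGITS, "0"))

def to_text(scaled: int) -> str:
    """Format a scaled integer back into a decimal string."""
    whole, fraction = divmod(scaled, SCALE)
    return f"{whole}.{fraction:0{SCALE_DIGITS}d}"

total = to_scaled("0.1") + to_scaled("0.2")
print(total)           # 300
print(to_text(total))  # 0.300
```

All arithmetic happens on plain integers, so the sum is exact; the decimal point only reappears when the value is formatted for display.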

However, using a fixed multiplier factor may be inefficient when dealing with inputs that have varying decimal places.

Instead, a variable-scale representation could be employed by converting the input into a string and parsing the number of decimal digits.

But be aware, variable-scale decimal representations demand **careful handling of complex calculations**, precise decimal scaling, and various other intricacies of decimal arithmetic.

This is where **decimals** come in.

## 🔵 Decimals to the rescue

Decimals address the challenges associated with complex arithmetic calculations involving decimals. They significantly reduce the precision issues commonly encountered with floating-point numbers.

Various programming languages and database systems have implemented decimals. PostgreSQL provides the datatype **decimal**, which offers superior precision compared to floats.

```
SELECT 0.1::decimal + 0.2::decimal;
0.3
```

Decimals can also be configured for arbitrary precision and scale:

```
-- Example: accepts numbers up to 999.99
SELECT 0.1::decimal(5, 2);
0.10
SELECT 999.99::decimal(5, 2);
999.99
```
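For comparison, Python's standard `decimal` module behaves similarly. In this sketch, values are built from strings so they start out exact, and `quantize` is used as a rough analogue of fixing the scale to two decimal places (it does not enforce a total precision the way `decimal(5, 2)` does).

```python
from decimal import Decimal

a = Decimal("0.1")
b = Decimal("0.2")

print(a + b)                    # 0.3
print(a + b == Decimal("0.3"))  # True

# Fix the scale to two decimal places, similar in spirit to decimal(5, 2).
TWO_PLACES = Decimal("0.01")
print(a.quantize(TWO_PLACES))                  # 0.10
print(Decimal("999.99").quantize(TWO_PLACES))  # 999.99
```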

Handily, the default datatype for decimals in PostgreSQL is **numeric**, which is identical to *decimal*:

```
SELECT pg_typeof(0.1);
numeric
```

## ⚠️ Beware of rounding

Rounding decimal numbers programmatically can lead to imprecise results. For instance, the sum `25.986 + -0.4125 + -25.5735` should theoretically yield zero:

```
SELECT 25.986 + -0.4125 + -25.5735;
0.0000
```

Let's illustrate how we can round only the final sum to two decimal places:

```
SELECT ROUND(25.986 + -0.4125 + -25.5735, 2);
0.00
```

So far, so good, it works as expected.

With proper datatypes such as *decimal*, the arithmetic issue inherent to floating-point numbers is already addressed.

But **rounding introduces its own set of challenges**. Even if decimals are excellent for precision and arithmetic of decimal data, rounding operations inherently involve *some degree of approximation*.

Now, let's round each number before summing:

```
SELECT ROUND(25.986, 2) + ROUND(-0.4125, 2) + ROUND(-25.5735, 2);
0.01
```

*Uh, oh* 😭

Every time we round a number, we add a bit of imprecision. *Bit by bit*, the final result can drift far from the expected value.

These examples underline why **unnecessary rounding should be avoided**. As rounding is an approximation, it's best to postpone it until the final step, i.e., *when presenting the data to the end user*.
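The difference between rounding early and rounding late can be reproduced with Python's `decimal` module. This sketch uses `ROUND_HALF_UP` on the assumption that it matches how the `ROUND` calls above treat ties; `round2` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

values = [Decimal("25.986"), Decimal("-0.4125"), Decimal("-25.5735")]
CENT = Decimal("0.01")

def round2(value: Decimal) -> Decimal:
    """Round to two decimal places, with ties going away from zero."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)

late = round2(sum(values))               # round once, at the very end
early = sum(round2(v) for v in values)   # round every operand first

print(late)   # 0.00
print(early)  # 0.01
```

Rounding each operand turns the inputs into 25.99, -0.41, and -25.57, and the small errors no longer cancel out.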

## ➕ Decimals in other technologies

Most programming languages and technical tools have their own data type for handling arbitrary precision, such as PostgreSQL's decimals.

Ruby offers the BigDecimal class, which facilitates arbitrary-precision floating-point decimal arithmetic.

Similarly, Java also includes a BigDecimal class.

Go is no exception: its standard `math/big` package provides arbitrary-precision integers, rationals, and floating-point numbers, and third-party packages add decimal types.

It's crucial to verify that the technology you're using provides support for arbitrary precision. If you require greater accuracy, these solutions are often more suitable than using raw floating-point numbers.

## Wrapping Up

In this post, we delved into the intricacies of **floating-point** numbers.

We explored how computers comprehend information through the **binary system**, from integer representation and fixed-point representation's inefficiency for decimals, to floating-point numbers and their *caveats*.

We also investigated how **arbitrary-precision** data types like **decimal** address these precision issues. Furthermore, we discussed **rounding issues** and shared best practices for dealing with them.

I hope these complex topics have been presented in a way that's easy to understand, making floating-point issues no longer an issue!

*Cheers!*

## References

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

https://www.postgresql.org/docs/current/datatype.html

https://en.wikipedia.org/wiki/IEEE_754

https://www.doc.ic.ac.uk/~eedwards/compsys/float/

https://en.wikipedia.org/wiki/Floating-point_error_mitigation

https://en.wikipedia.org/wiki/Single-precision_floating-point_format

https://en.wikipedia.org/wiki/Double-precision_floating-point_format

https://en.wikipedia.org/wiki/Decimal_floating_point
