DEV Community


Getting the precision for a float

teclado profile image Teclado Mugriento Updated on ・3 min read

When using floating-point numbers, we're not used to thinking about the precision of their values. We only want a variable that can change continuously, but we rarely take into account that these variables don't take a continuous range of values, due to their binary (and finite) representation.

If we look at the IEEE 754 specification, we can see the agreed meaning of each bit of this representation for each of the types we handle.

There we see space reserved for the sign and for the exponent. That's OK, since they are integers by nature. It is the mantissa that should be continuous, but due to its limited size, it's not. With a fixed sign and exponent, it behaves just like an integer.
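To make the layout concrete, here is a minimal sketch that splits a float into those three fields. It assumes float is the 32-bit IEEE 754 binary32 type (1 sign bit, 8 exponent bits, 23 mantissa bits); the names float_fields and split_float are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three IEEE 754 binary32 fields. */
struct float_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t mantissa;  /* 23 bits, the implicit leading 1 is not stored */
};

/* Assumes float is IEEE 754 binary32 (32 bits wide). */
struct float_fields split_float(float x)
{
    uint32_t bits;
    struct float_fields f;

    memcpy(&bits, &x, sizeof bits);     /* copy the bit pattern into an integer */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For 1.0f this gives sign 0, exponent 127 and mantissa 0. The next float above 1.0f has the same sign and exponent and mantissa 1, which is the "integer" behavior described above.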

Any scientist will agree that if the floating-point number were paired with another one (maybe one with less precision), the pair would indicate a range containing the actual (infinite-precision) value of the number we're representing.

However, this has never been implemented in the core logic of any computer, so we need to build it ourselves.

If we have a final result after some calculations, its least significant digit is never precise. This is because we had to round a real number to fit in a fixed-length binary representation, so the uncertainty we have to assume for the result is never less than half the value of this bit (which changes as the exponent changes).

Here is an example of a little C function that gets this error for a floating-point number on a little-endian machine:

#include <math.h>   /* fabsf */

float errf(float x)
{
    float nearest = x;
    unsigned char *cnp = (unsigned char *)&nearest;
    cnp[0] ^= 1;    /* first byte in memory holds the lowest mantissa bits (little-endian) */
    return fabsf(x - nearest);
}

If the rounding was done correctly, we could divide this by 2 and still get a sensible result, but I'm not sure about that right now. In addition, on a big-endian machine we should flip the last byte, not the first.

This piece of code flips the least significant bit, getting the next exactly representable number (in magnitude) if this bit is a 0, and the previous one if it's a 1, in order to subtract it from the provided value.
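If the byte order is a concern, the same bit flip can be sketched on an integer copy of the float, where the lowest bit is the same regardless of how the bytes are stored. This is only a sketch: errf_bits is a made-up name, and it assumes float is the 32-bit IEEE 754 binary32 type with the same byte order as uint32_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical variant of errf; assumes float is IEEE 754 binary32. */
float errf_bits(float x)
{
    uint32_t bits;
    float nearest;

    memcpy(&bits, &x, sizeof bits);           /* float -> integer bit pattern */
    bits ^= 1u;                               /* flip the lowest mantissa bit */
    memcpy(&nearest, &bits, sizeof nearest);  /* integer bit pattern -> float */

    /* absolute difference, written out so no math library is needed */
    return x > nearest ? x - nearest : nearest - x;
}
```

Under those assumptions, errf_bits(1.0f) is 2^-23 (about 0.00000012) and errf_bits(0.5f) is 2^-24 (about 0.00000006).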

We could find a way to get both neighbors in every case and subtract the previous one from the next, but the only difference appears when all the bits of the mantissa are 0, in which case the previous representable number is nearer to x than the next one. This means we can always be sure that the actual real (again, infinite-precision) value is somewhere between x-errf(x) and x+errf(x).

Feel free to comment with other methods of getting this error, maybe in other languages ;-D.

Doing operations

The trick with this topic is that if we want to operate with float values, we have to take this error into account. If we add 0.5 three times, each with an intrinsic error of 0.000000060, the errors add up too, so we get a result of 1.50000000±0.00000018, even though the intrinsic error of the result is only errf(1.5)=0.00000012.
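A minimal sketch of this bookkeeping could pair each value with its error and add both at once. The type efloat and the function ef_add are made-up names, and as a simplification the sketch ignores the extra rounding that the addition itself may introduce.

```c
/* Hypothetical type: a float paired with its accumulated absolute error. */
struct efloat {
    float value;
    float err;
};

/* Sum of two values; following the rule in the text, the errors simply add. */
struct efloat ef_add(struct efloat a, struct efloat b)
{
    struct efloat r;

    r.value = a.value + b.value;
    r.err   = a.err + b.err;
    return r;
}
```

Starting from a value of 0.5f with an error of 2^-24 (about 0.00000006) and adding it to itself twice with ef_add gives a value of 1.5 and an error of 3 * 2^-24, the 0.00000018 from the example.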

Multiplication, division and subtraction have their own rules too. It would be wonderful if we took this into account in every operation we do. When we do many of them, as often happens, the accumulated error can grow pretty fast, and we should be prepared to discard a result when its error gets greater than some sensible limit.
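The article doesn't spell those rules out, so the sketch below uses the usual worst-case bounds for a value known to within ±err. Take it as an assumption, not as the only possible choice. All the names (efloat, ef_sub, ef_mul, ef_div, ef_too_uncertain) and the relative limit max_rel are hypothetical, the rounding of each operation itself is ignored, and the division assumes the divisor's error is smaller than its magnitude.

```c
#include <math.h>

/* Hypothetical type: a float paired with its accumulated absolute error. */
struct efloat {
    float value;
    float err;
};

/* Subtraction: as with addition, the absolute errors add. */
struct efloat ef_sub(struct efloat a, struct efloat b)
{
    struct efloat r;

    r.value = a.value - b.value;
    r.err   = a.err + b.err;
    return r;
}

/* Multiplication: worst case of (a ± ea)(b ± eb) minus a*b. */
struct efloat ef_mul(struct efloat a, struct efloat b)
{
    struct efloat r;

    r.value = a.value * b.value;
    r.err   = fabsf(a.value) * b.err + fabsf(b.value) * a.err + a.err * b.err;
    return r;
}

/* Division: worst case of (a ± ea)/(b ± eb) minus a/b.
   Assumes b.err < |b.value|, so the divisor's range excludes zero. */
struct efloat ef_div(struct efloat a, struct efloat b)
{
    struct efloat r;
    float absb = fabsf(b.value);

    r.value = a.value / b.value;
    r.err   = (fabsf(a.value) * b.err + absb * a.err) / (absb * (absb - b.err));
    return r;
}

/* Returns 1 when the error exceeds max_rel times the magnitude of the value. */
int ef_too_uncertain(struct efloat x, float max_rel)
{
    return x.err > max_rel * fabsf(x.value);
}
```

For example, multiplying 2±0.25 by 3±0.5 gives 6±1.875, which covers the true extremes 1.75*2.5=4.375 and 2.25*3.5=7.875. With a limit of 10% that result would already be flagged by ef_too_uncertain.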


Cover image: Wikipedia
