Basti Ortiz


Thank you to byte-sized integers

A very brief history of computing numbers

In the olden days, the early pioneers of modern computing had a seemingly insurmountable obstacle ahead of them: they had to figure out how computers could effectively and efficiently represent the decimal number system. They eventually agreed upon using ones and zeroes to emulate integers. Although it was strange to use a base-2 number system to emulate a base-10 one, the design proved ingenious because the on-and-off nature of ones and zeroes coincided with the on-and-off behavior of transistors, logic gates, arithmetic units, and processing modules. It was basically a match made in heaven.

Problems arose when they had to consider fractions, precision, and negative numbers. Clever innovations and workarounds, such as the sign bit and two's complement, ultimately led to the formation of the IEEE 754 standard in 1985. Its binary format was designed to resemble scientific notation so that it could store a wider range of numbers with greater decimal precision and perform arithmetic on them efficiently.

Despite its computational flaws and limitations (such as the infamous 0.1 + 0.2 != 0.3), the standard serves as a viable compromise for many computer manufacturers and language designers. In fact, the standard (and its succeeding revisions) is so well-engineered that we often take it for granted. Most of the computers and programming languages we know today either fully adopt the IEEE 754 standard or derive from it in some way. They could not function as efficiently and precisely without such a standardized method of handling floating-point arithmetic.
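
To see that quirk for yourself, here is a minimal C++ sketch (my own addition, not from the original post) that compares 0.1 + 0.2 against 0.3 using plain doubles:

#include <cstdio>

int main() {
    double sum = 0.1 + 0.2;  // neither 0.1 nor 0.2 has an exact binary representation
    std::printf("%.17g\n", sum);                              // prints 0.30000000000000004
    std::printf("%s\n", sum == 0.3 ? "equal" : "not equal");  // prints "not equal"
    return 0;
}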

Memory inefficiencies for dynamically-typed languages

The IEEE 754 standard is a great solution until we consider dynamically-typed languages. High-level, dynamically-typed languages—namely JavaScript and Python, two of the most popular and most widely adopted programming languages in recent years—comply with the IEEE 754 standard (2008 revision) for their numbers by default. There is no way to customize this behavior, because doing so would render the "high-level" and "dynamic" aspects of such a language meaningless: one would have to statically type each number to gain that low a level of control over memory.

Though beginner-friendly and flexible for many use cases, representing every number as a signed 64-bit (double-precision) floating-point number presents quite a huge problem from a memory optimization standpoint. Not all use cases require such range and precision. Allocating 52 bits for the mantissa, 11 bits for the exponent, and 1 bit for the sign is definitely overkill for common usage. Eight bytes are simply too much for a small, positive integer—like the length property of an array, for example—that could easily be represented by 1 or 2 bytes.

Nerding out over C++ integer types

C++ gives us greater freedom over the size of our numbers with int, float, and double types and their respective modifiers (signed, unsigned, short, and long). When I first discovered this, my inner nerd was immediately excited by the amount of control I had. Suddenly, I was released from the shackles of dynamically-typed languages. Suddenly, I had the ability to use as little memory as I deemed safe and necessary.

Since I rarely use fractions and negative numbers in my programs, the unsigned short int is my default number type. Unless an API/implementation requires otherwise or I find a real possibility of integer overflow, I have no particular reason to upgrade to a larger integer type. Call me a pedant for needlessly "optimizing", but my inner nerd just finds a lot of satisfaction in saving as much memory as I can. Although it is quite verbose to type unsigned short int everywhere, it is nonetheless a great feeling to know that I have saved 6 bytes of memory by not being forced to use double-precision floating-point numbers.
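
For the curious, here is a minimal C++ sketch (my own addition, not from the original post) that prints the sizes of a few of these types. The exact sizes are implementation-defined, so the comments describe typical 64-bit desktop platforms rather than guarantees:

#include <iostream>

int main() {
    std::cout << "unsigned short int: " << sizeof(unsigned short int) << " bytes\n";  // usually 2
    std::cout << "unsigned int:       " << sizeof(unsigned int)       << " bytes\n";  // usually 4
    std::cout << "long long int:      " << sizeof(long long int)      << " bytes\n";  // at least 8
    std::cout << "double:             " << sizeof(double)             << " bytes\n";  // usually 8
    return 0;
}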

Going even crazier with Rust integer types

Just when I thought that 2-byte unsigned short int numbers were the ultimate solution to my obsession with memory optimization, Rust comes into my life and slaps me across the face with unsigned 1-byte integers (type-annotated as u8).

Sure, one can argue that C++ also has a construct for a "byte-sized" integer, but semantically speaking, a char is meant to be used as a character, not an integer. Declaring an integer as an unsigned char will surely get the job done for me, but without explicit documentation, it horribly fails to communicate my intent to interpret it as an integer. Simply put, Rust provides a semantically superior construct for "byte-sized" integers compared to C++. As an added bonus, it is also much more convenient to type u8 than unsigned char.

But then you may ask, "What is the point of an integer type that overflows past 255?" Honestly, unless you are using it—for example—to store the value of a color component (as in the RGB color model), there is no clear advantage in using a "byte-sized" integer over an unsigned short int (or a u16 in Rust). You have to be really pedantic (or quite nerdy) like me to even consider using it.
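
To make that RGB use case concrete, here is a minimal C++ sketch (my own addition, not from the original post) that models a color with byte-sized channels via the fixed-width uint8_t alias, which a commenter below also brings up:

#include <cstdint>
#include <iostream>

// Each color channel only ever holds 0-255, so one byte per channel is enough.
struct Rgb {
    std::uint8_t r;
    std::uint8_t g;
    std::uint8_t b;
};

int main() {
    Rgb rebecca_purple{102, 51, 153};
    std::cout << sizeof(Rgb) << " bytes\n";  // 3 bytes: three 1-byte fields need no padding
    std::cout << static_cast<int>(rebecca_purple.r) << ", "
              << static_cast<int>(rebecca_purple.g) << ", "
              << static_cast<int>(rebecca_purple.b) << "\n";
    return 0;
}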

// JavaScript ❌
// Very concise but very memory-inefficient (8 bytes)
const num = 1;
// C++ 😐
// Quite memory-efficient (2 bytes) but very verbose
const unsigned short int num = 1;
// Rust ✅
// Quite concise and very memory-efficient (1 byte)
let num: u8 = 1;

Thank you, "byte-sized" integers

At the end of the day, one can argue that these kinds of "micro-optimizations" have no impact on the overall performance of a program thanks to the greatness of modern CPUs and RAM. Yes, I completely agree with that argument, but I did not write this article to assert that "we must always use u8 integers whenever possible". I wrote this article to express my gratitude towards number types for the freedom they give me as a programmer over memory allocation.

In this day and age, when high-level, dynamically-typed languages rule the software development scene, this degree of control over memory has become a thing of the past... and rightfully so! Nowadays, it has become unnecessarily tedious and rather unproductive to worry about the nuances of memory management, especially with the recent rise in popularity of the "agile development" philosophy.

Nevertheless, I do not allow this to get in the way of finding joy in the little things in life. I find a special satisfaction in maximizing the bits, bytes, and nibbles I have at my disposal. It's not because I want to be unproductive or anything like that; some part of me just pats me on the back when I know I did my best to manage the limited resources I have.

Perhaps this obsession with resource optimization comes from the fact that I was surrounded by low-end devices growing up. I vividly remember how I couldn't bear spending another second waiting for a program to finally become responsive. From then on, I made a promise to myself that I would never write software that makes people suffer through the excruciatingly long waiting times I experienced during my childhood.

And for that reason, I say "thank you" to number types—namely the unsigned short int of C++ and the u8 of Rust—for allowing me to fulfill my lifelong devotion to optimization and for being by my side whenever I need my daily dose of memory optimization to cheer me up.

🥂 Cheers to number types! 🥂

Top comments (9)

Vincent Milum Jr

Some things to be aware of, at least with C/C++: using smaller variable types might not actually save any RAM at all.

If you do a basic loop like:

for (unsigned int i=0; i<10; i++) {}

The variable "i" is never even in RAM to begin with; it will live exclusively in a CPU register. By forcing the data size, you could inadvertently be de-optimizing the code, because each iteration may need an extra instruction to truncate the value in the register to match the given data type (the compiler should be aware of this, though, and usually optimizes it away).

In your other example:

const unsigned short int num = 1;

is another case of the same thing. The compiler will see it as an unchangeable variable and therefore replace it entirely with an inline literal when generating the assembly code, making the data size moot as well (but once again possibly needing extra truncation instructions due to the forced data size).

This is honestly why I love just using the "auto" variable type now. Let the compiler figure out and optimize the best possible data type given the situation and surrounding code, instead of trying to hand-optimize little details.

The main time to worry about specific data sizes, I think, is with repeated data, such as arrays and structs. This is especially true when dealing with hardware drivers, where you may need to match a bit-exact data structure to the hardware.

Speaking of which, in C/C++, the smallest variable size is actually a single bit, not a byte! But be aware of the CPU and memory controller's minimum "word" size. Allocating a single bit will use up to a "word" of data. Dealing with i386 and AMD64, we can read-write a single byte. Other systems have a minimum of 2-byte or 4-byte words. In these systems, if you want to write a single byte, the compiler behind the scenes actually generates code to pull the word into a register, replace the single byte, then write the register back as a full word. Very slow for single-byte writes, but exceptionally fast when writing bulk data sequentially!
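
For readers who have not seen the single-bit sizing mentioned here, a minimal C++ bit-field sketch (my own addition, not from the thread) looks like this; the exact packing is implementation-defined:

#include <iostream>

// Bit-fields pack struct members into individual bits; the compiler chooses the exact layout.
struct Flags {
    unsigned int is_visible : 1;  // 1 bit
    unsigned int is_dirty   : 1;  // 1 bit
    unsigned int priority   : 3;  // 3 bits (values 0-7)
};

int main() {
    Flags f{};  // zero-initialize all fields
    f.is_visible = 1;
    f.priority = 5;
    std::cout << sizeof(Flags) << " bytes\n";  // typically 4, the size of the underlying unsigned int
    return 0;
}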

Also, despite being called a "char", it really is just the same thing as a u8. The terms were coined a very long time ago! Regardless of the terms, it is really all about how the underlying CPU handles things! These languages are simply exposing raw CPU features and abstracting them away using simpler concepts :)

Vincent Milum Jr

Also, for reference, the u8, i8, u16, i16, etc. names actually came from a handful of C/C++ libraries, especially in the video game programming world. Rust merely adopted what these libraries were already doing and made it the standard, since it is easier to read/write ;)

And in your picture above, "long" and "long int" are both traditionally 32-bit, not 64-bit. "int" has a variable number of bits depending on the architecture, and "long long" is the 64-bit variant! This is especially helpful to note when dealing with micro-controllers that may have 8-bit or 16-bit CPUs, or when moving up to 64-bit CPUs.

Basti Ortiz

Oh, boy. That's a lot of low-level stuff I've never even considered. And to think, C++ used to be classified as a "high-level language". Indeed, I have much to learn. Thanks for taking the time to write all of this down for me. It means a lot.

jeikabu

cstdint/stdint.h has int8_t, uint8_t, and some other typedefs you might find handy. =)
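
As a small illustration (my own addition, not from the thread) of those fixed-width aliases, and of the fact that they are plain typedefs rather than wrapper classes:

#include <cstdint>
#include <iostream>
#include <type_traits>

int main() {
    std::uint8_t  small  = 200;    // exactly 1 byte
    std::uint16_t medium = 60000;  // exactly 2 bytes
    std::int64_t  big    = -1;     // exactly 8 bytes

    // These are ordinary type aliases; on most platforms uint8_t is just unsigned char.
    std::cout << std::boolalpha
              << std::is_same<std::uint8_t, unsigned char>::value << "\n";  // usually true
    std::cout << static_cast<int>(small) << " " << medium << " " << big << "\n";
    return 0;
}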

Basti Ortiz

Ooooh, thanks! I never knew this library existed. That's pretty cool. I'll consider using it. Under the hood, are these typedefs just class wrappers that take advantage of type punning to accomplish this fancy business?

jeikabu

Typedefs don't involve classes or anything that complicated. They're rather similar to Rust's type aliases.
"stdint" should be part of the libraries included with whatever compiler you're using. It should "just work" without having to install anything else.

Basti Ortiz

This is great! I'll see how I can use them properly in my side projects.

HS

Nice article. Although it's not a "thing of the past" per se. C# and Java, for example, do some optimizations behind the scenes where using a smaller type like short can result in a performance drop. There are code examples (in this case C#) where people switched from short to int in loops and gained performance just by writing the loops as usual (at least I'm used to everyone using int in loops). The thing is, someone tried to optimize by thinking about these details and instead got the opposite result. So it's not that we don't think about this; it's more that some weird stuff happens when you try to micro-optimize.

Basti Ortiz

That is very strange indeed. I should run performance tests just in case this doesn't happen for me. Thanks for the heads up!