When I was starting out as a junior developer, it was very hard for me to tell some concepts apart, because I never took the time to sit down and learn about them. This article is meant to help with that journey. I will simplify and shorten a few things here and there to make the topics easier to digest.
This article will only give you an overview of which is which, so that you can dive deeper into the topics that interest you the most. 🙂
Encoding is all about representing some kind of information (let's say a word or a text) in a way that can be conveniently saved or transferred. A popular thing to do on computers is to write a little text (like this one) and save it to file storage (like my computer's hard disk). File storage can only save binary data (think: 1s and 0s), so I need to encode my text in 1s and 0s.
Let's say I want to encode the German word for the color green, "grün", in binary form. The result would look like this:
01100111 01110010 11000011 10111100 01101110
Each letter is transformed into a series of eight 1s and 0s. Why did I choose a German word? Because the German character "ü" does not fit into one set of eight 0s and 1s; it needs two. This encoding is called UTF-8. You have probably heard of it before.
Because text is so important to people who use computers, they have come up with many ways to encode it. UTF-8 is the one used the most, because it can encode the characters of almost all languages (German, Swedish, Japanese, Chinese, …).
Obviously, when you know that the binary data (1s and 0s) are in UTF-8, you can easily turn them back into your actual characters and display them on the screen again.
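If you want to try the round trip yourself, here is a minimal Python sketch. It assumes the word is "grün" and uses only Python's built-in `str.encode` and `bytes.decode`; the variable names are my own.

```python
# Encode the German word for green as UTF-8 and show each byte as eight bits.
word = "grün"
encoded = word.encode("utf-8")  # text -> bytes
bits = [format(byte, "08b") for byte in encoded]
print(" ".join(bits))

# "ü" needs two bytes, so the 4 characters become 5 bytes.
print(len(word), "characters,", len(encoded), "bytes")

# Because we know the bytes are UTF-8, we can turn them back into text.
decoded = encoded.decode("utf-8")
print(decoded)
```

Decoding the same bytes with a different encoding would give you different (and probably garbled) characters, which is why knowing the encoding matters.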
The binary data could also represent other kinds of information. A series of 0s and 1s could – for example – be numbers in a spreadsheet or colors in a picture of a puppy. Math (especially in school) mostly uses the decimal system. If our binary data represents data in the decimal system, our encoding would look like this:
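As a rough illustration, the following Python sketch prints a few decimal numbers next to their binary form. The sample numbers are my own choice and the eight-bit width is an assumption to keep the output readable.

```python
# Hypothetical sample values; any whole number from 0 to 255 fits into eight bits.
numbers = [5, 42, 255]

# Map each decimal number to its eight-bit binary representation.
as_bits = {n: format(n, "08b") for n in numbers}
for n, pattern in as_bits.items():
    print(n, "->", pattern)

# Reading the bits as a base-2 number gives the decimal value back.
restored = [int(pattern, 2) for pattern in as_bits.values()]
print(restored)
```

The same eight bits could just as well stand for a letter or a color; only the agreed encoding tells you how to read them.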
There are also other ways to encode data, some of which are Base64, hexadecimal (hex) or even ASCII. Encoding always means the same thing:
I have data and I want to represent it in another system. To achieve this, I encode my data according to the other system's rules. You don't lose or hide data; it is just in a different format.
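To see that nothing is lost or hidden, here is a small Python sketch that encodes the same bytes as Base64 and as hex and then turns both back again. It uses the standard library's `base64` module; the sample word and variable names are my own.

```python
import base64

data = "grün".encode("utf-8")  # the raw bytes we want to represent

# Two different representations of the same bytes.
as_base64 = base64.b64encode(data).decode("ascii")
as_hex = data.hex()
print(as_base64)
print(as_hex)

# Anybody can reverse both encodings; no secret is needed.
from_base64 = base64.b64decode(as_base64)
from_hex = bytes.fromhex(as_hex)
print(from_base64.decode("utf-8"), from_hex.decode("utf-8"))
```

This is also why Base64 should never be mistaken for a way to protect data: it only changes the format.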
For a long time, it has been considered best practice to store only the "hashed" version of a user's password in the database instead of the password itself. But what does hashing mean, and what does this have to do with passwords?
Almost all programming languages come with a set of hashing functions (or you can add that functionality by installing libraries). You can provide some kind of input (like a word or short text) to them and they will return a "hashed" version of that input.
How is this different from encoding? Hashing functions only work one way: You can turn your input into a hash, but you cannot turn your hash into the input again. It is a one-way street. You actually lose some of the information by hashing.
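A quick way to get a feeling for the lost information is to hash a short and a very long input and compare the results. The sketch below uses SHA-256 from Python's `hashlib` purely to illustrate the one-way idea; it is not meant as a recommendation for storing passwords, and the helper name `sha256_hex` is my own.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 hash of a text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

short_hash = sha256_hex("grün")
long_hash = sha256_hex("grün " * 1000)  # about 5000 characters of input

# Both hashes have the same length, no matter how long the input was.
print(len(short_hash), len(long_hash))

# The same input always gives the same hash ...
print(sha256_hex("grün") == short_hash)

# ... but a slightly different input gives a completely different one.
print(sha256_hex("grun") == short_hash)
```

Since thousands of characters are squeezed into the same fixed-size output, there is no way to reconstruct the input from the hash alone.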
The fact that this is a one-way street is really great for storing passwords: You actually don't want to store the original password of the user, in case something goes sideways and an attacker steals your database.
Hashing functions have different "strengths": As computers grow more powerful, these functions usually become crackable after some years, which is why it is important to use an up-to-date one. Old hashing functions like md5 have been considered insecure for years now, and the current (early 2020) favorite is bcrypt (which actually uses a lot of clever ideas to defend itself from being cracked soon).
So hashing is different from encoding, because it intentionally loses some of its data to become a "one-way street" function. Hashes cannot be turned back into the input that produced them, which is why they are great for storing passwords.
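Here is a rough sketch of how storing and checking a password hash could look in Python. bcrypt is not part of Python's standard library, so this example swaps in PBKDF2-HMAC-SHA256 from `hashlib` instead, which follows the same general idea (a random salt plus a deliberately slow function). The function names and the iteration count are my own assumptions, not a recommendation.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; higher is slower and harder to crack

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return a random salt and the slow hash of the password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Hash the login attempt with the stored salt and compare the results."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)

# Only the salt and the hash go into the database, never the password itself.
salt, stored = hash_password("my secret password")
print(verify_password("my secret password", salt, stored))
print(verify_password("a wrong guess", salt, stored))
```

Notice that checking a login never requires turning the hash back into the password: you hash the attempt again and compare the two hashes.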
Keeping a secret a secret is one of the hardest and most valued things in human interaction, and therefore also in computing. What is true for secret government files is just as true for private conversations between coworkers or family members. You really don't want your parents to know that they will be getting a new computer as a holiday gift, which is why you want to encrypt your shopping list.
Encryption is a way of turning any input (like a word or a file) into data that cannot be turned back into the input without knowing the secret of how to do this. A typical secret is a password: You use a password to encrypt your holiday shopping list and save it on your computer.
So in the simplest case, encryption is a "two-way street" that allows you to turn input and a password into an encrypted representation. This representation can later be turned back into the input if you have the password. There are much more sophisticated encryption algorithms, but they all try to do the same thing: Come up with a representation of your data that is not useful to anybody who doesn't have the password. Encryption does not lose any data.
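To make the "two-way street" a bit more tangible, here is a toy sketch in Python. It is not a real encryption algorithm and must not be used to protect anything: it simply derives a stream of bytes from the password with SHA-256 and combines it with the data using XOR. All function names are made up for this illustration; for real secrets, use a well-reviewed encryption library.

```python
import hashlib

def keystream(password: str, length: int) -> bytes:
    """Derive `length` bytes from the password (toy construction, not secure)."""
    key = hashlib.sha256(password.encode("utf-8")).digest()
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return stream[:length]

def toy_encrypt(data: bytes, password: str) -> bytes:
    """XOR the data with the password-derived keystream."""
    stream = keystream(password, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))

# XOR is its own inverse, so decrypting is the same operation as encrypting.
toy_decrypt = toy_encrypt

shopping_list = "new computer for my parents".encode("utf-8")
secret = toy_encrypt(shopping_list, "correct password")

recovered = toy_decrypt(secret, "correct password")
garbage = toy_decrypt(secret, "wrong password")
print(recovered.decode("utf-8"))
print(garbage == shopping_list)
```

With the right password you get every byte of the shopping list back, so no data is lost. With the wrong password you only get meaningless bytes.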
So what is the difference between all three? Encoding is just a different representation of your input. Hashing is a one-way street to get an unrecoverable representation. And encryption is a safe way to store information that you want to look at again later.
| |Loses data?|Readable by anybody?|Output can be turned back into input?|Use case|
|---|---|---|---|---|
|Encoding|No|Yes|Yes|Saving text into a file, sending data to a server, …|
|Hashing|Yes|No|No|Storing passwords|
|Encryption|No|No|Yes|Saving secret files, sending secret messages or emails|
I hope this little article helped you get a general overview of these topics. I certainly made a few generalizations to break the material down into a more digestible format. From here, you can take a deep dive into any of them.
- Selection of different encoders: onlinebinarytools.com
- Great base64 encoder: base64encode.org
- Short article, explaining what is wrong about many hashing functions: dusted.codes: "SHA-256 is not a secure password hashing algorithm"
- Wikipedia article about bcrypt