
How Character Encoding Saved Me

Jethro Daniel (herocod3r) ・2 min read

The string is a very basic yet important data type that every developer uses, but because it is so common, it is easy to neglect how it actually works internally. That is partly thanks to modern languages, which hand us the string data type on a platter of gold.

However, it is important to know the inner workings of the things we use. We can never tell when it might save us hours of debugging.

So recently I was working on a project that involved cryptography using the AES encryption algorithm, and as part of the system's requirements I needed to encrypt the data twice, using two keys. Like any other developer would, I went ahead and did some googling, copied some code online 😅, pasted it, and surely it worked, or so I thought.

When it came to testing, I discovered that when I encrypted a string and then decrypted it, I got the data back with some additional weird characters. This was causing my JSON serialisation code to break.

Now, the hard part 😪: how do I remove the unwanted characters in my decryption method? It looked like a trivial problem, but it took me 3 days to wrap my mind around it.

Finally, after 3 days of heavy googling and doubting my existence and life choices 😂, I stumbled on a piece of code from Stack Overflow:

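            using System.Text;

            // Wrapper method added so the snippet compiles; the name is illustrative.
            static string StripNonAscii(string s)
            {
                StringBuilder sb = new StringBuilder(s.Length);
                foreach (char c in s)
                {
                    if ((int)c > 127) // outside ASCII; 127 (DEL) still passes, use >= 127 to drop it too
                        continue;
                    if ((int)c < 32)  // control characters, including tab and newline
                        continue;
                    sb.Append(c);
                }
                return sb.ToString();
            }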

It turned out that a basic knowledge of character encoding would have saved me those sleepless nights. The fix was simple: check each character in the string and discard any whose code is below 32 (the ASCII control characters) or above 127 (outside the ASCII range).
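Before stripping anything, it can help to look at which characters are actually outside that range. The sketch below is only an illustration: `CharInspector`, `FindSuspectChars` and the sample string are made-up names and data, not code from the project. It applies the same range test as the filter and reports each offending character with its index and code.

```csharp
using System;
using System.Collections.Generic;

public static class CharInspector
{
    // Lists every character whose code is below 32 or above 127,
    // i.e. exactly the characters the filter would discard.
    public static List<string> FindSuspectChars(string s)
    {
        var found = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            int code = s[i];
            if (code < 32 || code > 127)
                found.Add($"index {i}: U+{code:X4}");
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical decrypted payload: valid JSON followed by junk characters.
        string decrypted = "{\"id\":1}\u0000\u0008\u00E9";

        foreach (string line in FindSuspectChars(decrypted))
            Console.WriteLine(line);
        // index 8: U+0000
        // index 9: U+0008
        // index 10: U+00E9
    }
}
```

Seeing the actual codes makes it easier to judge whether discarding them is safe for the data at hand.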

Every letter, in English or any other language, is ultimately stored as bytes. Over the years, different character encoding specifications have defined how characters map to those bytes: ASCII, and the Unicode encodings UTF-8, UTF-16 and UTF-32, among others. A proper knowledge of these encodings and how they differ lets you build better systems, especially ones that handle text in many languages.

You can learn more from:
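To make that concrete, here is a small sketch (the `EncodingDemo` class and `HexBytes` helper are hypothetical names chosen for this example) showing that one character becomes different bytes under different encodings, and that decoding bytes with the wrong encoding produces different characters than the ones that went in. It only uses `System.Text.Encoding` from the .NET base library.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes the text with the given encoding and formats the bytes as hex.
    public static string HexBytes(string text, Encoding encoding)
        => BitConverter.ToString(encoding.GetBytes(text));

    public static void Main()
    {
        string text = "\u00E9"; // the single character "é"

        Console.WriteLine(HexBytes(text, Encoding.UTF8));    // C3-A9
        Console.WriteLine(HexBytes(text, Encoding.Unicode)); // E9-00 (UTF-16, little-endian)
        Console.WriteLine(HexBytes(text, Encoding.UTF32));   // E9-00-00-00

        // Reading UTF-8 bytes as ISO-8859-1 turns one character into two.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(utf8);
        Console.WriteLine(misread.Length); // 2
    }
}
```

Whenever strings are turned into bytes and back, for example around an encryption step, using the same encoding on both sides avoids this kind of mismatch.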

What Every Programmer should know about string

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
