DEV Community

Victoria Drake


A unicode substitution cipher algorithm

Full transparency: I occasionally waste time messing around on Twitter. (Gasp! Shock!) One of the ways I waste time messing around on Twitter is by writing my name in my profile with different unicode character "fonts," 𝖑𝖎𝖐𝖊 𝖙𝖍𝖎𝖘 𝖔𝖓𝖊. I previously did this by searching for different unicode characters on Google, then one-by-one copying and pasting them into the "Name" field on my Twitter profile. Since this method of wasting time was a bit of a time waster, I decided (in true programmer fashion) to write a tool that would help me save some time while wasting it.

I dubbed the tool uni-pretty. It lets you type any characters into a field and then converts them into unicode characters that also represent letters, giving you fancy "fonts" that override a website's CSS, like in your Twitter profile. (Sorry, Internet.)

[Image: uni-pretty screenshot]

The tool's first naive iteration existed for about twenty minutes while I copy-pasted unicode characters into a data structure. This approach of storing the characters in the JavaScript file, called hard-coding, is fraught with issues. Besides requiring every character from every font style to be stored, it's painstaking to build and hard to update, and more code means more room for errors.

Fortunately, working with unicode means that there's a way to avoid the whole mess of having to store all the font characters: unicode numbers are sequential. More importantly, the special characters in unicode that could be used as fonts (meaning that there's a matching character for most or all of the letters of the alphabet) are always in the following sequence: capital A-Z, lowercase a-z.

For example, in the fancy unicode above, the lowercase letter "L" character has the unicode number U+1D591 and HTML code &#120209;. The next letter in the sequence, a lowercase letter "M," has the unicode number U+1D592 and HTML code &#120210;. Notice how the numbers in those codes increment by one.

Why's this relevant? Since each special character can be referenced by a number, and we know that the order of the sequence is always the same (capital A-Z, lowercase a-z), we're able to produce any character simply by knowing the first number of its font sequence (the capital "A"). If this reminds you of anything, you can borrow my decoder pin.
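
The arithmetic can be checked directly in JavaScript. This is a minimal sketch, assuming a runtime with `String.fromCodePoint` and `codePointAt` (ES2015 or later). The names `fancyA` and `fancyLetter` are illustrative, and U+1D56C is the capital "A" that sits 37 places before the lowercase "L" mentioned above.

```javascript
// Code point of the capital "A" that starts this font's sequence.
// Assumption: the sequence runs A-Z then a-z with no gaps.
const fancyA = 0x1D56C; // 120172 in decimal

// Hypothetical helper: return the fancy character for a plain ASCII letter.
function fancyLetter(letter) {
    const code = letter.charCodeAt(0);
    let offset;
    if (code >= 65 && code <= 90) {
        offset = code - 65;          // 'A'..'Z' -> 0..25
    } else if (code >= 97 && code <= 122) {
        offset = code - 97 + 26;     // 'a'..'z' -> 26..51
    } else {
        return letter;               // leave anything else untouched
    }
    return String.fromCodePoint(fancyA + offset);
}

// Lowercase "l" and "m" sit next to each other in the sequence.
console.log(fancyLetter('l').codePointAt(0).toString(16)); // 1d591
console.log(fancyLetter('m').codePointAt(0).toString(16)); // 1d592
```

Only the starting number is stored; every other character falls out of the offset.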

In cryptography, the Caesar cipher (or shift cipher) is a simple method of encryption that utilizes substitution of one character for another in order to encode a message. This is typically done using the alphabet and a shift "key" that tells you which letter to substitute for the original one. For example, if I were trying to encode the word "cat" with a right shift of 3, it would look like this:

c a t
f d w
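
For comparison, here is a rough sketch of that classic shift in JavaScript. The function name `caesar` and the choice to handle only lowercase a-z are assumptions made for illustration.

```javascript
// Hypothetical helper: shift each lowercase letter `shift` places to the right,
// wrapping from "z" back to "a". Other characters pass through unchanged.
function caesar(text, shift) {
    const alphabet = 'abcdefghijklmnopqrstuvwxyz';
    return text
        .split('')
        .map(ch => {
            const i = alphabet.indexOf(ch);
            if (i === -1) return ch;
            // The double modulo keeps the index in range for negative shifts too
            return alphabet[(((i + shift) % 26) + 26) % 26];
        })
        .join('');
}

console.log(caesar('cat', 3));  // fdw
console.log(caesar('fdw', -3)); // cat
```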

With this concept, encoding our plain text letters as a unicode "font" is a simple process. All we need is an array to look up our plain text letters in, and the number of our unicode capital "A" representation, which is the first in its sequence. Since some unicode numbers also include letters (they're hexadecimal, which is still sequential, but an unnecessary complication) and since the intent is to display the page in HTML, we'll use the HTML code number &#120172;, with the extra bits (the &# and the ;) removed for brevity.

var plain = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'];

var fancyA = 120172;

Since we know that the letter sequence of the fancy unicode is the same as our plain text array, any letter can be found by using its index in the plain text array as an offset from the fancy capital "A" number. For example, capital "B" in fancy unicode is the capital "A" number, 120172, plus B's index, which is 1: 120173.

Here's our conversion function:

function convert(string) {
    // Create a variable to store our converted letters
    let converted = [];
    // Break string into substrings (letters)
    let arr = string.split('');
    // Search plain array for indexes of letters
    arr.forEach(element => {
        let i = plain.indexOf(element);
        // If the character isn't a letter (not found in the plain array)
        if (i === -1) {
            // Substitute a space
            converted.push(' ');
        } else {
            // Get relevant character from fancy number + index
            let unicode = fancyA + i;
            // Return as HTML code
            converted.push('&#' + unicode + ';');
        }

    });
    // Print the converted letters as a string
    console.log(converted.join(''));
}
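
The function above logs HTML codes, which suits a page that renders them. As a hedged alternative for use outside HTML, the sketch below returns the actual characters instead, using `String.fromCodePoint`. The names `convertToText`, `plainLetters`, and `fancyStart` are hypothetical, and building the alphabet from a string rather than an array is my own shortcut.

```javascript
const plainLetters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
const fancyStart = 120172; // same number as fancyA in the article

// Hypothetical variant of convert(): returns a string of real characters
// rather than logging HTML codes.
function convertToText(string) {
    return string
        .split('')
        .map(letter => {
            const i = plainLetters.indexOf(letter);
            // Mirror the original behaviour: non-letters become a space
            return i === -1 ? ' ' : String.fromCodePoint(fancyStart + i);
        })
        .join('');
}

const result = convertToText('Hi!');
console.log(result);
// Code points of the result, for checking the arithmetic
console.log(Array.from(result, ch => ch.codePointAt(0))); // [ 120179, 120206, 32 ]
```

"H" is at index 7 and "i" at index 34, so the output is the starting number plus 7 and plus 34, followed by a space for the exclamation mark.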

A neat possibility for this method of encoding requires a departure from my original purpose, which was to create a human-readable representation of the original string. If the purpose was instead to produce a cipher, this could be done by using any unicode index in place of fancyA as long as the character indexed isn't a representation of a capital "A."

Here's the same code set up with a simplified plain text array, and a non-letter-representation unicode key:

var plain = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'];

var key = 9016;
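
With those two values, the cipher version of the conversion might look like the sketch below. The function name `encode` is hypothetical. It follows the same index-plus-key arithmetic as `convert`, but returns the HTML codes rather than logging them, and lowercasing the input first is an assumption so that the 26-letter array is enough.

```javascript
const plain = 'abcdefghijklmnopqrstuvwxyz'.split('');
const key = 9016;

// Hypothetical encoder: letter index + key, written out as an HTML code.
function encode(message) {
    return message
        .toLowerCase()
        .split('')
        .map(letter => {
            const i = plain.indexOf(letter);
            // Anything outside a-z (spaces, digits, punctuation) becomes a space
            return i === -1 ? ' ' : '&#' + (key + i) + ';';
        })
        .join('');
}

console.log(encode('cat')); // &#9018;&#9016;&#9035;
```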

You might be able to imagine that decoding a cipher produced by this method would be relatively straightforward, once you knew the encoding secret. You'd simply need to subtract the key from the HTML code numbers of the encoded characters, then find the relevant plain text letters at the remaining indexes.
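
Those two steps can be sketched as follows. This assumes the encoded message is a string of HTML codes such as `&#9018;`, and that the key and plain text array match the ones used for encoding. The function name `decode` is hypothetical.

```javascript
const plain = 'abcdefghijklmnopqrstuvwxyz'.split('');
const key = 9016;

// Hypothetical decoder for a string of HTML codes such as "&#9018;&#9016;".
function decode(encoded) {
    // Replace each HTML code; everything else (such as spaces) is left as-is
    return encoded.replace(/&#(\d+);/g, (match, digits) => {
        const i = Number(digits) - key; // subtract the key to recover the index
        // Codes that fall outside the alphabet are left untouched
        return i >= 0 && i < plain.length ? plain[i] : match;
    });
}

console.log(decode('&#9018;&#9016;&#9035; &#9023;&#9016;&#9035;')); // cat hat
```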

Well, that's it for today. Be sure to drink your Ovaltine and we'll see you right here next Monday at 5:45!

Oh, and... ⍔⍠⍟⍘⍣⍒⍥⍦⍝⍒⍥⍚⍠⍟⍤ ⍒⍟⍕ ⍨⍖⍝⍔⍠⍞⍖ ⍥⍠ ⍥⍙⍖ ⍔⍣⍪⍡⍥⍚⍔ ⍦⍟⍚⍔⍠⍕⍖ ⍤⍖⍔⍣⍖⍥ ⍤⍠⍔⍚⍖⍥⍪

:)

Top comments (9)

Ben Halpern

𝔻𝕒𝕞𝕟 𝕥𝕙𝕒𝕥 𝕚𝕤 𝕒 𝕗𝕦𝕟 𝕥𝕠𝕠𝕝

𝔸𝕞𝕒𝕫𝕚𝕟𝕘 𝕛𝕠𝕓 𝕨𝕚𝕥𝕙 𝕒𝕝𝕝 𝕠𝕗 𝕥𝕙𝕚𝕤 𝕍𝕚𝕔𝕜𝕪

Dave Cridland • Edited

>>> u''.join([ c if c == ' ' else unichr(ord(c) - 0x2352 + ord('A')) for c in s ])

I can never resist these things.

The three letter words were helpful - there's limited options there, so I thought aiming for an AND or a THE would be a good crib. The fact that you've left spaces unencoded does, of course, make this much simpler.

I do get, though, that this article isn't about cryptography. :-)

Victoria Drake

Python3:

''.join([ c if c == ' ' else chr(ord(c) - 0x2352 + ord('A')) for c in s ])

I love this. Let's be secret code buddies.

Dave Cridland

⍢⎄⎁⍴⌽⌯⍙⎄⎂⎃⌯⎁⍴⍼⍴⍼⍱⍴⎁⌯⎃⍾⌯⍴⍽⍲⍾⍳⍴⌯⎈⍾⎄⎁⌯⎂⍿⍰⍲⍴⎂⌯⍽⍴⎇⎃⌯⎃⍸⍼⍴⌽

Max Cerrina

I feel so 𝓯𝓪𝓷𝓬𝔂!

Mihail Malo

I like this approach in node:

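const big = str => {
  // "ucs2" is little-endian: out[i] is the low byte, out[i + 1] the high byte
  const out = Buffer.from(str, "ucs2"),
    len = out.length
  for (let i = 0; i < len; i += 2) {
    const ascii = out[i]
    // Skip anything that isn't printable ASCII (the high byte must be zero)
    if (out[i + 1] !== 0 || ascii < 0x21 || ascii > 0x7E) continue
    // Fullwidth forms start at U+FF01, i.e. ASCII + 0xFEE0
    out[i] = ascii - 0x20
    out[i + 1] = 0xff
  }
  return out.toString("ucs2")
}
big("Big Chungus")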

Could probably do similar with TextEncoder in web.

Wilson Tovar

🅻🅾🅾🅺🆂 🅰🅼🅰🆉🅸🅽🅶

Vinay Jain

𝖜𝖔𝖜 𝖙𝖍𝖎𝖘 𝖎𝖘 𝖘𝖔 𝖈𝖔𝖔𝖑

Markus Cadonau

Regarding accessibility, I would avoid using such character replacements on public profiles.