DEV Community

Paweł bbkr Pabian


UTF-8 code points U+1234 meaning

A code point (sometimes written as "codepoint") is an ordinal position in the addressable encoding space.

In ASCII, code points were very straightforward because the addressable space was continuous. The binary value of a character, converted to decimal, was its code point. There were 128 code points defined, as you already know from previous posts. For example, a character with binary value 01100001 is at code point 97.

$ raku -e '0b01100001.say'
97

Raku also provides a convenient method to get the decimal code point of a character:

$ raku -e '"a".ord.say'
97

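In UTF-8 things get more complicated. In the previous post about UTF-8 internal design I explained that a 10xxxxxx byte can never start a character and that every byte of a multibyte character reserves its leading bits for control purposes. This makes the addressable space non-continuous, so the binary value of a character is no longer its code point.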
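A quick way to see this is to compare a code point with the raw bytes of the same character. The sketch below uses é (U+00E9), a character chosen here purely for illustration:

```raku
# Illustrative character (an assumption of this sketch): é, code point U+00E9.
my $char = 0xE9.chr;

# Its code point, in decimal.
say $char.ord;                                                # 233

# Its UTF-8 bytes, each printed as 8 bits. Control bits are included.
say $char.encode('utf-8').list.map(*.fmt('%08b')).join(' ');  # 11000011 10101001

# Reading those 16 bits as one plain binary number gives a different value.
say 0b1100001110101001;                                       # 50089
```

The byte value and the code point only agree for characters in the ASCII range.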

Code points are usually written in hexadecimal notation with a U+ prefix, for example U+0105. Let's first learn how to convert a code point to the binary value of a character.

1. Convert hexadecimal value to bits.

$ raku -e '0x0105.base( 2 ).say'
100000101

2. Find the smallest character byte length that can fit this number of bits (9 in this case). Control bits do not count.

  • 0xxxxxxx - This has 7 bits left, too small.
  • 110xxxxx 10xxxxxx - This has 11 bits left, perfect!

3. Fill the free bits with the code point bits 100000101, starting from the right.

110xx100 10000101

4. Fill remaining free bits with 0s.

11000100 10000101

5. Done:

11000100 10000101
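The same steps can be sketched with bitwise operators. This is a minimal illustration, not a full encoder: `encode-two-bytes` is a hypothetical helper name, and it handles only the two-byte layout used in this example.

```raku
# Hypothetical helper (not part of Raku): encode a code point that needs
# the two-byte layout 110xxxxx 10xxxxxx.
sub encode-two-bytes(Int $code-point) {
    # Steps 1-2: the value must not fit in 7 bits but must fit in 11 bits.
    die "needs a different byte length" unless 0x80 <= $code-point <= 0x7FF;

    # Step 3: the rightmost 6 bits go into the continuation byte 10xxxxxx.
    my $second = 0b10000000 +| ($code-point +& 0b00111111);

    # Steps 3-4: the remaining bits go into 110xxxxx, unused slots stay 0.
    my $first = 0b11000000 +| ($code-point +> 6);

    return $first, $second;
}

my @bytes = encode-two-bytes(0x0105);
say @bytes.map(*.fmt('%08b')).join(' ');   # 11000100 10000101
```

Filling with zeros in step 4 needs no explicit code here, because the shifted value simply has no bits set in the unused positions.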

Let's check which character U+0105 points to:

$ raku -e 'Buf.new( 0b11000100, 0b10000101 ).decode.say'
ą

And just to confirm:

$ raku -e '"ą".ord.base( 16 ).say'
105

The opposite conversion is straightforward: take the binary representation of a character, throw away the control bits, and convert the remaining bits to hexadecimal.
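As a rough sketch of that reverse direction, again limited to the two-byte layout and using a hypothetical helper name `decode-two-bytes`:

```raku
# Hypothetical helper (not part of Raku): recover the code point from
# the two bytes of a 110xxxxx 10xxxxxx character.
sub decode-two-bytes(Int $first, Int $second) {
    # Check the control bits before discarding them.
    die "not a two-byte start byte" unless ($first  +& 0b11100000) == 0b11000000;
    die "not a continuation byte"   unless ($second +& 0b11000000) == 0b10000000;

    # Keep only the free bits and join them: 5 from the first byte, 6 from the second.
    return (($first +& 0b00011111) +< 6) +| ($second +& 0b00111111);
}

my $code-point = decode-two-bytes(0b11000100, 0b10000101);
say $code-point.base(16);            # 105
say "U+" ~ $code-point.fmt('%04X');  # U+0105
```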

Coming up next: Glyphs and graphemes.
