Paweł bbkr Pabian

# UTF-8 code points U+1234 meaning

A code point (sometimes written as "codepoint") is an ordinal position in the addressable encoding space.

In ASCII, code points were very straightforward because the addressable space was continuous. The binary value of a character, converted to decimal, was its code point. There were 128 code points defined, as you already know from previous posts. For example, the `a` character with binary value `01100001` is at code point `97`.

```
$ raku -e '0b01100001.say'
97
```

Raku also provides a convenient method to get the decimal code point of a character:

```
$ raku -e '"a".ord.say'
97
```

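In UTF-8 things get complicated. In the previous post about UTF-8 internal design I explained that part of every byte in a multibyte character is reserved for control bits (for example, a `10xxxxxx` byte can never start a character), which makes the space of byte values non-continuous. The binary value of a character is therefore no longer equal to its code point.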
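To make the difference visible, here is a minimal sketch (not part of the original series) that prints the code point of a character next to the bytes of its UTF-8 encoding. It only uses the built-in `ord`, `encode` and `list` methods.

```raku
# Compare the code point of a character with the bytes of its UTF-8 encoding.
for "a", "ą" -> $char {
    my $bytes = $char.encode("UTF-8");
    say "{$char}: code point {$char.ord}, UTF-8 bytes {$bytes.list}";
}
```

For `a` the single byte is the same number as the code point (97). For `ą` the code point is 261, while the encoded bytes are 196 and 133.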

Code points are usually written in hexadecimal notation prefixed with `U+`, for example `U+0105`. Let's first learn how to convert a code point to the binary value of a character in UTF-8.

1. Convert the hexadecimal value to bits.

```
$ raku -e '0x0105.base( 2 ).say'
100000101
```

2. Find the smallest character byte length that can fit this number of bits (9 in this case). Control bits do not count.

• `0xxxxxxx` - This has 7 free bits, too small.
• `110xxxxx` `10xxxxxx` - This has 11 free bits, perfect!

3. Fill the free bits with the bits of our code point, `100000101`, starting from the right.

`110xx100` `10000101`

4. Fill the remaining free bits with `0`s.

`11000100` `10000101`

5. Done:

`11000100` `10000101`

Let's check which character `U+0105` points to:

```
$ raku -e 'Buf.new( 0b11000100, 0b10000101 ).decode.say'
ą
```

And just to confirm:

```
$ raku -e '"ą".ord.base( 16 ).say'
105
```
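The five steps above can be sketched as a small subroutine. This is an illustration only: `utf8-bytes` is a hypothetical helper name, the three-byte and four-byte templates follow the same control-bit scheme as the two-byte one used in this post, and no validation is done beyond the 21-bit limit (for example, surrogates and values above `U+10FFFF` are not rejected). In real code `Str.encode` does all of this for you.

```raku
# Hypothetical helper: encode one code point as UTF-8 bytes, step by step.
sub utf8-bytes(UInt $code-point) {
    # Step 1: the code point as a string of bits.
    my $bits = $code-point.base(2);

    # Step 2: byte templates, smallest first; "x" marks a free (non-control) bit.
    my @templates =
        '0xxxxxxx',
        '110xxxxx 10xxxxxx',
        '1110xxxx 10xxxxxx 10xxxxxx',
        '11110xxx 10xxxxxx 10xxxxxx 10xxxxxx';
    my $template = @templates.first({ .comb('x').elems >= $bits.chars })
        // die "Code point does not fit in 21 bits";

    # Steps 3 and 4: pad the bits with leading zeros up to the number of free
    # bits, which gives the same result as filling from the right.
    my $free    = $template.comb('x').elems;
    my @payload = ('0' x ($free - $bits.chars) ~ $bits).comb;
    my $filled  = $template.subst(/x/, { @payload.shift }, :g);

    # Step 5: turn each group of 8 bits into a byte value.
    return $filled.words.map({ .parse-base(2) }).List;
}

say utf8-bytes(0x0105)».base(2);
say Buf.new(utf8-bytes(0x0105)).decode;
```

The first line of output shows the same two bytes that were built by hand above, and the second decodes them back to `ą`.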

The opposite conversion is straightforward: take the binary representation of a character, throw away the control bits and convert the remaining bits to hexadecimal.
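As a rough sketch of that reverse direction, the hypothetical helper `code-point` below renders each byte as 8 bits, strips the control bits (everything up to and including the first `0` in the byte) and joins what is left. It assumes the bytes form exactly one well-formed UTF-8 character and performs no validation.

```raku
# Hypothetical helper: recover the code point from the UTF-8 bytes of one character.
sub code-point(*@bytes) {
    my $payload = @bytes.map({
        # Control bits are "0" for a single byte, "10" for a continuation byte
        # and "110", "1110" or "11110" for a leading byte.
        .fmt('%08b').subst(/^ 1* 0/, '')
    }).join;

    # The remaining bits, read as one binary number, are the code point.
    return $payload.parse-base(2);
}

say code-point(0b11000100, 0b10000101).base(16);
```

For the two bytes of `ą` this prints `105`, matching `U+0105`.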

Coming up next: Glyphs and graphemes.