DEV Community

Joshua Matthews

Implementing UTF-8 Encoding in Zig

tl;dr
I created a library to read and write UTF-8 encoded Unicode code points in Zig for my simple text editor. Link to GitHub repo

Context

One of my side projects is a simple text editor in C called "Editlite". The purpose of this editor was to explore building one from the ground up and implementing functionality like plugins and incremental file loading to support editing large files. For this article I'll focus on just the Unicode support I wanted to add, though I may do a deeper dive into the editor in a future article.

Contents

  • Intro
  • The Beginning
  • The Format
  • The Library
  • The Build
  • C Header
  • The Integration

Intro

Learning Zig has been on my todo list for some time now, and I have been looking for excuses to work it into projects to become more familiar with the language. I finally found a little project where I could get my feet wet while also producing something that wouldn't be just a throwaway program. My simple text editor currently only supports the ASCII character set, so I thought a basic Unicode support library would be the perfect byte-sized intro to Zig.

The Beginning

Unicode is a standard that maps symbols to numeric code points so that applications can read and display the same text correctly on any system. For example, the number 69 (hex 45) maps to the character E, while the number 42069 (hex A455) maps to U+A455, a character in the Yi Syllables block.

There are several encodings for this standard, but the one I chose to implement is UTF-8, which represents each code point as a sequence of one to four 8-bit unsigned integers (octets).

The Format

The best way to visualize the UTF-8 encoding format is with this table from RFC 3629. Two things are out of scope for this article: the byte order mark, which UTF-8 does not require because the encoding is a plain byte sequence with no endianness to signal, and the UTF-16 surrogates (the reserved code point range U+D800-U+DFFF), which should be considered invalid in UTF-8.

UTF-8 format table
(Fig. 1)
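Written out as text, the RFC 3629 table pairs each code point range with an octet layout. The sketch below restates it as comments, plus a small helper that picks the sequence length for a code point. The helper name sequenceLength is an assumption for illustration and is not part of the library; the code targets Zig 0.11 or later.

```zig
const std = @import("std");

// Octet layouts from RFC 3629, section 3:
//   U+0000  .. U+007F     0xxxxxxx
//   U+0080  .. U+07FF     110xxxxx 10xxxxxx
//   U+0800  .. U+FFFF     1110xxxx 10xxxxxx 10xxxxxx
//   U+10000 .. U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

/// Hypothetical helper: how many octets a code point needs, or 0 when it
/// cannot be encoded (a UTF-16 surrogate, or anything above U+10FFFF).
fn sequenceLength(cp: u32) u8 {
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp >= 0xD800 and cp <= 0xDFFF) return 0; // reserved surrogate range
    if (cp <= 0xFFFF) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

test "sequence length follows the table" {
    try std.testing.expect(sequenceLength(0x45) == 1); // 'E'
    try std.testing.expect(sequenceLength(0xA455) == 3);
    try std.testing.expect(sequenceLength(0x1F600) == 4);
}
```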

We'll break down each line and start codifying it into Zig for the baseline rules of our library.

The first row, 0xxxxxxx, keeps the standard backwards compatible with the ASCII character set. The most significant bit needs to be 0, which leaves 7 free bits to use. So for an octet to be a valid 1-byte UTF-8 sequence, we need to ensure that its first bit is zero.

zig function checking the one octet marker
(Fig. 2)
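In text form, that check could look something like the sketch below. It is a minimal version written for Zig 0.11 or later, not the library's exact code, and the name isOneOctet is an assumption.

```zig
const std = @import("std");

/// True when the top bit is clear, i.e. the octet matches 0xxxxxxx.
/// (Hypothetical name; the library's own function may differ.)
fn isOneOctet(byte: u8) bool {
    return (byte & 0b1000_0000) == 0;
}

test "one octet marker" {
    try std.testing.expect(isOneOctet(0x45)); // 'E' is plain ASCII
    try std.testing.expect(!isOneOctet(0xEA)); // top bit set
}
```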

The next row defines the main pattern the other sequence types follow. The leading bits 110 signify the start of a 2-byte UTF-8 octet sequence, and the following octet leads with 10 to signify that it is the next byte in the current sequence. Notice that every marker ends in a 0 bit, and that terminating 0 is what lets us tell the sequence types apart.

This is how we could codify those two rules in Zig.

zig functions to check the next and two octet marker
(Fig. 3)
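As a rough sketch of those two rules, each check masks off the marker bits and compares them with the expected pattern. The names isNextOctet and isTwoOctet are assumptions, and the code targets Zig 0.11 or later.

```zig
const std = @import("std");

/// True for a continuation octet: 10xxxxxx. (Hypothetical name.)
fn isNextOctet(byte: u8) bool {
    return (byte & 0b1100_0000) == 0b1000_0000;
}

/// True for the lead octet of a 2-byte sequence: 110xxxxx. (Hypothetical name.)
fn isTwoOctet(byte: u8) bool {
    return (byte & 0b1110_0000) == 0b1100_0000;
}

test "next and two octet markers" {
    // U+00E9 encodes as C3 A9: a 2-byte lead followed by one continuation.
    try std.testing.expect(isTwoOctet(0xC3));
    try std.testing.expect(isNextOctet(0xA9));
    try std.testing.expect(!isNextOctet(0xC3));
}
```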

The remaining rows follow the same rules for the 3-byte and 4-byte octet sequences, with the lead markers 1110 and 11110 respectively.

zig functions to check the three and four octet marker
(Fig. 4)
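The same mask-and-compare idea extends to the longer markers. Again this is only a sketch for Zig 0.11 or later with assumed names (isThreeOctet, isFourOctet).

```zig
const std = @import("std");

/// True for the lead octet of a 3-byte sequence: 1110xxxx. (Hypothetical name.)
fn isThreeOctet(byte: u8) bool {
    return (byte & 0b1111_0000) == 0b1110_0000;
}

/// True for the lead octet of a 4-byte sequence: 11110xxx. (Hypothetical name.)
fn isFourOctet(byte: u8) bool {
    return (byte & 0b1111_1000) == 0b1111_0000;
}

test "three and four octet markers" {
    try std.testing.expect(isThreeOctet(0xEA)); // lead octet of U+A455
    try std.testing.expect(isFourOctet(0xF0));
    try std.testing.expect(!isFourOctet(0xF8)); // 11111xxx is not a valid lead
}
```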

Now that we've codified these rules we can move forward with writing our library.

The Library

The first thing we'll do is define some convenience types to work with.

zig enum and structure to represent octet info
(Fig. 5)

I wanted an easy way to keep track of which octet sequence type a given code point has without having to re-encode it. This library is also meant to be used in an existing C project, so we need to declare our enum with a c_int tag type and add the extern keyword to our structure.
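A C-compatible pair of types along those lines might look like this. OCT_INVALID, OCT_NEXT and OCT_ONE are the names used in this article; the remaining variants, the struct name CodePoint and its field names are assumptions for the sketch (Zig 0.11 or later).

```zig
const std = @import("std");

/// The c_int tag type gives the enum the same size as a C enum.
pub const OctetType = enum(c_int) {
    OCT_INVALID,
    OCT_NEXT,
    OCT_ONE,
    OCT_TWO, // assumed name
    OCT_THREE, // assumed name
    OCT_FOUR, // assumed name
};

/// `extern struct` guarantees a C-compatible field layout.
/// (Hypothetical struct and field names.)
pub const CodePoint = extern struct {
    kind: OctetType,
    val: u32,
};

test "code point record" {
    const e = CodePoint{ .kind = .OCT_ONE, .val = 0x45 };
    try std.testing.expect(e.kind == .OCT_ONE);
    try std.testing.expect(@sizeOf(OctetType) == @sizeOf(c_int));
}
```

Storing the octet type next to the value means a caller can tell how many bytes a code point occupies without encoding it again.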

zig function to verify utf8 octets
(Fig. 6)

Next we define a convenience function to verify multi-byte octet sequences.
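One plausible shape for verify_octets is a loop that checks every trailing octet for the 10 continuation marker. The signature below is an assumption, as the library's version may take different parameters; the sketch targets Zig 0.11 or later.

```zig
const std = @import("std");

/// Returns true when every octet in `rest` is a 10xxxxxx continuation
/// octet. (Assumed signature, for illustration only.)
fn verify_octets(rest: []const u8) bool {
    for (rest) |byte| {
        if ((byte & 0b1100_0000) != 0b1000_0000) return false;
    }
    return true;
}

test "continuation octets" {
    // The two trailing octets of U+A455 (EA 91 95).
    try std.testing.expect(verify_octets(&[_]u8{ 0x91, 0x95 }));
    // An ASCII byte in the middle of a sequence is malformed.
    try std.testing.expect(!verify_octets(&[_]u8{ 0x91, 0x45 }));
}
```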

zig function to get the octet type from u8
(Fig. 7)

The last convenience function we will explain determines the octet type of a given 8-bit value. We'll explain the export keyword with the next screenshot.
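A classifier like get_oct_type can simply try each marker from shortest to longest. This is a sketch for Zig 0.11 or later; the enum variants beyond OCT_INVALID, OCT_NEXT and OCT_ONE are assumed names.

```zig
const std = @import("std");

const OctetType = enum(c_int) {
    OCT_INVALID,
    OCT_NEXT,
    OCT_ONE,
    OCT_TWO, // assumed name
    OCT_THREE, // assumed name
    OCT_FOUR, // assumed name
};

/// Classifies a single octet by its leading marker bits.
export fn get_oct_type(byte: u8) OctetType {
    if ((byte & 0b1000_0000) == 0) return .OCT_ONE; // 0xxxxxxx
    if ((byte & 0b1100_0000) == 0b1000_0000) return .OCT_NEXT; // 10xxxxxx
    if ((byte & 0b1110_0000) == 0b1100_0000) return .OCT_TWO; // 110xxxxx
    if ((byte & 0b1111_0000) == 0b1110_0000) return .OCT_THREE; // 1110xxxx
    if ((byte & 0b1111_1000) == 0b1111_0000) return .OCT_FOUR; // 11110xxx
    return .OCT_INVALID; // 11111xxx never appears in UTF-8
}

test "classify octets" {
    try std.testing.expect(get_oct_type(0x45) == .OCT_ONE);
    try std.testing.expect(get_oct_type(0xEA) == .OCT_THREE);
    try std.testing.expect(get_oct_type(0x91) == .OCT_NEXT);
    try std.testing.expect(get_oct_type(0xFF) == .OCT_INVALID);
}
```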

Now the core of the library! The first function we'll define is the parsing function. I like to start at the unit level so we'll write a function to read the "next" Unicode code point in a given u8 array. Here's the definition of the function:

zig function definition for parsing next code point
(Fig. 8)

Of course, we add the export keyword so that Zig exposes this function as a symbol with the C ABI, and we also need to use C-compatible types. That is why the arr parameter is a [*]const u8, which is a many-item pointer of "unknown" length rather than a slice.
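To show how an exported function can work with a [*]const u8, here is a small unrelated helper that counts code points in a C-style buffer. The function utf8_count_points is hypothetical and not part of the library; it assumes the caller passes the length separately and that the buffer already holds well-formed UTF-8 (Zig 0.11 or later).

```zig
const std = @import("std");

/// Hypothetical exported helper: counts code points by counting every
/// octet that is not a 10xxxxxx continuation octet.
export fn utf8_count_points(arr: [*]const u8, len: usize) usize {
    // A many-item pointer carries no length, so we rebuild a
    // bounds-checked slice from the pointer and the length.
    const bytes: []const u8 = arr[0..len];
    var count: usize = 0;
    for (bytes) |byte| {
        if ((byte & 0b1100_0000) != 0b1000_0000) count += 1;
    }
    return count;
}

test "count code points through the C-style signature" {
    const text = [_]u8{ 0x45, 0xEA, 0x91, 0x95 }; // 'E' then U+A455
    try std.testing.expect(utf8_count_points(&text, text.len) == 2);
}
```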

Alright, now to the meat of the function:

zig function of implementation of parse functionality
(Fig. 9)

We start off with a reusable invalid_point object to return on errors, then do some housekeeping checks. Next we grab the starting octet out of the array and determine its octet type with get_oct_type, which gives us our initial result. From there we just switch on the type and parse accordingly.

The first two cases are easy. If the initial type is OCT_INVALID or OCT_NEXT then this isn't a correctly formatted UTF-8 string, so we return an "invalid" code point. For an OCT_ONE type we just pull the value straight out.

The rest of the cases are a little more involved but still straightforward. We check that the expected number of bytes is available for the type. We also verify that the rest of the bytes are formatted properly with verify_octets. Then we pull each value out and & it with a mask of its free bits. Lastly, we shift each value into position based on its free bits (6 * offset) and bitwise OR the values into our code point, most significant bits first.
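Putting those steps together, a parser could be sketched as follows. The names invalid_point, get_oct_type, verify_octets and arr come from this article; the function name parse_next, its len and start parameters, and the struct layout are assumptions. Like the steps described above, this sketch does not reject overlong encodings or surrogates. It targets Zig 0.11 or later.

```zig
const std = @import("std");

const OctetType = enum(c_int) { OCT_INVALID, OCT_NEXT, OCT_ONE, OCT_TWO, OCT_THREE, OCT_FOUR };
const CodePoint = extern struct { kind: OctetType, val: u32 }; // assumed layout

fn get_oct_type(byte: u8) OctetType {
    if ((byte & 0b1000_0000) == 0) return .OCT_ONE;
    if ((byte & 0b1100_0000) == 0b1000_0000) return .OCT_NEXT;
    if ((byte & 0b1110_0000) == 0b1100_0000) return .OCT_TWO;
    if ((byte & 0b1111_0000) == 0b1110_0000) return .OCT_THREE;
    if ((byte & 0b1111_1000) == 0b1111_0000) return .OCT_FOUR;
    return .OCT_INVALID;
}

fn verify_octets(rest: []const u8) bool {
    for (rest) |byte| {
        if ((byte & 0b1100_0000) != 0b1000_0000) return false;
    }
    return true;
}

/// Hypothetical signature: parses the code point starting at arr[start].
export fn parse_next(arr: [*]const u8, len: usize, start: usize) CodePoint {
    const invalid_point = CodePoint{ .kind = .OCT_INVALID, .val = 0 };
    if (start >= len) return invalid_point; // housekeeping

    const bytes = arr[start..len];
    const kind = get_oct_type(bytes[0]);
    const total: usize = switch (kind) {
        .OCT_INVALID, .OCT_NEXT => return invalid_point,
        .OCT_ONE => return CodePoint{ .kind = kind, .val = bytes[0] },
        .OCT_TWO => 2,
        .OCT_THREE => 3,
        .OCT_FOUR => 4,
    };
    // Free bits of the lead octet: 5, 4 or 3 depending on the length.
    const lead_mask: u8 = switch (total) {
        2 => 0b0001_1111,
        3 => 0b0000_1111,
        else => 0b0000_0111,
    };
    if (bytes.len < total or !verify_octets(bytes[1..total])) return invalid_point;

    // Each continuation octet contributes 6 bits, most significant first.
    var val: u32 = bytes[0] & lead_mask;
    for (bytes[1..total]) |byte| {
        val = (val << 6) | @as(u32, byte & 0b0011_1111);
    }
    return CodePoint{ .kind = kind, .val = val };
}

test "parse E followed by U+A455" {
    const text = [_]u8{ 0x45, 0xEA, 0x91, 0x95 };
    const first = parse_next(&text, text.len, 0);
    try std.testing.expect(first.kind == .OCT_ONE and first.val == 0x45);
    const second = parse_next(&text, text.len, 1);
    try std.testing.expect(second.kind == .OCT_THREE and second.val == 0xA455);
    // A truncated sequence is reported as invalid.
    try std.testing.expect(parse_next(&text, 3, 1).kind == .OCT_INVALID);
}
```

To walk a whole buffer, a caller would advance start by 1, 2, 3 or 4 depending on the returned octet type and stop at the first invalid result.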

That's it for parsing! Now you can use this method to loop over your UTF-8 encoded buffers!

The next crucial function to implement is the "write" function -- to take a u32 code point and write it back out to a UTF-8 encoded buffer.

zig function of write functionality
(Fig. 10)

This function is a thin wrapper for the C API: it takes in a C-style array, turns it into a Zig slice, and passes it on to the real write function.

zig function of write functionality, full
(Fig. 11)

The write function is also straightforward in its implementation. It switches on the code point's type and pulls the appropriate byte information out of the u32, most significant bits first. You'll notice some convenience functions, which are just thin inline functions that stamp the correct UTF-8 octet sequence marker onto each value.
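A rough sketch of both pieces, the C-facing wrapper and the slice-based writer, is shown below. All names (utf8_write, write, markLead, markNext) are assumptions. Unlike the library, this sketch picks the sequence length from the value's range instead of a stored octet type, and it returns the number of octets written, with 0 meaning failure. It targets Zig 0.11 or later.

```zig
const std = @import("std");

// Thin inline helpers that stamp marker bits onto the payload bits.
inline fn markLead(marker: u8, bits: u32) u8 {
    return marker | @as(u8, @truncate(bits));
}
inline fn markNext(bits: u32) u8 {
    return 0b1000_0000 | @as(u8, @truncate(bits & 0b0011_1111));
}

/// Encodes `point` into `buf`. Returns the octet count, or 0 when the
/// point is not encodable or `buf` is too small. (Hypothetical API.)
fn write(buf: []u8, point: u32) usize {
    if (point <= 0x7F) {
        if (buf.len < 1) return 0;
        buf[0] = @as(u8, @truncate(point));
        return 1;
    }
    if (point <= 0x7FF) {
        if (buf.len < 2) return 0;
        buf[0] = markLead(0b1100_0000, point >> 6);
        buf[1] = markNext(point);
        return 2;
    }
    if (point >= 0xD800 and point <= 0xDFFF) return 0; // UTF-16 surrogates
    if (point <= 0xFFFF) {
        if (buf.len < 3) return 0;
        buf[0] = markLead(0b1110_0000, point >> 12);
        buf[1] = markNext(point >> 6);
        buf[2] = markNext(point);
        return 3;
    }
    if (point <= 0x10FFFF) {
        if (buf.len < 4) return 0;
        buf[0] = markLead(0b1111_0000, point >> 18);
        buf[1] = markNext(point >> 12);
        buf[2] = markNext(point >> 6);
        buf[3] = markNext(point);
        return 4;
    }
    return 0;
}

/// C-facing wrapper: turns the pointer and length into a slice.
export fn utf8_write(arr: [*]u8, len: usize, point: u32) usize {
    return write(arr[0..len], point);
}

test "write U+A455" {
    var buf: [4]u8 = undefined;
    const n = utf8_write(&buf, buf.len, 0xA455);
    try std.testing.expect(n == 3);
    try std.testing.expectEqualSlices(u8, &[_]u8{ 0xEA, 0x91, 0x95 }, buf[0..n]);
}
```

Returning the octet count lets a caller advance its write position by that amount when encoding several code points in a row.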

Now we can continually write out code points to a given u8 buffer!

The Build

Next, the build.zig file. Most of it is standard, but we need to bundle the Zig compiler runtime into the static build, so this is how the build file looks.

zig build file
(Fig. 12)

C Header

Next we need to generate the C header file to accompany the static library. Luckily the types easily map over.

C header file
(Fig. 13)

You can ignore the __THROWNL and __nonnull(()) macros. The important part to take away is that we need to put the extern keyword in front of our function declarations. Now we can use this in our C programs just as you would normally use a static library!

The Integration

After integrating this library into my simple text editor, it can now read, write, and accept Unicode input.

Text editor displaying Unicode characters
(Fig. 14)
