Christian Parpart

Posted on Sep 7, 2021

A Look into a terminal emulator's text stack

#programming #terminal #unicode #emoji

TL;DR

I am going to describe how I implement rendering text in my terminal emulator so that we get programming ligatures, emoji, variation selectors, and yes, even ZWJ emoji.

Introduction

Text rendering in a virtual terminal emulator can be as simple as just iterating over each grid cell's character, mapping it to a Font's bitmap glyph, and rendering it to the target surface at the appropriate position. But it can also be as complex as a web browser's text stack[1] if one may want to do it right.

In contrast to web browsers (or word processors),
terminal emulators are still rendering text the way they did render text 50 years ago - plus some non-standard extensions that did arise over the decades.

Also, terminal screens weren't made with Unicode in mind, Unicode did not even exist back then, so there were a few workarounds and non-standardized ideas implemented in order to display complex Unicode text and symbols in terminals without a common formal ground that terminal application developers can rely on.

Text rendering in a terminal emulator puts some additional constraints on how to render, mostly because character placement is decided before text shaping is performed and must align to a fixed-size grid, which makes it almost impossible to properly render traditional Japanese text into the terminal, or Hewbrew right-to-left text (though, there is a handful of virtual terminal emulators that specialize
on the latter and an informal agreement on how to deal with wide characters on the terminal screen).

Not every character, or to be displayed symbol (such as Emoji) is as wide as exactly one grid cell's width, so additional measurements have to be taken into account for dealing with these characters as well.

Terminals and terminal applications until now did depend on a standard library API call wcwidth() that does ideally return the number of grid cells that an input character will occupy. This however became useless as soon as multi codepoint grapheme clusters have been introduced that still form a single user perceived character. wcswidh() should return the total number of grid cells of a sequence of characters, but these do not handle grapheme clusters nor variation selectors either.

Unicode - a very quick rundown in the context of terminals

Unicode is aiming to have one huge universal space where every imaginable "user perceived character" can be represented. A "user perceived character" is what the user that looks at that character thinks of as one unit. This is in direct contrast to what a character is in computer science. A "user perceived character" can be as simple as one single codepoint (32-bit value representing that
character) and as complex as an ordered sequence of 7 unbreakable codepoints to compose one single "user perceived character".

This places additional requirements to a virtual terminal emulator where each grid cell SHOULD contain exactly one "user perceived character" (also known as grapheme cluster), that is, an unbreakable codepoint sequence of one or more codepoints that must not be broken up into multiple grid cells before the actual text shaping or screen rendering has been performed.

Also, some grapheme clusters take up more than one grid cell in terms of display width, such as Emoji usually take two grid cells in width in order to merely match the Unicode (TR 51, section 2.2) specification's wording that the best practice to display Emoji is to render them in a square block to match the behavior of the old japanese phone companies that that first introduced Emoji.

Rendering text - a top level view

A terminal emulator's screen is divided into fixed width and (not necessarily equal) fixed height grid cells.
When rendering this grid, it is sufficient to iterate over each line and column and render each grid cell individually, at least when doing basic rendering.

Now, when non-trivial user perceived characters need to be supported, the rendering cannot just render each character individually, but must be first grouped into smaller chunks of text with common shared properties, across the grid cell boundaries.

Here we enter the world text shaping.

Text Shaping

Simply put, text shaping is the process of translating a sequence of codepoints into glyphs and their glyph positions. This differs from normal text processors and web browsers in a way because glyph placement in virtual terminal emulators are constrained.

When shaping text of a single grid line, the line is split into words, delimited by spaces, gaps and SGR attributes, that is, each word must represent the same SGR attributes for every "user perceived character" - for example the text's font style (such as bold or italic) or background color must be equal for each position in this sequence, from now on called "word").

The word can be used as a cacheable unit, in order to significantly speed up rendering for future renders.
The cache key is composed of the codepoint sequence of that word, as well as, the common shared SGR attributes.

This cacheable word is further segmented into sub runs by a series categories before text shaping can occur, that is, by Unicode script attribute (such as Latin or Hangul) and by symbol presentation (such as Emoji text presentation or Emoji presentation). This is important because one cannot just pass a string of text to the underlying text shaping engine with mixed properties, such as Hebrew text along with some Latin and Kanji or Emoji in between or a font style change for obvious reasons.
Each segment (usually called run) must be shaped individually with its own set of fallback fonts. Emoji are using a different font and font fallback list than regular text which uses a different font and font fallback list then bold, italic, or bold italic fonts.
Emoji also have two different presentation styles, the one that everybody expects and is named Emoji Emoji presentation (double-width colored Emoji) and the other one is named Emoji text presentation, which renders Emoji in non-colored text-alike pictograms.

The result of all sub runs is composing the sequence of glyph indices and glyph positions and is used as the cache value for the cacheable word to be passed to the next stage, the text renderer.

Text Rendering

The text renderer receives an already pre-shaped string of glyphs and glyph positions relative to screen coordinates of the first glyph to be rendered onto the screen.

In order to lower the pressure on the GPU and reduce synchronization times between CPU and GPU, all glyph bitmaps are stored into a texture atlas on the GPU, such that the text rendering (when everything has been already uploaded once) just needs to deal with indices to those glyph bitmaps into the appropriate texture atlas as well as screen coordinates where to render those glyphs on the target surface.

There is one texture atlas for gray scaled glyphs (this is usually standard text) as well as one texture atlas for colored glyphs (usually colored Emoji). Additionally there can be a third type of texture atlas for storing LCD anti-aliased bitmap glyphs.

Now, when rendering a string of glyphs and glyph positions, each glyph's texture atlas ID and atlas texture coordinate is appended into an atlas coordinate array along with each glyph's absolute screen coordinate and color attributes into a vertex buffer to be uploaded to the GPU.

When iterating over the whole screen buffer has finished, the atlas texture and vertex buffer are filled with all glyphs and related information that are required for rendering one frame. These buffers are then uploaded to the GPU to be drawn in a single GPU render command (such as glDrawArrays, or glDrawElements).

Other Terminal Emulator related Challenges

Most terminal applications use wcwidth() to detect the width of a potential "wide character". This is broken by design and a terminal emulator has to deal with such broken client applications. Some however use utf8proc's utf8proc_charwidth, another library to deal with Unicode,
and maybe some are using wcswidth(), which doesn't handle grapheme clusters nor variation selectors either.

The suggested way for future applications (emulator and client) would be to introduce feature detection and mode switching on how to process grapheme clusters and their width, if legacy apps are of concern.

Just looking at the algorithm, implementing grapheme cluster segmentation isn't too hard but in execution very expensive. Also grapheme cluster width computation is expensive. But luckily, in the context of terminal emulators, both can be optimized for the general case in terminal emulators, which is mostly US-ASCII, and yields almost no penalty with optimizations or a ~60% performance penalty when naively implemented.

Also, implementing proper text shaping into a fixed-grid terminal wasn't really the easiest when there is no other project or text to look at. I used "Blink's text stack" documentation and the Blink's source code (the Web renderer of Google Chrome) as bases and source of truth to understand this complex topic and then mapped my findings to the terminal world.

Since text shaping IS expensive, this cannot be done without caching without severely hurting user experience.

After investigating into the above optimization possibilities however, I do not see why a terminal emulator should not do provide support for complex Unicode, as the performance I have achieved so far is above average at least, and therefore should be sufficient for everyday use.

Bidirectional text was not addressed in this document nor in the implementation in the Contour terminal yet, as this imposes a new set of challenges that have to be dealt with separately. If that is of interest, there are a few terminals (such as mlterm) that have put great effort in getting such scripts into the terminal.
Hopefully this will be eventually added (or contributed) to my project at any time in the future, too and if so, I'll update this document accordingly.

Conclusion

If one went through all the pain on how Unicode, text segmentation, and text shaping works, you will be rewarded with a terminal emulator that is capable of rendering complex Unicode. At least as much as most of us desire - being able to use (power user/) programming ligatures and composed Emoji.

Some terminal emulators do partly support ligatures or rendering trivial single codepoint Emoji or a few of the composed Emoji codepoint sequences, but sadly, most seem to get Emoji wrong. While this is a great start, I'd say we can deliver better and more.

Final notes

I'd like to see the whole virtual terminal emulator world to join forces and agree on how to properly deal with complex text in a somewhat future-proof way.

And while we would be in such an ideal world, we could even throw away all the other legacies that are inevitably inherited from the ancient VT standards that are partly even older than I am. What would we be without dreams. ;-)

Some other terminal emulator developers and I have
started to address at least some of the many Unicode problems that are up until now undefined behavior by creating a formal specification on how a terminal emulator should behave in a backward and forward compatible way so that app developers and users will benefit.

References

Blink's text stack
UTS 11 - character width
UTS 24 - script property
UTS 29 - text segmentation (grapheme cluster, word boundary)
UTS 51 - Emoji

DEV Community