I had to recreate the way dev.to generates anchors in Kotlin Multi-Platform (MPL). It took way longer than expected!
bitdowntoc is a Markdown TOC (table of contents) generator I have been developing for a while now. You can paste in your markdown, click generate, and have a nice table of contents inserted wherever you want. I recently added a "devto" profile, which tries to reuse the anchors already generated by dev.to for headings. This forced me to dig into dev.to's anchor generation, and oh boy, it's a mess.
This article explains part of the challenge and digs further into one specific pain point: emojis. For those interested, the full anchor generation implementation for dev.to is available here. Try it yourself: bitdowntoc (use the devto profile).
- What exactly are anchors
- The mess of dev.to anchors
- Pseudo-code
- Terms of the challenge
- How I ended up stripping emojis
What exactly are anchors
Let's say you add the following heading to your article markdown:
## Hello dev.to!
When you preview or save it on dev.to, it becomes:
<h2>
  <a name="hello-devto" href="#hello-devto"></a>
  Hello dev.to!
</h2>
As you can see, dev.to automatically adds an <a> with a name attribute. The latter is called an anchor: hello-devto can be used in a fragment link for navigation, using hashtag + anchor: #hello-devto (see Finally a clean and easy way to add a table of contents to dev.to articles 🤩 for more information).
To generate a table of contents for dev.to, you thus need to know how those anchors are generated, to point to the right fragment link.
The mess of dev.to anchors
Since dev.to is based on forem, which is open-source, it wasn't difficult to find the exact code used to generate anchors. Here it is (html_rouge.rb):
# .. other pre-processing ...
def slugify(string) # here, string = heading title (pre-processed)
  stripped = ActionView::Base.full_sanitizer.sanitize string
  stripped.downcase
    .gsub(EmojiRegex::RGIEmoji, "").strip
    .gsub(/[[:punct:]]/u, "")
    .gsub(/\s+/, "-")
end
This code goes through the following steps (simplified):
1. It sanitizes the title using the Rails full_sanitizer. Sanitizing means stripping HTML tags, to make the content safe;
2. It lowercases the whole string (.downcase);
3. It strips the emojis using a Ruby library called emoji_regex, which is based upon a JavaScript library of the same name;
4. It removes punctuation with a regex made of a POSIX character class;
5. It replaces (consecutive) spaces with dashes.
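The steps above can be sketched in common Kotlin as follows. This is only a rough approximation, not the dev.to implementation: the name naiveSlugify is hypothetical, sanitizing and emoji removal are left out, and the punctuation class is limited to ASCII (an assumption, since Ruby's [[:punct:]] with the u flag also covers non-ASCII punctuation).

```kotlin
// ASCII punctuation only (assumption): Ruby's [[:punct:]] with /u matches more than this.
val naivePunctuation = Regex("[!-/:-@\\[-`{-~]")
val naiveWhitespaceRuns = Regex("\\s+")

// Hypothetical helper mirroring the Ruby chain, minus sanitizing and emoji removal.
fun naiveSlugify(title: String): String =
    title.lowercase()                        // .downcase
        .trim()                              // .strip
        .replace(naivePunctuation, "")       // .gsub(/[[:punct:]]/u, "")
        .replace(naiveWhitespaceRuns, "-")   // .gsub(/\s+/, "-")
```

With this sketch, naiveSlugify("Hello dev.to!") gives hello-devto, the same anchor as in the HTML shown earlier.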
It seems rather straightforward, right? But as easy as it looks, it yields some very, very strange results. Don't believe me? Have a look at the following markdown document, and try to come up with a valid table of contents for dev.to:
# Hello dev.to!
# Kotlin is `fun`
# `<hello href="http://link.me">` <hello> `&%£`
# `<hello>` <world> `&`
# space ๐
# ' ' สป ี ๊ ๊ โฒ โณ โด ใ " หฎ
# 'Hello' means ˮBonjourˮ en français héhé
# โป emojis ๐ ๐ ๐ โฎ โ โ
# html? <0>
# Check out [bitdowntoc](https://bitdowntoc.ch), It is *awesome*!
Here is what a valid table of contents should look like:
- [Hello dev.to!](#hello-devto)
- [Kotlin is `fun`](#kotlin-is-raw-fun-endraw-)
- [`<hello href="http://link.me">` <hello> `&%£`](#-raw-lthello-hrefhttplinkmegt-endraw-raw-amp£-endraw-)
- [`<hello>` <world> `&`](#-raw-lthellogt-endraw-raw-amp-endraw-)
- [space ๐](#space)
- [' ' สป ี ๊ ๊ โฒ โณ โด ใ " หฎ](#-สป-๊-๊-หฎ)
- ['Hello' means ˮBonjourˮ en français héhé](#hello-means-ˮbonjourˮ-en-français-héhé)
- [โป emojis ๐ ๐ ๐ โฎ โ โ](#โป-emojis-๐-โฎ-โ)
- [html? <0>](#html-lt0gt)
- [Check out bitdowntoc, It is *awesome*!](#check-out-bitdowntoc-it-is-awesome)
Note: this table of contents was entirely generated by bitdowntoc, except the "html? <0>" entry. In its current implementation, bitdowntoc would actually generate the anchor "html", which doesn't work. But everything else is supported!
One word: wow. What is this #-raw-lthellogt-endraw-raw-amp-endraw-?? Just to give you an idea, both GitHub and GitLab will generate a simple #hello for the same title!
Pseudo-code
I won't go into the details here (I tried writing a comprehensive article, and it was way too long), but for you to better understand how this mess came to be, here is how you could re-implement the logic (the order matters):
1. 🤯 Inline code (in backticks) is handled differently than the rest. Take all content between backticks, and escape HTML entities in it (& becomes &amp;, < becomes &lt;, etc.).
2. Strip all remaining HTML tags (<hello>, <a href="hello">, </hello>, etc.). This of course doesn't apply to the escaped sequences from step 1.
3. Handle markdown links: constructs such as [link text](URL) are stripped to keep only link text.
4. 🤯 Escape the remaining HTML entities (beware of escaping twice! The &amp; from step 1 should NOT become &amp;amp;).
5. 🤯🤯🤯🤯 Remove some of the emojis (yup, I said some. Keep reading!).
6. Trim: remove leading and trailing spaces.
7. 🤯 Replace all backtick pairs with -raw- and -endraw- (`hello` becomes -raw-hello-endraw-).
8. Remove all punctuation characters (well, not all: some quotes and other punctuation marks are kept, I will let you figure out which).
9. Replace all (consecutive) spaces with a single dash (same as GitLab).
Most of the weird stuff comes from steps 1, 4, and 7 - and emojis!
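To make the ordering more concrete, here is a minimal sketch of steps 1 to 4 and 6 to 9 in common Kotlin. It is not the bitdowntoc implementation: all names are hypothetical, the emoji step (5) is skipped, only &, < and > are escaped, and punctuation is limited to ASCII. It also assumes that the raw/endraw markers can be emitted surrounded by spaces, so that the final step turns them into the -raw-...-endraw- shape seen in the anchors above.

```kotlin
val sketchCodeSpan = Regex("`([^`]*)`")
val sketchHtmlTag = Regex("</?[a-zA-Z][^>]*>")
val sketchMarkdownLink = Regex("\\[([^\\]]*)\\]\\([^)]*\\)")
val sketchPunctuation = Regex("[!-/:-@\\[-`{-~]") // ASCII only (assumption)
val sketchSpaces = Regex("\\s+")

// Only &, < and > are escaped here (assumption); & goes first to avoid double escaping.
fun sketchEscapeHtml(text: String): String =
    text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

// Steps 2 to 4, applied to the text found outside inline code.
fun sketchPlainText(text: String): String =
    sketchEscapeHtml(
        text.replace(sketchHtmlTag, "")                      // step 2: strip tags
            .replace(sketchMarkdownLink) { it.groupValues[1] } // step 3: keep link text
    )

fun sketchDevtoAnchor(title: String): String {
    val escaped = StringBuilder()
    var cursor = 0
    for (match in sketchCodeSpan.findAll(title)) {
        escaped.append(sketchPlainText(title.substring(cursor, match.range.first)))
        // step 1: escape inside backticks, keeping the backticks for step 7
        escaped.append('`').append(sketchEscapeHtml(match.groupValues[1])).append('`')
        cursor = match.range.last + 1
    }
    escaped.append(sketchPlainText(title.substring(cursor)))

    return escaped.toString()
        .trim()                                                         // step 6
        .replace(sketchCodeSpan) { " raw ${it.groupValues[1]} endraw " } // step 7
        .lowercase()
        .replace(sketchPunctuation, "")                                 // step 8
        .replace(sketchSpaces, "-")                                     // step 9
}
```

Processing the text in segments (inside and outside backticks) is one way to avoid the double escaping mentioned in step 4. On the headings listed earlier, sketchDevtoAnchor("Kotlin is `fun`") gives kotlin-is-raw-fun-endraw-.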
Terms of the challenge
Kotlin MPL - basic code only!
bitdowntoc is entirely written in Kotlin and offers both a command line (a JAR) and a web interface. To avoid duplicating the logic, most of the code lives in a Kotlin Common module, which I reuse in both a Kotlin JVM and a Kotlin JS module. This is part of Kotlin Multi-Platform (Kotlin MP).
To be executable on different platforms, the common module is written in a way that is independent of the underlying runtime. How is this possible, you ask? Because common code uses only a subset of the Kotlin language and standard library, which I call the Kotlin Multiplatform Language (Kotlin MPL) here: a limited set of APIs and constructs that are guaranteed to be supported on all Kotlin Multiplatform targets.
Kotlin MPL - simple regexes only!
For my project, the most significant limitation of Kotlin MPL is regular expressions (regexes). Kotlin Common aims at interoperability and platform agnosticism. To support regexes, it thus provides a wrapper called Regex, which takes a string as a single parameter. This string, say .*, is then "copy-pasted" inside a Pattern in Java - Pattern.compile(".*") - and a RegExp object in JavaScript with the Unicode flag - /.*/u. Do you see the problem here?
It means any regex that you use in common code must:
- be supported by both the Java and JS implementations, and
- be interpreted the same by both engines.
Java supports the \p{IsLatin} character class, while JavaScript supports \p{Emoji}. A regex written as a Java string literal needs its backslashes doubled, while a JS regex literal needs only one. The more complex the regex, the higher the chance corner cases will be handled differently.
So in short, I can only rely on very simple regexes: no fancy character classes!
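As an illustration of what "simple" can look like, the sketch below sticks to an explicit \uXXXX range, a construct that both java.util.regex and a JS RegExp with the u flag understand. The function name and the choice of characters (typographic quotes, U+2018 to U+201F) are purely illustrative and do not reflect what dev.to strips or keeps.

```kotlin
// An explicit code point range instead of a named class such as \p{...}.
val typographicQuotes = Regex("[\\u2018-\\u201F]")

// Hypothetical helper: removes curly single and double quotes from a title.
fun stripTypographicQuotes(text: String): String =
    text.replace(typographicQuotes, "")
```

The price is verbosity: every character or range has to be spelled out by hand.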
How I ended up stripping emojis
Contrary to other platforms, dev.to doesn't treat emojis like any other special character (that is, just stripping them all). Take this:
# โป emojis ๐ ๐ โฎ ๐ โ โ ๐ฝ โฅ
These are the anchors generated by different platforms:
platform | anchor |
---|---|
dev.to | โป-emojis-๐-โฎ-โ-โฅ |
github | emojis------- |
gitlab | -emojis- |
Why those emojis and not all? This is because dev.to handles emojis separately using a Ruby library called emoji_regex. Here is what its documentation says:
EmojiRegex::RGIEmoji is the regex you most likely want. It matches all emoji recommended for general interchange, as defined by the Unicode standard's RGI_Emoji property.
Looking at the Unicode standard:
ED-27. RGI emoji set - The set of all emoji (characters and sequences) covered by ED-20, ED-21, ED-22, ED-23, ED-24, and ED-25.
This is the subset of all valid emoji (characters and sequences) recommended for general interchange.
This corresponds to the RGI_Emoji property.
Follow the links if you are interested, you'll see that it doesn't help a lot. You get huge txt files, with lots of "if this emoji is followed by ... but not ... and ..." mumbo jumbo. I was overwhelmed with information, yet couldn't find a simple list of all emojis falling into those categories.
I tried many different things - remember that I cannot use regex character classes, even though one exists in JavaScript (\p{Emoji}). I tried to understand the logic, use Unicode groups in regexes, and even use different regexes depending on the platform (JS/JVM) with some nasty tricks. All in vain.
The epiphany 💡 came after a long day of trial and error:
1. find a list of all emojis, ideally in the form of a regex (I used the one from https://github.com/sweaver2112/Regex-combined-emojis),
2. paste it into a dev.to heading,
3. hit preview and see which emojis are still present in the generated anchor,
4. remove the emojis still in the anchor (3) from the regex (1).
Jackpot!
In Kotlin code:
fun main() {
    // copy from https://github.com/sweaver2112/Regex-combined-emojis,
    // then add "|" at the beginning and end of the string
    val allEmojis =
        "|\uD83E\uDDD1\uD83C\uDFFB\u200D❤️\u200D\uD83D\uDC8B\u200D\uD83E\uDDD1\uD83C\uDFFC|...|โซ|"
    // paste the allEmojis to a dev.to title,
    // take the resulting anchor,
    // add "|" between each emoji
    // (I used Sublime Text with find: "(.*)", replace: "|\1")
    val keptEmojis =
        "\uD83D\uDD73|...|โซ"
    val finalRegex = keptEmojis
        .split("|")
        .map { "|$it|" } // whole matches only: never cut into a longer sequence
        .fold(allEmojis) { acc, it -> acc.replace(it, "|") }
        .replace(Regex("\\|+"), "|") // collapse the runs of "|" left behind
        .trim('|')
    println(finalRegex)
    // the result is the removedEmojisRegex
}
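The same subtraction can also be expressed with collections instead of string replacement. This is an alternative sketch, not the code I actually ran: the name subtractAlternatives is hypothetical, and it assumes both inputs are plain |-separated lists with no other regex syntax in them.

```kotlin
// Keeps the original order of the alternatives, dropping every entry present in keptAlternatives.
fun subtractAlternatives(allAlternatives: String, keptAlternatives: String): String {
    val kept = keptAlternatives.split("|").toSet()
    return allAlternatives.split("|")
        .filter { it.isNotEmpty() && it !in kept } // whole entries only
        .joinToString("|")
}
```

Comparing whole entries plays the same role as wrapping each kept emoji in | above: a kept single emoji never removes part of a longer sequence that contains it.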
The final regex is a bit long though, 13,226 characters 🤪, but it does the job. It isn't even slow during matching!
private val removedEmojisRegex = Regex(
"๐ง๐ปโค๏ธ๐๐ง๐ผ|๐ง๐ปโค๏ธ๐๐ง๐ฝ|...|๐ฅญ|๐|๐|...|โ|โซ|โช|โฌ|โฌ|โพ|โฝ"
)
title
// ... sanitize ...
.replace(removedEmojisRegex, "")
The regex even takes into account the emoji modifiers. The first two blocks (separated by |) actually represent the emojis 🧑🏻‍❤️‍💋‍🧑🏼 and 🧑🏻‍❤️‍💋‍🧑🏽 (kiss with different skin tones).
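Whether a modified emoji is removed as a whole depends on the order of the alternatives, since both the Java and JS engines try them from left to right. Here is a minimal sketch of that behavior, assuming (this is my reading, not something the source list documents) that longer sequences come before the shorter emojis they start with. The constants and the helper name are hypothetical.

```kotlin
const val THUMBS_UP = "\uD83D\uDC4D"        // U+1F44D
const val MEDIUM_SKIN_TONE = "\uD83C\uDFFD" // U+1F3FD, a skin tone modifier

// The longer sequence comes first: it is matched and removed as a whole.
val longestFirst = Regex("$THUMBS_UP$MEDIUM_SKIN_TONE|$THUMBS_UP")

// The shorter alternative comes first: the modifier is left behind.
val shortestFirst = Regex("$THUMBS_UP|$THUMBS_UP$MEDIUM_SKIN_TONE")

fun stripWith(emojis: Regex, title: String): String = title.replace(emojis, "")
```

With longestFirst, a title ending in a thumbs up with a skin tone loses the whole sequence, while shortestFirst would leave a stray modifier in the anchor.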
And this is how I completed the challenge, proof that brute force is sometimes the right option, even though it may crash your IDE... 12K+ character long strings with Unicode aren't especially liked by IntelliJ!
PS: this article is a bit different than my previous ones, let me know if this kind of format is of interest to you! I may have other "challenges" in stock. As always, a like or comment would help me keep my motivation up โป.
Top comments (3)
Is there anything special (as far as you can tell) about the emojis that dev leaves in anchors? Or is it just because the Ruby emoji regex library they use doesn't match them?
I would say the kept emojis are the ones with a text representation that have been around for a long time, combined with the ones that are not considered as supported by major vendors and therefore expected to be usable generally.
But again, I never completely made sense of it. If you have a suggestion, here is the list of emojis that are preserved (afaik):
๐ณ ๐จ ๐ฏ ๐ ๐ ๐ต ๐ด ๐ ๐ ๐ฃ ๐ฟ ๐ ๐ท ๐ธ ๐ต ๐ถ ๐ฝ ๐บ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ฃ ๐ค ๐ข ๐ณ ๐ฅ ๐ฉ ๐ฐ ๐ ๐ฐ ๐ก ๐ค ๐ฅ ๐ฆ ๐ง ๐จ ๐ฉ ๐ช ๐ซ ๐ฌ ๐ ๐ ๐ ๐น ๐ผ ๐ถ ๐ ๐ ๐ ๐ ๐ฅ ๐จ ๐ฑ ๐ฒ ๐ ๐ฝ ๐ฏ ๐ ๐ท ๐ณ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ ๐ก ๐ก ๐ ๐ ๐ ๐ ๐ ฐ ๐ ฑ ๐ พ ๐ ฟ ๐ ๐ท ๐ณ โบ โน โ โฃ โค โ โ โ โท โน โ โฐ โฉ โจ โด โ โฑ โฒ โ โ โ โ โฑ โ โ โ โธ โ โฅ โฆ โฃ โ โ โ โจ โ โ โ โ โ โ โ โ โ โ โ โฐ โฑ โ โข โฃ โฌ โ โก โ โฌ โ โฌ โ โ โ โฉ โช โคด โคต โ โก โธ โฏ โ โฆ โช โฎ โถ โญ โฏ โ โฎ โธ โน โบ โ โ โ โง โ โพ โ โป โ โ โ โณ โด โ ยฉ ยฎ โข โน โ ใ ใ โผ โป โช โซ
Sorry but I didn't read your article yet.
Markdown table of contents+emojis+KMP -> I'm directly jumping to use the product instead :)