Gary Kramlich

Posted on Apr 19, 2023

Parsing IRCv3 with Regex

#ircv3 #modernization #pidgin3 #repost

This article was originally posted on Patreon and has been brought over here to get all of the Pidgin Development/History posts into one single place.

To develop the modern chat features that everyone is expecting in Pidgin 3, we need protocols that support them as well. Unfortunately, most of the in-tree protocols either don't support these features or have so much tech debt that adding them is a non-trivial amount of work.

Please note that any copyrighted code in this post is licensed by me, Gary Kramlich <grim@reaperworld.com>, under the MIT License.

To combat these problems, we decided to write a brand new, from scratch, protocol plugin for IRCv3. One of the biggest benefits of this decision is that this protocol plugin is the first one in our history to be code reviewed from the very beginning. This has meant slightly slower development, but we're accumulating much less tech debt, which is a huge win.

One of the big issues with the existing IRC protocol plugin is that it uses lexical analysis to parse the IRC lines and work with them. While this works and is a common solution to this problem, the resulting code ends up being very difficult to understand at a later date, especially if it is not commented well. To tackle this problem, we've chosen to use regular expressions. Cue XKCD 1171 and others.

A while back, I wrote a proof of concept protocol plugin for Twitch, which is where I first started playing with regular expressions for parsing IRC. After testing this on some very fast Twitch channels, it became evident that the regular expressions could keep up and were a viable way forward.

There are many reasons I prefer regular expressions over a blob of code doing lexical analysis. First of all, regular expressions are largely reusable in other languages, which means you can write and test them once and then use them many times. Second, well-written regular expressions are much easier to read than hundreds of lines of code. Finally, it's easier to adjust a regular expression for changes to the format in a backwards compatible way than it is to adjust those hundreds of lines of code. Giving examples of this reasoning is out of scope for this post, but if there is interest I suppose I could go into more detail in another post.

With all of that out of the way, let's get to the fun stuff! We use a number of regular expressions to accomplish our task. If we tried to do this with a single regular expression, it would be impossible to read and maintain.

IRCv3 passes lines, that is, strings of text that end in \r\n. This means we use a buffered input stream to read a line and then run that line through our first regular expression.

The first regular expression's job is to split the IRC message into the expected fields of tags, source, command, middle, coda, and trailing. We don't really use the coda, but it may be useful for some. These names are all from the ABNF in the protocol documentation. We'll explain how this is used in just a bit. New lines have been added for readability only.

(?:@(?<tags>[^ ]+) )?
(?::(?<source>[^ ]+) +)?
(?<command>[^ :]+)
(?: +(?<middle>(?:[^ :]+(?: +[^ :]+)*)))*
(?<coda> +:(?<trailing>.*)?)?
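As a rough illustration of how this first expression can be applied, here is a minimal Python sketch. It is not the Pidgin code, which is written in C. Python's re module spells named groups as (?P<name>...), so the groups are translated accordingly; parse_line and the example line are hypothetical and only there for demonstration.

```python
import re

# The message pattern from above, joined into one expression. Python's re
# module needs (?P<name>...) for named groups; nothing else is changed.
MESSAGE_RE = re.compile(
    r"(?:@(?P<tags>[^ ]+) )?"
    r"(?::(?P<source>[^ ]+) +)?"
    r"(?P<command>[^ :]+)"
    r"(?: +(?P<middle>(?:[^ :]+(?: +[^ :]+)*)))*"
    r"(?P<coda> +:(?P<trailing>.*)?)?"
)


def parse_line(line):
    """Hypothetical helper: split one IRC line into its named fields."""
    # Drop the trailing \r\n before matching.
    match = MESSAGE_RE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    # Fields that are absent from the line come back as None.
    return match.groupdict()


fields = parse_line(
    "@id=123 :grim!grim@example.com PRIVMSG #pidgin :Hiya! How's it going?\r\n"
)
print(fields["command"], fields["middle"], fields["trailing"])
```

A line without tags, source, or parameters, such as a bare PING, still matches; the optional groups are simply None.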

Once we have all of these tokens, we want to parse the tags token and turn it into something usable. To do that we pass the value of the tags named group into the following regular expression that will match multiple times.

(?:(?<key>[A-Za-z0-9-\/]+)(?:=(?<value>[^\r\n;]*))?(?:;|$))

We then create a hash map to contain all these values as we parse them.
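A small Python sketch of that step might look like the following. The pattern is written the way the regex engine sees it, with single-backslash escapes and Python's (?P<name>...) group syntax. The parse_tags helper and the sample tag string are made up for illustration, and the sketch does not unescape tag values.

```python
import re

# One match per tag; the value group is None when a tag has no "=value" part.
TAG_RE = re.compile(
    r"(?:(?P<key>[A-Za-z0-9-\/]+)(?:=(?P<value>[^\r\n;]*))?(?:;|$))"
)


def parse_tags(tags):
    """Hypothetical helper: turn the raw tags token into a dict."""
    return {m.group("key"): m.group("value") for m in TAG_RE.finditer(tags)}


parsed = parse_tags("id=123;emotes=;subscriber;badges=subscriber/0,bits/100")
print(parsed)
```

Note that a tag with an empty value (emotes=) yields an empty string, while a tag with no value at all (subscriber) yields None, so the two cases stay distinguishable.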

Now that we have the base message and the tags parsed, we can discuss what to do with all of this data. The middle, coda, and trailing tokens can be quite confusing at first, but it's not too bad once you get the hang of it. As I mentioned earlier, we're not using the coda token, so we'll be ignoring it here.

When it comes to middle and trailing, the important thing to remember is that middle is a space separated list of parameters and trailing on the other hand is a single string that can contain spaces. To put this in perspective, think of the command token as a function name, middle as a list of parameters, and trailing as the final parameter. Something like the following pseudo code:

args = middle.split(" ")
args.append(trailing)
command(args)

Say we have a command of PRIVMSG, middle of #pidgin, and trailing of Hiya! How's it going? If we use these values to fill in the pseudo code we'd get something like the following:

args = "#pidgin".split(" ")
args.append("Hiya! How's it going?")
privmsg(args)

We use an array because we have no idea what kind of argument each command requires. We could try to codify this, but that depends a lot on the programming language that you're implementing this in.

In Pidgin, which is written in C, we can't really get too fancy, so we create a hash map of functions that's keyed on the command name. We look up the command in the hash map. If it's found, we call that function with our array of arguments. If not, we call our fallback handler, which typically just logs what we failed to parse so we can find it and fix it later.
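To make the lookup concrete, here is a hedged Python sketch of the same idea using a dict in place of the C hash map. Every name in it (build_params, handle_privmsg, handle_fallback, HANDLERS, dispatch) is hypothetical, and the handlers return strings only so the behavior is easy to see.

```python
def build_params(middle, trailing):
    """Combine middle and trailing into a single list of parameters."""
    # middle may contain runs of spaces, so drop the empty pieces.
    params = [p for p in middle.split(" ") if p] if middle else []
    if trailing is not None:
        params.append(trailing)
    return params


def handle_privmsg(tags, source, command, params):
    # Assumes exactly two parameters: the target and the message text.
    target, text = params
    return f"{source} -> {target}: {text}"


def handle_fallback(tags, source, command, params):
    # Stand-in for logging whatever we don't know how to handle yet.
    return f"unhandled command {command!r} with params {params!r}"


# Hypothetical handler table keyed on the command name.
HANDLERS = {"PRIVMSG": handle_privmsg}


def dispatch(tags, source, command, middle, trailing):
    handler = HANDLERS.get(command, handle_fallback)
    return handler(tags, source, command, build_params(middle, trailing))


print(dispatch({}, "grim", "PRIVMSG", "#pidgin", "Hiya! How's it going?"))
print(dispatch({}, None, "FOO", None, None))
```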

This architecture allows us to keep the parser very simple and leave all of the implementation details up to the commands as we implement them. For example, the PING command may come in with 0 or 1 parameters. If a parameter is specified, we're expected to send it back. So the pseudo code for that is basically:

func ping(tags, source, command, params) {
    if(params.length() == 1) {
        send("PONG %s", params[0])
    } else {
        send("PONG")
    }
}

The tags, source, and command parameters are unused here. Remember, the parser doesn't know what each implementation needs, so it passes all of the tokens to them. The command parameter is typically used to handle commands that are functionally aliases of each other, like PRIVMSG and NOTICE. Their differences are usually a user interface detail, but protocol-wise they're exactly the same, so knowing which command was used lets the implementation set a flag noting the difference.
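As a sketch of that aliasing, assuming a Python dict as the handler table, one function can be registered under both command names and use the command parameter to set the flag. The handle_message function and the shape of its return value are invented for this example.

```python
def handle_message(tags, source, command, params):
    # Assumes two parameters: the target and the message text.
    target, text = params
    return {
        "source": source,
        "target": target,
        "text": text,
        # The only difference between the two commands is this flag.
        "is_notice": command == "NOTICE",
    }


# The same hypothetical handler is registered for both commands.
HANDLERS = {"PRIVMSG": handle_message, "NOTICE": handle_message}

message = HANDLERS["NOTICE"]({}, "grim", "NOTICE", ["#pidgin", "Heads up!"])
print(message["is_notice"])
```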

So that's about it for the main IRCv3 parsing, but as I mentioned, this all started with parsing Twitch's IRCv3, and that gets into some more regular expressions which we'll quickly cover as well.

Twitch uses IRCv3 tags extensively, but the biggest, most complicated use of IRCv3 tags is for handling emotes (emojis). Emotes are a huge part of Twitch and there are a lot of them. To avoid wasting tons of CPU time on scanning short messages for millions of emotes, Twitch uses the emotes tag to tell the client where they are. The value of the emotes tag (remember we parsed this into a hash map earlier) looks like the following:

301696583:0-9,25-29/1290325:51-56

The emotes value is defined as the id of the emote, followed by a : and then a comma separated list of ranges of the text to replace in the message. Additional emotes can be specified by separating them with a /.

Again, we take a multiple regular expression approach to parsing this. First, we do a match all to get each emote and all of their ranges using the following regular expression:

(?:(?<id>[^:/]+)):(?<ranges>[^/]+)/?

Now that we have the ranges separated, we can split them into their individual values via the following regular expression:

(?<start>[^-,]+)-(?<end>[^,]+),?
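Chaining the two expressions could look something like this Python sketch, again with the named groups translated to (?P<name>...). The parse_emotes helper and the choice to return (id, start, end) tuples are assumptions made for illustration.

```python
import re

# First pass: one match per emote, with its id and its raw list of ranges.
EMOTE_RE = re.compile(r"(?:(?P<id>[^:/]+)):(?P<ranges>[^/]+)/?")
# Second pass: one match per start-end pair inside a ranges string.
RANGE_RE = re.compile(r"(?P<start>[^-,]+)-(?P<end>[^,]+),?")


def parse_emotes(value):
    """Hypothetical helper: return a list of (emote id, start, end) tuples."""
    emotes = []
    for emote in EMOTE_RE.finditer(value):
        for rng in RANGE_RE.finditer(emote.group("ranges")):
            emotes.append(
                (emote.group("id"), int(rng.group("start")), int(rng.group("end")))
            )
    return emotes


print(parse_emotes("301696583:0-9,25-29/1290325:51-56"))
```

For the example value above this yields three entries: two ranges for emote 301696583 and one for emote 1290325.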

Now you have all the pieces you need to build the message for display and replace the text with the actual emote, but we won't be tackling that code here as this post is all about parsing :-D.

Twitch also uses another simple format for the badges and badge-info tags. These are used to tell the client what badges the user has in the channel and information that goes with them. Twitch has additional documentation on these tags, but we'll look at a simple example of badges here.

subscriber/0,bits-leader/2,bits/100

We can use the following regular expression to parse this into the id and value for each badge.

(?<id>[^\/]+)\/(?<value>[^,]+),?
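Applied in Python, that might look like the sketch below. The parse_badges helper is hypothetical, the values are kept as strings, and the named groups use Python's (?P<name>...) spelling.

```python
import re

# One match per badge: the id before the slash and the value after it.
BADGE_RE = re.compile(r"(?P<id>[^\/]+)\/(?P<value>[^,]+),?")


def parse_badges(value):
    """Hypothetical helper: map each badge id to its value."""
    return {m.group("id"): m.group("value") for m in BADGE_RE.finditer(value)}


print(parse_badges("subscriber/0,bits-leader/2,bits/100"))
```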

Finally, the last regex we're going to cover is to help handle cheermotes on Twitch. A cheermote is displayed like a normal emote, but it also sends a monetary donation to the streamer. However, unlike a normal emote, cheermotes don't show up in a tag on the message, which means we need to manually parse them out of the message content.

To make matters more complicated, partnered streamers can have their own cheermotes. This means when you join a channel, you have to make a request to the Twitch API to get the cheermotes available on the channel and then dynamically create the regular expression. This isn't too bad, but does complicate things. To keep things simple here, we're going to assume that Twitch told us that this channel supports the Cheer, RIPCheer, and CheerWhal cheermotes. With that, our regular expression looks like the following:

\b(?:(?<emote>Cheer|RIPCheer|CheerWhal)(?<amount>\d+))\b

If we run that against the following example message, we'll see we match the Cheer emote and it has a value of 10.

Hiya, it's been awhile... Cheer10!
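Building the expression dynamically could be sketched in Python as follows. The prefix list is hard-coded here to stand in for whatever the Twitch API returned, build_cheermote_re is a hypothetical helper, and escaping each prefix with re.escape is a precaution I've added in case a prefix ever contains a regex metacharacter.

```python
import re


def build_cheermote_re(prefixes):
    """Hypothetical helper: compile a cheermote pattern for one channel."""
    # Escape each prefix so it is always matched literally.
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    return re.compile(rf"\b(?:(?P<emote>{alternatives})(?P<amount>\d+))\b")


# Stand-in for the list of cheermotes the channel supports.
cheer_re = build_cheermote_re(["Cheer", "RIPCheer", "CheerWhal"])

found = [
    (m.group("emote"), int(m.group("amount")))
    for m in cheer_re.finditer("Hiya, it's been awhile... Cheer10!")
]
print(found)
```

Because the amount group requires at least one digit, a plain word like Cheer on its own is left alone, and a longer prefix such as CheerWhal still matches even though Cheer is listed first.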

That's about everything I have for now. I hope you all enjoyed this in-depth look at parsing with regular expressions. I know I didn't explain how the regexes work, and well, this post is already long enough without that. If you would be interested in that, please leave a comment!!
