DEV Community

TIL: Unicode chars in regex

Lasse Skindstad Ebert ・1 min read

Today I needed to match some unicode chars in an Elixir regex.

TL;DR:

Use the u modifier together with \x{...}, e.g. ~r/\x{1234}/u

Matching a unicode char in a regex

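More specifically, I needed to remove all zero-width chars from a string.
The ones I cared about are U+200B (zero width space), U+200C (zero width non-joiner), U+200D (zero width joiner) and U+FEFF (zero width no-break space, also used as the byte order mark).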

Trying to use \u does not work:

iex(1)> ~r/\u200B/
** (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 1
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

Looking at the docs, it seems that \x{} is the way to go, but on its own it fails as well:

iex(1)> ~r/\x{200B}/
** (Regex.CompileError) character value in \x{} or \o{} is too large at position 7
    (elixir) lib/regex.ex:209: Regex.compile!/2
    (elixir) expanding macro: Kernel.sigil_r/2
    iex:1: (file)

The trick is to apply the unicode (u) modifier to the regex. It tells the regex compiler that we're working in Unicode rather than raw bytes, so \x{} accepts code points above 0xFF:

iex(1)> ~r/\x{200B}/u
~r/\x{200B}/u
iex(2)> "Hello,\u200BWorld!" |> String.replace(~r/\x{200B}/u, "")
"Hello,World!"

Yay!
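
To make the effect of the u modifier a bit more concrete, here is a small sketch (not from the original session) that compares what `.` matches with and without it. The variable name `subject` is only illustrative.

```elixir
# U+200B is one code point, but three bytes in UTF-8.
subject = "\u200B"
byte_size(subject)
#=> 3

# Without the u modifier the regex works on bytes,
# so `.` matches only the first byte of the character.
Regex.run(~r/./, subject)
#=> [<<226>>]

# With the u modifier, `.` matches the whole code point.
Regex.run(~r/./u, subject)
#=> ["\u200B"]
```

This is also why \x{200B} is rejected without u: in byte mode a single unit cannot hold a value that large.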

So my final regex could be something like:

~r/\x{200B}|\x{200C}|\x{200D}|\x{FEFF}/u
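
As a possible variation, the same four code points can be written as a character class instead of an alternation. The sketch below wraps it in a helper; the module name `ZeroWidth` and the function `strip/1` are hypothetical and only there to make the example self-contained.

```elixir
defmodule ZeroWidth do
  # Hypothetical helper, not part of any library.
  # Removes U+200B, U+200C, U+200D and U+FEFF from a string.
  def strip(string) when is_binary(string) do
    # The u modifier is required for \x{...} values above 0xFF.
    String.replace(string, ~r/[\x{200B}\x{200C}\x{200D}\x{FEFF}]/u, "")
  end
end

ZeroWidth.strip("He\u200Cllo,\u200BWorld!\uFEFF")
#=> "Hello,World!"
```

String.replace/3 replaces every match by default, so no extra option is needed to strip all occurrences.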

Interpolation works too

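We can also interpolate a string containing the character into a regex. This matches the same text and works even without the u modifier, because the character's UTF-8 bytes end up in the pattern literally: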

iex(5)> "Hello,\u200BWorld!" |> String.replace(~r/#{"\u200B"}/, "")
"Hello,World!"
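
If the characters to strip live in a list, the same idea can be taken one step further by building the pattern at runtime. This is only a sketch under my own assumptions: the module name `ZeroWidthPattern` and its functions are hypothetical, and I pass "u" anyway so the regex treats each entry as a whole character.

```elixir
defmodule ZeroWidthPattern do
  # Hypothetical helper, not part of any library.
  @chars ["\u200B", "\u200C", "\u200D", "\uFEFF"]

  # Builds a regex that matches any one of the characters in @chars.
  def build do
    @chars
    # Regex.escape/1 is a no-op for these characters, but keeps the
    # approach safe if a regex metacharacter is ever added to the list.
    |> Enum.map(&Regex.escape/1)
    |> Enum.join("|")
    |> Regex.compile!("u")
  end

  def strip(string), do: String.replace(string, build(), "")
end

ZeroWidthPattern.strip("Hello,\u200BWorld!")
#=> "Hello,World!"
```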
