Ruby Regexp Part 10 - Unicode

Unicode

So far in the book, all examples used strings made up of ASCII characters only. However, the Regexp class uses the source encoding by default, and the default source encoding (and therefore the default string encoding) is UTF-8. See ruby-doc: Encoding for details on working with different string encodings.
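
To check which encodings are in play, the following sketch inspects a string and two regexp literals. The variable names are illustrative and not from the book, and the results noted in the comments assume a source file without an encoding magic comment.

```ruby
# Illustrative names; assumes no magic comment changes the source encoding.
text = 'fox:αλεπού'
ascii_only = /\w+/      # pattern written with ASCII characters only
greek = /αλεπού/        # pattern containing non-ASCII characters

puts __ENCODING__               # source encoding, UTF-8 by default
puts text.encoding              # string literals take the source encoding
puts ascii_only.encoding        # US-ASCII
puts ascii_only.fixed_encoding? # false, so it can still match UTF-8 strings
puts greek.encoding             # same as the source encoding
puts greek.fixed_encoding?      # true
```

A regexp that is not fixed to an encoding adapts to the string it is matched against, which is why ASCII-only patterns work on UTF-8 strings without any modifier.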

Encoding modifiers

Modifiers can be used to override the encoding. For example, the n modifier uses ASCII-8BIT instead of the source encoding.

# example with ASCII characters only
>> 'foo - baz'.gsub(/\w+/n, '(\0)')
=> "(foo) - (baz)"

# example with non-ASCII characters as well
>> 'fox:αλεπού'.scan(/\w+/n)
(irb):2: warning: historical binary regexp match /.../n against UTF-8 string
=> ["fox"]

Character set escapes like \w match only ASCII characters by default, whereas named character sets such as [[:word:]] are Unicode aware. You can also use the (?u) inline modifier to let character set escapes match Unicode characters.

>> 'fox:αλεπού'.scan(/\w+/)
=> ["fox"]

>> 'fox:αλεπού'.scan(/[[:word:]]+/)
=> ["fox", "αλεπού"]

>> 'fox:αλεπού'.scan(/(?u)\w+/)
=> ["fox", "αλεπού"]
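
One consequence worth spelling out is that the negated escape \W is the complement of the ASCII-only \w, so by default it also matches non-ASCII letters. A short sketch using the same sample string (the variable names are illustrative):

```ruby
text = 'fox:αλεπού'

# by default \W is everything outside [a-zA-Z0-9_], Greek letters included
default_non_word = text.scan(/\W+/)         # [":αλεπού"]
# with (?u), Greek letters count as word characters
unicode_non_word = text.scan(/(?u)\W+/)     # [":"]
# negating the named character set behaves the same way
posix_non_word = text.scan(/[^[:word:]]+/)  # [":"]

p default_non_word, unicode_non_word, posix_non_word

# practical effect when stripping "non-word" characters
stripped_default = text.gsub(/\W+/, '')      # "fox"
stripped_unicode = text.gsub(/(?u)\W+/, '')  # "foxαλεπού"
p stripped_default, stripped_unicode
```

So a cleanup such as gsub(/\W+/, '') silently drops non-ASCII text unless (?u) or a Unicode-aware set is used.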

See ruby-doc: Regexp Encoding for other such modifiers and details.

Unicode character sets

Similar to named character classes and escape sequences, the \p{} construct offers various predefined sets that work with Unicode strings. See ruby-doc: Character Properties for the full list and details.

# extract all consecutive letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{L}+/)
=> ["fox", "αλεπού", "eagle", "αετός"]
# extract all consecutive Greek letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{Greek}+/)
=> ["αλεπού", "αετός"]

# extract all words
>> 'φοο12,βτ_4,foo'.scan(/\p{Word}+/)
=> ["φοο12", "βτ_4", "foo"]

# delete all characters other than letters
# \p{^L} can also be used instead of \P{L}
>> 'φοο12,βτ_4,foo'.gsub(/\P{L}+/, '')
=> "φοοβτfoo"
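
As a small sketch of the negation forms mentioned in the comment above, the snippet below checks that \P{L}, \p{^L} and a negated character class give the same result, and shows that properties can be combined inside a character class. The variable names are illustrative.

```ruby
text = 'φοο12,βτ_4,foo'

# three equivalent ways to negate a property
with_upper_p = text.gsub(/\P{L}+/, '')
with_caret   = text.gsub(/\p{^L}+/, '')
with_class   = text.gsub(/[^\p{L}]+/, '')
p [with_upper_p, with_caret, with_class].uniq.size   # 1

# properties can be combined inside a character class:
# letters or decimal digits, but not the underscore
alnum_runs = text.scan(/[\p{L}\p{Nd}]+/)
p alnum_runs   # ["φοο12", "βτ", "4", "foo"]
```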

Codepoints and Unicode escapes

For generic Unicode character ranges, specify codepoints using the \u{} construct. The snippet below also shows how to get codepoints (the numerical value of each character) in Ruby.

# to get codepoints from string
>> 'fox:αλεπού'.codepoints.map { |i| '%x' % i }
=> ["66", "6f", "78", "3a", "3b1", "3bb", "3b5", "3c0", "3bf", "3cd"]
# one or more codepoints can be specified inside \u{}
>> puts "\u{66 6f 78 3a 3b1 3bb 3b5 3c0 3bf 3cd}"
fox:αλεπού

# character range example using \u{}
# all English lowercase letters
>> 'fox:αλεπού,eagle:αετός'.scan(/[\u{61}-\u{7a}]+/)
=> ["fox", "eagle"]
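
Going between a single character and its codepoint, and a caveat about codepoint ranges, can be sketched as follows. The word is built from the codepoints listed earlier; the range α to ω (0x3b1 to 0x3c9) is chosen here for illustration and is not from the book.

```ruby
# build the word from codepoints so every character is unambiguous
word = "\u{3b1 3bb 3b5 3c0 3bf 3cd}"   # αλεπού

# character <-> codepoint conversions
p word[0].ord                    # 945, which is 0x3b1
p 0x3b1.chr(Encoding::UTF_8)     # "α"
p [0x3b1, 0x3bb].pack('U*')      # "αλ"

# α to ω as a codepoint range: the accented ύ (0x3cd) falls outside it
basic_range = "fox:#{word}".scan(/[\u{3b1}-\u{3c9}]+/)
p basic_range                    # ["αλεπο"]

# a script property covers the whole word
by_script = "fox:#{word}".scan(/\p{Greek}+/)
p by_script                      # ["αλεπού"]
```

Hand-picked codepoint ranges are easy to get subtly wrong, so a \p{} property is usually the safer choice when one fits.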

See also: codepoints, a site dedicated to Unicode characters.

\X vs dot metacharacter

Some characters are made up of more than one codepoint. Unicode handles these as grapheme clusters. The dot metacharacter matches only one codepoint at a time, whereas \X matches a whole character even if it consists of multiple codepoints.

>> 'g̈'.codepoints.map { |i| '%x' % i }
=> ["67", "308"]
>> puts "\u{67 308}"
g̈

>> 'cag̈ed'.sub(/a.e/, 'o')
=> "cag̈ed"
>> 'cag̈ed'.sub(/a..e/, 'o')
=> "cod"

>> 'cag̈ed'.sub(/a\Xe/, 'o')
=> "cod"
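
The same distinction shows up when counting or reversing characters. A minimal sketch, assuming Ruby 2.5 or later for String#grapheme_clusters (variable names are illustrative):

```ruby
word = "cag\u{308}ed"   # g followed by U+0308 (combining diaeresis)

p word.length                          # 6, counts codepoints
clusters = word.scan(/\X/)
p clusters.length                      # 5, counts user-perceived characters
p word.grapheme_clusters == clusters   # true

# reversing codepoints moves the diaeresis onto the wrong letter
naive = word.reverse
# reversing clusters keeps it attached to the g
safe = clusters.reverse.join
puts naive, safe
```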

Another difference is that \X matches newline characters by default, whereas the dot metacharacter needs the m modifier to do so.

>> "he\nat".sub(/e.a/, 'ea')
=> "he\nat"
>> "he\nat".sub(/e.a/m, 'ea')
=> "heat"

>> "he\nat".sub(/e\Xa/, 'ea')
=> "heat"
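
A related detail, sketched here under the assumption of a Ruby version where \X implements extended grapheme clusters (Ruby 2.5 or later should qualify): a carriage return followed by a line feed is treated as a single cluster, while the dot sees two separate characters even with the m modifier.

```ruby
text = "a\r\nb"

by_dot     = text.scan(/./m)   # one match per codepoint
by_cluster = text.scan(/\X/)   # "\r\n" stays together

p by_dot       # ["a", "\r", "\n", "b"]
p by_cluster   # ["a", "\r\n", "b"]
```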

Exercises

For practice problems, see the Exercises.md file in this book's repository on GitHub.
