Ruby Regexp Part 10 - Unicode

Sundeep
Addicted to writing books and teaching. Likes fantasy books, comics and anime
Originally published at learnbyexample.github.io

Unicode

So far in the book, all examples used strings made up of ASCII characters only. However, the Regexp class uses the source encoding by default, and the default string encoding is UTF-8. See ruby-doc: Encoding for details on working with different string encodings.
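
To make the default encoding concrete, here is a minimal sketch. It assumes the script is saved as UTF-8 (the default source encoding in Ruby 2.0 and later); the constant name WORD is made up for this illustration.

```ruby
# Assumption: this file is saved as UTF-8, the default source encoding in Ruby 2.0+.
WORD = 'αλεπού'

puts __ENCODING__    # source encoding of this file
puts WORD.encoding   # string literals take the source encoding
puts WORD.length     # number of characters
puts WORD.bytesize   # number of bytes; each of these Greek letters needs 2 bytes in UTF-8
```

The character count and the byte count differ for this string, which is a quick way to check that a string holds non-ASCII data.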

Encoding modifiers

Modifiers can be used to override the encoding. For example, the n modifier uses ASCII-8BIT instead of the source encoding.

# example with ASCII characters only
>> 'foo - baz'.gsub(/\w+/n, '(\0)')
=> "(foo) - (baz)"

# example with non-ASCII characters as well
>> 'fox:αλεπού'.scan(/\w+/n)
(irb):2: warning: historical binary regexp match /.../n against UTF-8 string
=> ["fox"]

Character set escapes like \w match only ASCII characters, whereas named character sets are Unicode aware. You can also use the (?u) inline modifier to let character set escapes match Unicode characters.

>> 'fox:αλεπού'.scan(/\w+/)
=> ["fox"]

>> 'fox:αλεπού'.scan(/[[:word:]]+/)
=> ["fox", "αλεπού"]

>> 'fox:αλεπού'.scan(/(?u)\w+/)
=> ["fox", "αλεπού"]

See ruby-doc: Regexp Encoding for other such modifiers and details.
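
If you want to check which encoding a pattern ended up with, Regexp#encoding and Regexp#fixed_encoding? can help. The sketch below is only an illustration: the constant names are made up, and the u modifier (one of the other modifiers listed in ruby-doc) is used as an assumption about what you might want to compare against.

```ruby
# Constant names are hypothetical, chosen only for this sketch.
PLAIN  = /\w+/        # ASCII-only pattern, no encoding modifier
WITH_U = /\w+/u       # u modifier fixes the pattern's encoding to UTF-8
GREEK  = /\u{3b1}+/   # pattern containing a non-ASCII character (α)

{ 'PLAIN' => PLAIN, 'WITH_U' => WITH_U, 'GREEK' => GREEK }.each do |name, re|
  puts "#{name}: encoding=#{re.encoding} fixed_encoding?=#{re.fixed_encoding?}"
end
```

A pattern for which fixed_encoding? returns false can be applied to strings in any ASCII-compatible encoding, whereas a fixed pattern is tied to its own encoding.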

Unicode character sets

Similar to named character classes and escape sequences, the \p{} construct offers various predefined sets that work with Unicode strings. See ruby-doc: Character Properties for the full list and details.

# extract all consecutive letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{L}+/)
=> ["fox", "αλεπού", "eagle", "αετός"]
# extract all consecutive Greek letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{Greek}+/)
=> ["αλεπού", "αετός"]

# extract all words
>> 'φοο12,βτ_4,foo'.scan(/\p{Word}+/)
=> ["φοο12", "βτ_4", "foo"]

# delete all characters other than letters
# \p{^L} can also be used instead of \P{L}
>> 'φοο12,βτ_4,foo'.gsub(/\P{L}+/, '')
=> "φοοβτfoo"
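
Script names can be combined in the same way as the examples above. Here is a small sketch that groups words by script; the helper name words_by_script and the constant SAMPLE are made up for this illustration, and \p{Latin} is used as one more script property alongside \p{Greek}.

```ruby
SAMPLE = 'fox:αλεπού,eagle:αετός'

# Hypothetical helper: collect runs of characters by the script they belong to.
def words_by_script(str)
  {
    latin: str.scan(/\p{Latin}+/),
    greek: str.scan(/\p{Greek}+/)
  }
end

p words_by_script(SAMPLE)

# \p{^Greek} negates the set, same as \P{Greek}:
# replace every non-Greek run with a space, then split on whitespace
p SAMPLE.gsub(/\p{^Greek}+/, ' ').split
```

Both approaches give the same Greek words for this input; scan is usually the more direct choice when you only need the matches.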

Codepoints and Unicode escapes

For generic Unicode character ranges, specify codepoints using the \u{} construct. The snippet below also shows how to get codepoints (the numerical value of each character) in Ruby.

# to get codepoints from string
>> 'fox:αλεπού'.codepoints.map { |i| '%x' % i }
=> ["66", "6f", "78", "3a", "3b1", "3bb", "3b5", "3c0", "3bf", "3cd"]
# one or more codepoints can be specified inside \u{}
>> puts "\u{66 6f 78 3a 3b1 3bb 3b5 3c0 3bf 3cd}"
fox:αλεπού

# character range example using \u{}
# all English lowercase letters
>> 'fox:αλεπού,eagle:αετός'.scan(/[\u{61}-\u{7a}]+/)
=> ["fox", "eagle"]
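
As a further sketch of working with codepoints, the snippet below converts between characters and codepoints and builds a range for a whole Unicode block. The helper name hex_codepoints and the constant GREEK_BLOCK are made up for this illustration; the range U+0370 to U+03FF is the Greek and Coptic block.

```ruby
# Hypothetical helper: hexadecimal codepoints of every character in a string.
def hex_codepoints(str)
  str.codepoints.map { |cp| cp.to_s(16) }
end

p hex_codepoints('αω')

# Convert between a single character and its codepoint
p 'α'.ord                      # Integer codepoint of the character
p 0x3b1.chr(Encoding::UTF_8)   # back to a one-character string

# Codepoint range covering the Greek and Coptic block (U+0370 to U+03FF)
GREEK_BLOCK = /[\u{370}-\u{3ff}]+/
p 'fox:αλεπού,eagle:αετός'.scan(GREEK_BLOCK)
```

A block range like this is a blunt tool compared to \p{Greek}, since a block can include unassigned codepoints or characters from other scripts, but it works for any range you care to define.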

See also: codepoints, a site dedicated to Unicode characters.

\X vs dot metacharacter

Some characters are made up of more than one codepoint. Unicode handles these as grapheme clusters. The dot metacharacter matches only one codepoint at a time. You can use \X to match any character, even if it consists of multiple codepoints.

>> 'g̈'.codepoints.map { |i| '%x' % i }
=> ["67", "308"]
>> puts "\u{67 308}"
g̈

>> 'cag̈ed'.sub(/a.e/, 'o')
=> "cag̈ed"
>> 'cag̈ed'.sub(/a..e/, 'o')
=> "cod"

>> 'cag̈ed'.sub(/a\Xe/, 'o')
=> "cod"
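
One practical use of \X is counting characters as a reader would see them. The sketch below assumes Ruby 2.5 or later for String#grapheme_clusters; the helper name graphemes and the constant CAGED are made up for this illustration.

```ruby
# "g" followed by U+0308 (combining diaeresis) is displayed as one character
CAGED = "cag\u{308}ed"

# Hypothetical helper: split a string into grapheme clusters using \X
def graphemes(str)
  str.scan(/\X/)
end

puts CAGED.length              # counts codepoints
puts graphemes(CAGED).length   # counts grapheme clusters
p graphemes(CAGED)

# String#grapheme_clusters (Ruby 2.5+) gives the same result for this string
p CAGED.grapheme_clusters == graphemes(CAGED)
```

The two counts differ by one here because the combining mark is a separate codepoint but belongs to the same grapheme cluster as the g before it.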

Another difference is that \X matches newline characters by default, whereas the dot metacharacter does so only with the m modifier.

>> "he\nat".sub(/e.a/, 'ea')
=> "he\nat"
>> "he\nat".sub(/e.a/m, 'ea')
=> "heat"

>> "he\nat".sub(/e\Xa/, 'ea')
=> "heat"

Exercises

For practice problems, visit the Exercises.md file in this book's repository on GitHub.
