Sundeep

Posted on Apr 3, 2019 • Edited on Jan 13, 2021 • Originally published at learnbyexample.github.io

Ruby Regexp Part 8 - Lookarounds

#ruby #regex #ebook #lookarounds

Lookarounds

You've already seen how to create custom character classes and various avatars of special groupings. In this chapter you'll learn about more groupings, known as lookarounds, that help to create custom anchors and add conditions within a regexp definition. These assertions are also known as zero-width patterns because they add restrictions similar to anchors and are not part of the matched portions. You will also learn how to negate a grouping, similar to negated character sets, and what's special about the \G anchor.

Conditional expressions

Before you get too used to lookarounds, it is good to remember that Ruby is a programming language. You have control structures and you can combine multiple conditions using logical operators, methods like all?, any?, etc. Also, do not forget that regular expressions are only one of the tools available for string processing.

>> items = ['1,2,3,4', 'a,b,c,d', '#foo 123']

# keep elements containing both a digit and a '#' character
>> items.filter { |s| s.match?(/\d/) && s.include?('#') }
=> ["#foo 123"]

# modify elements only if they don't start with '#'
>> items.filter_map { |s| s.sub(/,.+,/, ' ') if s[0] != '#' }
=> ["1 4", "a d"]
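The same idea scales to any number of conditions. Here's a minimal sketch that keeps the conditions in an array and combines them with all? and any?. The `patterns` array is a name made up for this illustration and the two conditions are the ones from the example above.

```ruby
items = ['1,2,3,4', 'a,b,c,d', '#foo 123']

# hypothetical list of conditions, introduced only for this sketch
patterns = [/\d/, /#/]

# elements that satisfy every condition
all_match = items.select { |s| patterns.all? { |re| s.match?(re) } }

# elements that satisfy at least one condition
any_match = items.select { |s| patterns.any? { |re| s.match?(re) } }

p all_match   # ["#foo 123"]
p any_match   # ["1,2,3,4", "#foo 123"]
```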

Negative lookarounds

Lookaround assertions come in two forms: lookbehind and lookahead. Each of these can be a positive or a negative assertion. In terms of syntax, lookbehind has an extra < compared to the lookahead version. Negative lookarounds can be identified by the use of ! whereas = is used for positive lookarounds. This section is about negative lookarounds, whose complete syntax is shown below.

(?!pat) for negative lookahead assertion
(?<!pat) for negative lookbehind assertion

As mentioned earlier, lookarounds are not part of matched portions and do not capture the matched text.

# change 'foo' only if it is not followed by a digit character
# note that end of string satisfies the given assertion
# 'foofoo' has two matches as the assertion doesn't consume characters
>> 'hey food! foo42 foot5 foofoo'.gsub(/foo(?!\d)/, 'baz')
=> "hey bazd! foo42 bazt5 bazbaz"

# change 'foo' only if it is not preceded by _
# note how 'foo' at start of string is matched as well
>> 'foo _foo 42foofoo'.gsub(/(?<!_)foo/, 'baz')
=> "baz _foo 42bazbaz"

# overlap example
# the final _ was consumed by the '42foo_' match and it also
# played a part in the assertion that rejected the 'foot' after it
>> 'food _fool 42foo_foot'.gsub(/(?<!_)foo./, 'baz')
=> "baz _fool 42bazfoot"
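To see why a zero-width assertion matters here, it may help to compare a negative lookahead with a negated character class. This is only a sketch with a made-up input string: the character class has to consume a character, so it fails at the end of the string and also replaces the extra character.

```ruby
s = 'foo42 foot foo'

# negated character class: needs a character after 'foo' and replaces it too
with_class = s.gsub(/foo[^\d]/, 'baz')

# negative lookahead: only checks what follows, nothing extra is consumed
with_lookahead = s.gsub(/foo(?!\d)/, 'baz')

p with_class       # "foo42 baz foo"
p with_lookahead   # "foo42 bazt baz"
```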

Lookarounds can be mixed with already existing anchors and other features to define truly powerful restrictions.

# change whole word only if it is not preceded by : or --
>> ':cart apple --rest ;tea'.gsub(/(?<!:|--)\b\w+/, 'X')
=> ":cart X --rest ;X"

# extract whole words not surrounded by punctuation marks
>> 'tie. ink east;'.scan(/(?<![[:punct:]])\b\w+\b(?![[:punct:]])/)
=> ["ink"]

# add space to word boundaries, but not at start or end of string
# similar to: gsub(/\b/, ' ').strip
>> 'foo_baz=num1+35*42/num2'.gsub(/(?<!\A)\b(?!\z)/, ' ')
=> "foo_baz = num1 + 35 * 42 / num2"

In all the examples so far, the lookahead grouping was placed as a suffix and the lookbehind as a prefix. This is how they are used most of the time, but it is not the only way to use them. Lookarounds can be placed anywhere and multiple lookarounds can be combined in any order. They do not consume characters nor do they play a role in matched portions. They just let you know whether the condition you want to test is satisfied from the current location in the input string.

# these two are equivalent
# replace a character as long as it is not preceded by 'p' or 'r'
>> 'spare'.gsub(/(?<![pr])./, '*')
=> "**a*e"
>> 'spare'.gsub(/.(?<![pr].)/, '*')
=> "**a*e"

# replace 'par' as long as 's' is not present later in the input
# the two forms are equivalent only because 's' cannot occur within 'par'
# with 'r' instead of 's', the second form would see the 'r' in 'par' itself
>> 'par spare part party'.gsub(/par(?!.*s)/, 'X')
=> "par sXe Xt Xty"
>> 'par spare part party'.gsub(/(?!.*s)par/, 'X')
=> "par sXe Xt Xty"

# since the three assertions used here are all zero-width,
# all of the 6 possible combinations will be equivalent
>> 'foo_baz=num1+35*42/num2'.gsub(/(?!\z)\b(?<!\A)/, ' ')
=> "foo_baz = num1 + 35 * 42 / num2"
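One way to convince yourself that lookarounds don't consume characters is to inspect the MatchData object. The input string below is made up for this sketch; pre_match and post_match show that the text tested by the assertions stays outside the matched portion.

```ruby
m = 'hey foo42'.match(/(?<=\s)foo(?=\d)/)

p m[0]           # "foo"
p m.pre_match    # "hey "
p m.post_match   # "42"
```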

Positive lookarounds

Unlike negative lookarounds, absence of something will not satisfy positive lookarounds. Instead, for the condition to be satisfied, the pattern has to match actual characters and/or zero-width assertions. Positive lookarounds can be identified by the use of = in the grouping. The complete syntax looks like:

(?=pat) for positive lookahead assertion
(?<=pat) for positive lookbehind assertion

# extract digits only if they are followed by ,
# note that end of string doesn't qualify as this is a positive assertion
>> '42 foo-5, baz3; x83, y-20; f12'.scan(/\d+(?=,)/)
=> ["5", "83"]

# extract digits only if they are preceded by - and followed by ; or :
>> '42 foo-5, baz3; x83, y-20; f12'.scan(/(?<=-)\d+(?=[;:])/)
=> ["20"]

# replace 'par' as long as 'part' occurs as a whole word later in the line
>> 'par spare part party'.gsub(/par(?=.*\bpart\b)/, 'X')
=> "X sXe part party"
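The difference at the end of the string is easy to trip over, so here's a small sketch (the variable names are made up) that contrasts a negative lookahead with its seemingly equivalent positive version.

```ruby
s = 'foo'

# negative lookahead: nothing follows 'foo', so "not a digit" holds
neg = s.match?(/foo(?!\d)/)

# positive lookahead: needs an actual non-digit character after 'foo'
pos = s.match?(/foo(?=\D)/)

# adding \z as an alternative restores the end of string case
pos_or_end = s.match?(/foo(?=\D|\z)/)

p neg          # true
p pos          # false
p pos_or_end   # true
```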

Lookarounds can be quite handy in field based processing.

# except first and last fields
>> '1,two,3,four,5'.scan(/(?<=,)[^,]+(?=,)/)
=> ["two", "3", "four"]

# replace empty fields with nil
# note that in this case, order of lookbehind and lookahead doesn't matter
# can also use: gsub(/(?<![^,])(?![^,])/, 'nil')
>> ',1,,,two,3,,'.gsub(/(?<=\A|,)(?=,|\z)/, 'nil')
=> "nil,1,nil,nil,two,3,nil,nil"

# surround all fields (which can be empty too) with {}
# there is an extra empty string match at end of non-empty columns
>> ',cat,tiger'.gsub(/[^,]*/, '{\0}')
=> "{},{cat}{},{tiger}{}"
# lookarounds to the rescue
>> ',cat,tiger'.gsub(/(?<=\A|,)[^,]*/, '{\0}')
=> "{},{cat},{tiger}"
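As mentioned at the start of the chapter, regexp isn't the only tool. The first two examples above can be cross-checked with split, as sketched below. The variable names are made up for this illustration, and the limit of -1 is needed so that split keeps trailing empty fields.

```ruby
row = '1,two,3,four,5'
inner_regexp = row.scan(/(?<=,)[^,]+(?=,)/)
# drop the first and last fields after splitting
inner_split = row.split(',')[1..-2]
p inner_regexp == inner_split   # true

sparse = ',1,,,two,3,,'
nil_regexp = sparse.gsub(/(?<=\A|,)(?=,|\z)/, 'nil')
# limit of -1 keeps the trailing empty fields
nil_split = sparse.split(',', -1).map { |f| f.empty? ? 'nil' : f }.join(',')
p nil_regexp == nil_split       # true
```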

Capture groups inside positive lookarounds

Even though lookarounds are not part of matched portions, capture groups can be used inside positive lookarounds. Can you reason out why it won't work for negative lookarounds?

# note also the use of double quoted string in replacement section
>> puts 'a b c d e'.gsub(/(\S+\s+)(?=(\S+)\s)/, "\\1\\2\n")
a b
b c
c d
d e

# and of course, use non-capturing group where needed
>> 'pore42 car3 pare7 care5'.scan(/(?<=(po|ca)re)\d+/)
=> [["po"], ["ca"]]
>> 'pore42 car3 pare7 care5'.scan(/(?<=(?:po|ca)re)\d+/)
=> ["42", "5"]
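Another use of a capture group inside a positive lookahead is getting overlapping matches with scan. In this sketch (the input is made up) the regexp itself matches only an empty string at each position, and the capture group carries the text of interest.

```ruby
# every pair of adjacent word characters, including overlapping ones
pairs = 'abcde'.scan(/(?=(\w\w))/).flatten

p pairs   # ["ab", "bc", "cd", "de"]
```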

AND conditional with lookarounds

As promised earlier, here's how lookarounds make it simpler to construct AND conditionals.

>> words = %w[sequoia subtle questionable exhibit equation]

# words containing 'b' and 'e' and 't' in any order
# same as: /b.*e.*t|b.*t.*e|e.*b.*t|e.*t.*b|t.*b.*e|t.*e.*b/
>> words.grep(/(?=.*b)(?=.*e).*t/)
=> ["subtle", "questionable", "exhibit"]

# words containing all vowels in any order
>> words.grep(/(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u/)
=> ["sequoia", "questionable", "equation"]

# words containing 'a' and 'q' but not 'n' at the end of the element
>> words.grep(/(?=.*a)(?=.*q)(?!.*n\z)/)
=> ["sequoia", "questionable"]
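For comparison, here's how the first example could look with plain Ruby methods instead of lookarounds. This is only a sketch to cross-check the regexp version; the `letters` array is a name made up for this illustration.

```ruby
words = %w[sequoia subtle questionable exhibit equation]

# hypothetical list of required letters, same as the lookahead example
letters = %w[b e t]

with_lookahead = words.grep(/(?=.*b)(?=.*e).*t/)
with_all = words.select { |w| letters.all? { |c| w.include?(c) } }

p with_lookahead   # ["subtle", "questionable", "exhibit"]
p with_all         # ["subtle", "questionable", "exhibit"]
```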

Emulating positive lookbehind with \K

Some of the variable length positive lookbehind cases can be simulated by using \K as a suffix to the regexp that would otherwise be needed as the lookbehind assertion. Unlike lookarounds, \K isn't a zero-width assertion: the characters matched before it are consumed, they are just excluded from the matched portion.

# similar to: /(?<=\b\w)\w*\W*/
# text matched before \K won't be part of the matching portion
>> 'sea eat car rat eel tea'.gsub(/\b\w\K\w*\W*/, '')
=> "secret"

# replace only 3rd occurrence of 'cat'
>> 'cat scatter cater scat'.sub(/(cat.*?){2}\Kcat/, 'X')
=> "cat scatter Xer scat"
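Inspecting the MatchData object shows what \K does to the matched portion. In this sketch (the input is made up) the text before \K is dropped from the overall match, but a capture group placed before \K is still available.

```ruby
m = 'foobar'.match(/(foo)\Kbar/)

p m[0]         # "bar"
p m[1]         # "foo"
# the reported match starts after the text consumed before \K
p m.begin(0)   # 3
```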

Here's another example that won't work if a greedy quantifier is used instead of a possessive quantifier.

>> row = '421,foo,2425,42,5,foo,6,6,42'

# similar to: row.split(',').uniq.join(',')
# possessive quantifier used to ensure partial column is not captured
# if a column has same text as another column, the latter column is deleted
>> nil while row.gsub!(/(?<=\A|,)([^,]++).*\K,\1(?=,|\z)/, '')
=> nil
>> row
=> "421,foo,2425,42,5,6"
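To see what goes wrong with the greedy version, consider a smaller row made up for this sketch. With [^,]+ the engine can backtrack and capture only '42' from the first column '421', which then wrongly treats the second column as a duplicate. sub is used here since a single replacement is enough to show the difference.

```ruby
row = '421,42'

# greedy: the capture group backtracks from '421' to '42'
greedy = row.sub(/(?<=\A|,)([^,]+).*\K,\1(?=,|\z)/, '')

# possessive: the capture group always holds a complete column
possessive = row.sub(/(?<=\A|,)([^,]++).*\K,\1(?=,|\z)/, '')

p greedy       # "421"
p possessive   # "421,42"
```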

Don't use \K with gsub or scan if the string to match after \K can be empty. This is due to how the regexp engine has been implemented; other libraries like PCRE don't have this limitation. See stackoverflow: \K in ruby for some more details on this topic.

# [^,]*+ can match empty field, so use lookaround instead of \K
>> ',cat,tiger'.gsub(/(?<=\A|,)[^,]*+/, '{\0}')
=> "{},{cat},{tiger}"
>> ',cat,tiger'.gsub(/(?:\A|,)\K[^,]*+/, '{\0}')
=> "{},cat,{tiger}"

# another example with nothing to be matched after \K
>> 'abcd 123456'.gsub(/(?<=\w)/, ':')
=> "a:b:c:d: 1:2:3:4:5:6:"
>> 'abcd 123456'.gsub(/\w/, '\0:')
=> "a:b:c:d: 1:2:3:4:5:6:"
>> 'abcd 123456'.gsub(/\w\K/, ':')
=> "a:bc:d 1:23:45:6"

Variable length lookbehind

The pattern used for a lookbehind assertion (either positive or negative) cannot imply matching a variable length of text. Using a fixed length quantifier or alternations of different lengths (with each alternative being of fixed length) is allowed. For some reason, alternations of different lengths inside a group are not allowed. Here are some examples to clarify these points:

>> s = 'pore42 tar3 dare7 care5'

# allowed
>> s.scan(/(?<=(?:po|da)re)\d+/)
=> ["42", "7"]
>> s.scan(/(?<=\b[a-z]{4})\d+/)
=> ["42", "7", "5"]
>> s.scan(/(?<!tar|dare)\d+/)
=> ["42", "5"]

# not allowed
>> s.scan(/(?<=(?:o|ca)re)\d+/)
SyntaxError ((irb):4: invalid pattern in look-behind
>> s.scan(/(?<=\b[a-z]+)\d+/)
SyntaxError ((irb):5: invalid pattern in look-behind
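A regexp literal with an invalid lookbehind fails when the code is parsed. If the pattern is built from a string at runtime, Regexp.new raises RegexpError instead, which can be rescued. Here's a minimal sketch; `valid_regexp?` is a hypothetical helper made up for this illustration, and the results shown assume a Ruby version that rejects variable length lookbehind as described above.

```ruby
# hypothetical helper, not part of Ruby
def valid_regexp?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

# single quotes keep the backslashes intact
p valid_regexp?('(?<=\b[a-z]{4})\d+')   # true
p valid_regexp?('(?<=\b[a-z]+)\d+')     # false
```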

There are various workarounds possible depending upon the use case. Some of the positive lookbehind cases can be solved using \K as seen in the previous section, but \K isn't a zero-width assertion. For some cases, you can skip lookbehind entirely and work around it with normal groupings. This works even when you don't know the length of the patterns.

>> s = 'pore42 tar3 dare7 care5'

# examples where lookbehind won't give error
# same as: s.scan(/(?<=tar|dare)\d+/)
>> s.gsub(/(?:tar|dare)(\d+)/).map { $1 }
=> ["3", "7"]
# delete digits only if they are preceded by 'tar' or 'dare'
# same as: s.gsub(/(?<=tar|dare)\d+/, '')
>> s.gsub(/(tar|dare)\d+/, '\1')
=> "pore42 tar dare care5"

# examples where lookbehind will give error
# workaround for /(?<=\b[pd][a-z]*)\d+/
# get digits only if they are preceded by a word starting with 'p' or 'd'
>> s.gsub(/\b[pd][a-z]*(\d+)/).map { $1 }
=> ["42", "7"]
# delete digits only if they are preceded by a word starting with 'p' or 'd'
>> s.gsub(/(\b[pd][a-z]*)\d+/, '\1')
=> "pore tar3 dare care5"
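For the extraction cases, scan with a capture group is another option, since scan returns only the captured portions when the regexp has groups. This is a sketch of an alternative to the gsub and map combination above, not a replacement for it.

```ruby
s = 'pore42 tar3 dare7 care5'

# scan returns an array of arrays when capture groups are used
digits = s.scan(/\b[pd][a-z]*(\d+)/).flatten

p digits   # ["42", "7"]
```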

However, if you don't know the lengths for negative lookbehind, you cannot use the above workarounds. The next section will show how to negate a grouping, and that helps for some of the variable length negative lookbehind cases.

Negated groups and absence operator

Variable length negative lookbehind can be simulated using negative lookahead (which doesn't have a restriction on variable length) inside a grouping and applying a quantifier to match characters one by one. This also showcases how a grouping can be negated in certain cases. Note that this will only work if you have well defined conditions before the negated group.

# match 'dog' only if it is not preceded by 'cat'
# note the use of \A anchor to force matching all characters up to 'dog'
# cannot use /(?<!cat.*)dog/ as variable length lookbehind is not allowed
>> 'fox,cat,dog,parrot'.match?(/\A((?!cat).)*dog/)
=> false
# match 'dog' only if it is not preceded by 'parrot'
>> 'fox,cat,dog,parrot'.match?(/\A((?!parrot).)*dog/)
=> true

# easier to understand by checking matched portion
>> 'fox,cat,dog,parrot'[/\A((?!cat).)*/]
=> "fox,"
>> 'fox,cat,dog,parrot'[/\A((?!parrot).)*/]
=> "fox,cat,dog,"
>> 'fox,cat,dog,parrot'[/\A(?:(?!(.)\1).)*/]
=> "fox,cat,dog,pa"
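If you need this construct often, it could be wrapped in a method. The sketch below assumes literal strings (hence Regexp.escape) and single line input, since . doesn't match newline characters by default. `not_preceded_by` is a hypothetical name made up for this illustration.

```ruby
# hypothetical helper: matches target only if forbidden
# doesn't occur anywhere before it in the string
def not_preceded_by(forbidden, target)
  /\A(?:(?!#{Regexp.escape(forbidden)}).)*#{Regexp.escape(target)}/
end

s = 'fox,cat,dog,parrot'

p s.match?(not_preceded_by('cat', 'dog'))      # false
p s.match?(not_preceded_by('parrot', 'dog'))   # true
```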

There's an alternate syntax that can be used for cases where the grouping to be negated is bound on both sides by another regexp, anchor, etc. It is known as the absence operator and the syntax is (?~pat).

# match if 'do' is not there between 'at' and 'par'
# note that quantifier is not used, absence operator takes care of it
# same as: /at((?!do).)*par/
>> 'fox,cat,dog,parrot'.match?(/at(?~do)par/)
=> false

# match if 'go' is not there between 'at' and 'par'
>> 'fox,cat,dog,parrot'.match?(/at(?~go)par/)
=> true
>> 'fox,cat,dog,parrot'[/at(?~go)par/]
=> "at,dog,par"
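Here's a sketch that cross-checks the absence operator against the negated group form for the two examples above, followed by a commonly cited use case: matching C style comments. The `code` string is made up for this illustration, and the absence operator needs a Ruby version that supports it.

```ruby
s = 'fox,cat,dog,parrot'

absence = [s.match?(/at(?~do)par/), s.match?(/at(?~go)par/)]
negated = [s.match?(/at(?:(?!do).)*par/), s.match?(/at(?:(?!go).)*par/)]
p absence == negated   # true

# '/*' followed by text not containing '*/', followed by '*/'
code = 'x = 1; /* a */ y = 2; /* b */'
comments = code.scan(%r{/\*(?~\*/)\*/})
p comments   # ["/* a */", "/* b */"]
```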

\G anchor

The \G anchor restricts matching to the start of the string, like the \A anchor. In addition, after a match is done, the end of that match is considered as the new anchor location. This process repeats until the given regexp fails to match (assuming multiple matches with methods like scan and gsub).

# all non-whitespace characters from start of string
>> '123-87-593 42 foo'.scan(/\G\S/)
=> ["1", "2", "3", "-", "8", "7", "-", "5", "9", "3"]
>> '123-87-593 42 foo'.gsub(/\G\S/, '*')
=> "********** 42 foo"

# all digits and optional hyphen combo from start of string
>> '123-87-593 42 foo'.scan(/\G\d+-?/)
=> ["123-", "87-", "593"]
>> '123-87-593 42 foo'.gsub(/\G(\d+)(-?)/, '(\1)\2')
=> "(123)-(87)-(593) 42 foo"

# all word characters from start of string
# only if it is followed by word character
>> 'cat12 bat pin'.gsub(/\G\w(?=\w)/, '\0:')
=> "c:a:t:1:2 bat pin"

# all lowercase letters or spaces from start of string
>> 'par tar-den hen-food mood'.gsub(/\G[a-z ]/, '(\0)')
=> "(p)(a)(r)( )(t)(a)(r)-den hen-food mood"
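Comparing the same regexp with and without \G makes the effect of the anchor clearer. In this sketch (the input is made up) the ; breaks the chain of consecutive matches, so the anchored version stops there while the unanchored version carries on.

```ruby
s = 'a=1,b=2;c=3'

# with \G, each match has to start where the previous one ended
anchored = s.scan(/\G(\w+)=(\d+),?/)

# without \G, the engine is free to skip over the ';'
unanchored = s.scan(/(\w+)=(\d+),?/)

p anchored     # [["a", "1"], ["b", "2"]]
p unanchored   # [["a", "1"], ["b", "2"], ["c", "3"]]
```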