How to remember the `m` and `s` modifier for a regular expression in JavaScript?

#javascript #regex #regularexpression #pcre

Sometimes it is difficult to remember the difference between the m and s modifiers of a regular expression. Both of them have something to do with multiple lines.

JavaScript actually helps us remember how they differ: for a long time JavaScript did not have the s modifier (it was only added in ES2018), so to work around that, we can use [^] or [\s\S], either of which matches any character:

> "begin 123 peter paul mary \n 456 apple banana end".match(/begin.*end/)
null

> "begin 123 peter paul mary \n 456 apple banana end".match(/begin[^]*end/)
[
  'begin 123 peter paul mary \n 456 apple banana end',
  index: 0,
  input: 'begin 123 peter paul mary \n 456 apple banana end',
  groups: undefined
]

> "begin 123 peter paul mary \n 456 apple banana end".match(/begin[\s\S]*end/)
[
  'begin 123 peter paul mary \n 456 apple banana end',
  index: 0,
  input: 'begin 123 peter paul mary \n 456 apple banana end',
  groups: undefined
]

In the examples above, we want to match a string that starts with the word begin and ends with the word end, with any characters in between. In the first example, the . cannot match the newline character. In the second and third examples, [^] and [\s\S] can match any character, including the newline character. [^] is a character class that excludes no character: [^abcde] matches any character except a through e, and [^] excludes nothing. [\s\S] is also a character class. It includes all white space characters and all non-white-space characters, and that means all characters.

So we can remember that s is the modifier JavaScript was missing, and that [^] and [\s\S] are the workaround. Also, in general it is not accurate to say that . matches all characters. It is more accurate to say that . matches all characters except the line terminator characters. In particular, without the s modifier it won't match the following characters in JavaScript:

U+000A LINE FEED (LF) ("\n")
U+000D CARRIAGE RETURN (CR) ("\r")
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
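
As a quick check of the list above, the following sketch tests . against each of the four line terminators, and against a tab for comparison. The variable names are only for illustration.

```javascript
// The four characters that `.` refuses to match when the s flag is absent.
const lineTerminators = ["\n", "\r", "\u2028", "\u2029"];

// /^.$/ asks: "is this string exactly one character that `.` accepts?"
const dotMatches = lineTerminators.map(ch => /^.$/.test(ch));
console.log(dotMatches); // [ false, false, false, false ]

// [^] accepts every one of them.
const anyMatches = lineTerminators.map(ch => /^[^]$/.test(ch));
console.log(anyMatches); // [ true, true, true, true ]

// Other whitespace, such as a tab, is matched by `.` as usual.
const tabMatches = /^.$/.test("\t");
console.log(tabMatches); // true
```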

If we use the u modifier, . and [^] each match one whole Unicode character (code point), even one outside the Basic Multilingual Plane such as an emoji:

> [..."😀 设计制作 電影 →⇒⇄↻ ▲ n² x³ ∂ ∫ ∲ hi🐶".matchAll(/./gu)].map(m => m[0])
[
  '😀', ' ',  '设', '计', '制', '作',
  ' ',  '電',  '影', ' ', '→', '⇒',
  '⇄',  '↻',  ' ', '▲', ' ', 'n',
  '²',  ' ',  'x', '³', ' ', '∂',
  ' ',  '∫',  ' ', '∲', ' ', 'h',
  'i',  '🐶'
]
> [..."😀 设计制作 電影 →⇒⇄↻ ▲ n² x³ ∂ ∫ ∲ hi🐶".matchAll(/[^]/gu)].map(m => m[0])
[
  '😀', ' ',  '设', '计', '制', '作',
  ' ',  '電',  '影', ' ', '→', '⇒',
  '⇄',  '↻',  ' ', '▲', ' ', 'n',
  '²',  ' ',  'x', '³', ' ', '∂',
  ' ',  '∫',  ' ', '∲', ' ', 'h',
  'i',  '🐶'
]
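
To see what the u modifier changes, it may help to compare the same pattern with and without it on a single emoji. This is a small sketch with made-up variable names. Without u, the regular expression works on UTF-16 code units, so an emoji outside the Basic Multilingual Plane is matched as two separate halves.

```javascript
const dog = "🐶"; // U+1F436, stored as two UTF-16 code units

// Without u: `.` matches one code unit at a time, splitting the emoji.
const withoutU = dog.match(/./g);
console.log(withoutU.length); // 2

// With u: `.` matches one whole code point.
const withU = dog.match(/./gu);
console.log(withU.length); // 1
console.log(withU[0] === dog); // true
```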

Now onto the m modifier. It is about what counts as ^ and $. Usually, ^ matches the beginning of the string and $ matches the end of the string. But what if the string is the content of a text file or HTML file, containing newline characters \n, and we want to treat each line as having its own beginning and end, as if it were a separate string? The m modifier lets us do that. It makes ^ match at the beginning of the string or right after a \n (or another line terminator), and makes $ match at the end of the string or right before a \n. Note that ^ and $ match a position, not a character: they need the \n to be there, but they won't "consume" it, so the \n is never part of the matched result. Example:

> [..."ABCD123\nhi9876\nday3".matchAll(/^[a-z]+\d+$/ig)]
[]

> [..."ABCD123\nhi9876\nday3".matchAll(/^[a-z]+\d+$/img)]
[
  [
    'ABCD123',
    index: 0,
    input: 'ABCD123\nhi9876\nday3',
    groups: undefined
  ],
  [
    'hi9876',
    index: 8,
    input: 'ABCD123\nhi9876\nday3',
    groups: undefined
  ],
  [
    'day3',
    index: 15,
    input: 'ABCD123\nhi9876\nday3',
    groups: undefined
  ]
]

> [..."ABCD123\nhi9876\nday3".matchAll(/^[a-z]+\d+$/img)].map(m => m[0])
[ 'ABCD123', 'hi9876', 'day3' ]

In the above cases, the pattern matches something that begins with one or more letters and ends with one or more digits, with nothing in between. For a simple string like "ABC123", it works as expected even without m. It doesn't work when there are multiple such cases separated by the newline character \n. The m modifier lets ^ and $ match at each individual line (as separated by newlines) in the string. In the first case, without the m modifier, nothing matches. In the second case, with the m modifier, ^ and $ can match next to the newline characters, so each line matches. In the third case, we just take the first element of each match array to get the matched strings themselves. Note that in img, i means case-insensitive, g means global, and img means all three modifiers are used.
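
One way to convince ourselves that ^ and $ only mark positions and never consume the \n is to look at the matched text and where each match starts. The sketch below reuses the string from the examples above; the variable names are only illustrative.

```javascript
const text = "ABCD123\nhi9876\nday3";

const matches = [...text.matchAll(/^[a-z]+\d+$/gim)];

// None of the matched strings contains the newline itself.
const containsNewline = matches.some(m => m[0].includes("\n"));
console.log(containsNewline); // false

// Each match starts at index 0 or right after a "\n".
const starts = matches.map(m => m.index);
console.log(starts); // [ 0, 8, 15 ]

// Because ^ is only a position, replacing it inserts text without removing anything.
const quoted = text.replace(/^/gm, "> ");
console.log(quoted === "> ABCD123\n> hi9876\n> day3"); // true
```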

Also, it may be good to know that m is named multiline, and s is named single line or dotall. But to remember which is which, just remember that s is the one that was missing in JavaScript, with [^] as the workaround. As of February 2020, the s modifier is available in Google Chrome, NodeJS, and Microsoft Edge v79 (which uses Chromium, the project Google Chrome is based on). Firefox doesn't support the s modifier yet, so it is still safer to use [^].
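
If the code has to run on engines that may or may not support the s modifier, one possible approach is to test for it at run time and fall back to [^]. This is a minimal sketch, not an established recipe; supportsDotAll and beginToEnd are hypothetical names chosen for this example.

```javascript
// Hypothetical helper: returns true if this engine accepts the s flag.
function supportsDotAll() {
  try {
    // Engines without the s flag throw a SyntaxError for an unknown flag.
    return new RegExp(".", "s").test("\n");
  } catch (e) {
    return false;
  }
}

// Use `.` with the s flag where available, otherwise [^] with no flag.
const beginToEnd = supportsDotAll()
  ? new RegExp("begin.*end", "s")
  : new RegExp("begin[^]*end");

const input = "begin 123 peter paul mary \n 456 apple banana end";
console.log(beginToEnd.test(input)); // true
```

Building the pattern with the RegExp constructor rather than a literal like /begin.*end/s keeps the file parseable on engines that would reject the unknown flag at parse time.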
