DEV Community

Discussion on: Parse user input for urls, timestamps & hashtags with RegEX 🧠

Ben Sinclair • Edited

If you use const reHash = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g then you'll match " #hello" as " #hello", with the space at the start. I see you're using trim() to fix this later in the code.
You could use this instead, which should cover all the bases: \B asserts that the # does not sit on a word boundary (so it can't directly follow a letter, digit or underscore), and \b requires a word boundary at the end of the tag: /\B#[A-Za-z0-9\-\.\_]+\b/g

This means you don't need to do the \s|^ trickery.

"#one two#three #four five #six_seven".match(/\B#[A-Za-z0-9\-\.\_]+\b/g)
// ["#one", "#four", "#six_seven"]
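
For comparison, here is a small sketch that runs the original pattern and the \B version on the same input. The variable names (sampleInput, reOriginal, reBoundary) are made up for illustration; the two patterns are the ones quoted above.

```javascript
// Hypothetical variable names; both patterns are taken from the comment above.
const sampleInput = "#one two#three #four five #six_seven";

const reOriginal = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g;
const reBoundary = /\B#[A-Za-z0-9\-\.\_]+\b/g;

// The original keeps the surrounding whitespace and also accepts "#three" mid-word.
const originalMatches = sampleInput.match(reOriginal);
// The \B version returns bare tags, so no trim() is needed afterwards.
const boundaryMatches = sampleInput.match(reBoundary);

console.log(originalMatches); // ["#one ", "#three ", "#four ", " #six_seven"]
console.log(boundaryMatches); // ["#one", "#four", "#six_seven"]
```

Whether "two#three" should count as a hashtag is a judgment call; the \B version treats it as part of a word and skips it.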

EDIT: come to think of it, you don't need to escape the characters inside [] either, even the - if it's the last character. And you can make it case-insensitive with the /i flag.

/\B#[a-z0-9._-]+\b/gi
Final answer.
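
A quick way to convince yourself that dropping the escapes and adding /i changes nothing, at least for this sample (the sample string and variable names are assumptions for illustration):

```javascript
// Hypothetical sample; mixes case, the three punctuation characters, and trailing punctuation.
const mixedSample = "#Hello-World #foo.bar #a_b! #done.";

const reEscaped = /\B#[A-Za-z0-9\-\.\_]+\b/g; // escaped, explicit upper and lower case
const reCompact = /\B#[a-z0-9._-]+\b/gi;      // unescaped, case-insensitive

const escapedMatches = mixedSample.match(reEscaped);
const compactMatches = mixedSample.match(reCompact);

console.log(escapedMatches); // ["#Hello-World", "#foo.bar", "#a_b", "#done"]
console.log(compactMatches); // same four tags
```

Note that the trailing \b also drops the final "." from "#done.", because a word boundary can only sit next to a letter, digit or underscore.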

Wait, no, you can improve that by making sure the tag starts with a letter.

/\B#[a-z][a-z0-9._-]*\b/gi
Final final answer :)
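
To see what the leading [a-z] buys you, here is a sketch comparing the last two patterns on a made-up input (tagSample and the variable names are assumptions, not from the article):

```javascript
// Hypothetical input: two tags that start with a digit or underscore, two that start with a letter.
const tagSample = "#2020 #web3 #_x #a";

const reAnyStart = /\B#[a-z0-9._-]+\b/gi;         // any allowed character may come first
const reLetterStart = /\B#[a-z][a-z0-9._-]*\b/gi; // first character must be a letter

const anyStartMatches = tagSample.match(reAnyStart);
const letterStartMatches = tagSample.match(reLetterStart);

console.log(anyStartMatches);    // ["#2020", "#web3", "#_x", "#a"]
console.log(letterStartMatches); // ["#web3", "#a"]
```

Switching from + to * on the second class keeps single-letter tags such as "#a" valid.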

benjaminadk

Thanks for the finer points. So in [], i.e. a character class (or set), the only characters that must be escaped are \ (backslash), ^ and -, and the hyphen can go unescaped if it's last. I have to research word boundaries more. I sort of hacked my hashtag solution because the preceding match would gobble up the space character that the next match needed. Wow, trying to explain RegEx thinking is ridiculous. But yeah, it is crazy that one line of code can take this long to understand.

Ben Sinclair

You don't need to escape ^ in a character class unless it's the very first character, where it negates the class; anywhere else it has no ambiguous meaning. $ never needs escaping inside a class.
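
The escaping rules discussed in this thread can be checked directly in JavaScript. A minimal sketch with made-up variable names; the one case to watch is ^ as the first character of a class, where it means "not":

```javascript
// Hypothetical variable names; each line tests one escaping rule for character classes.
const caretIsLiteral = /[a^]/.test("^");         // ^ after the first position is a plain caret
const dollarIsLiteral = /[$]/.test("$");         // $ has no special meaning inside a class
const leadingCaretNegates = /[^a]/.test("a");    // [^a] means "anything except a"
const escapedLeadingCaret = /[\^a]/.test("^");   // escape ^ to use it literally in first position
const trailingHyphenIsLiteral = /[a-z_-]/.test("-"); // a hyphen in last position is a plain hyphen
const rangeHyphenIsNotLiteral = /[a-c]/.test("-");   // between two characters it forms a range

console.log(caretIsLiteral, dollarIsLiteral);       // true true
console.log(leadingCaretNegates);                   // false
console.log(escapedLeadingCaret);                   // true
console.log(trailingHyphenIsLiteral, rangeHyphenIsNotLiteral); // true false
```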