benjaminadk
Parse user input for urls, timestamps & hashtags with RegEX 🧠

Video code-along version of tutorial 📽

I used to avoid Regular Expressions aka RegEx at all costs. Aside from not understanding how to use it I didn't see much purpose for it in my code. I suppose, to be fair, RegEx isn't exactly beginner friendly. Nevertheless, I now find myself looking for opportunities to use it. The truth is that RegEx can save a lot of development time and is a powerful tool.

Recently, I have been focused on re-creating parts of YouTube, and I noticed something simple, yet cool, about video descriptions and comments. Users can enter urls, timestamps and hashtags and YouTube will parse the input and transform the text into links. Urls become external links, timestamps are links that seek the current video to a specific spot and hashtags become search terms to find related content.


There are some good tools and sites out there to learn about this. Sometimes just googling regex for <whatever> will bring up some good Stack Overflow answers. RegExr is really cool. You can create an account and save your expressions into a library of your own. Plus they break down each character and what it does, not to mention a database of community expressions. Regular Expressions Info has more detailed breakdowns of pretty much anything and everything related to RegEx.

Now this tutorial assumes you have already captured and stored the user input. That is the raw text we are parsing. From here we need to address a few things as we process the text into HTML.

  1. Preserve the formatting of text - spacing, line breaks, etc
  2. Fit the text into an HTML element
  3. Parse text for urls, timestamps (HH:MM:SS format) and hashtags
  4. Replace these with appropriate links, target and params if needed
  5. Bonus: set the time of video, perform a search based on hashtag term

⚠ Disclaimer - all code examples will use React and/or JSX syntax and therefore JavaScript

Preserving the format is pretty easy. One option is the HTML pre tag. pre is short for preformatted text.

<pre>{description}</pre>

Another option is to use the white-space CSS property set to pre. We might as well use pre-wrap. Otherwise long lines of text will overflow their container.

<div style={{whiteSpace: 'pre-wrap'}}>{description}</div>

Now we need to bust out the big guns 🔫. First we need to find, and somewhat understand, the regular expressions involved. Here is a pretty standard expression to find http/s urls. It basically looks for http://anything, but it seems to do the trick. Note the g flag, which matches all occurrences, and the i flag, which ignores case. It can also match ftp and file urls by using the OR operator (|) in the scheme group at the start.

const reUrl = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
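To see what this expression actually picks up, here is a small sketch run against a made-up sentence. The sample text and the example.com style addresses are placeholders, not taken from the original project.

```javascript
const reUrl = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi

// Hypothetical sample input
const sample = 'Docs at https://example.com/a?b=1. Also HTTP://EXAMPLE.ORG, and ftp://files.example.net/x.'

// With the g flag, match() returns every full match and ignores the capture groups
const urls = sample.match(reUrl)
console.log(urls)
// [ 'https://example.com/a?b=1', 'HTTP://EXAMPLE.ORG', 'ftp://files.example.net/x' ]
```

The final character class leaves out punctuation such as . and , which is why a url at the end of a sentence does not swallow the trailing period or comma.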

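The timestamp expression isn't quite as bad. Note that (?:...)? sets up an optional non-capturing group, while the plain parentheses inside it capture the hours, minutes and seconds. [0-5] makes sense because when dealing with HH:MM:SS you won't see 01:90:90; the highest minute or second is 59. Anyways, this is also set up to match MM:SS and :SS, which is cool. This allows the user a little more flexibility in what they can use as time links. Also note the \s at each end: a timestamp only matches when it has whitespace on both sides, and that whitespace is part of the match.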

const reTime = /\s(?:(?:([01]?\d):)?([0-5]?\d))?:([0-5]?\d)\s/g
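A quick sketch of how this expression behaves on a few made-up strings (the sample sentences are invented for illustration). It shows two things worth knowing before relying on it: the matches carry their surrounding whitespace, and two timestamps separated by a single space cannot both match.

```javascript
const reTime = /\s(?:(?:([01]?\d):)?([0-5]?\d))?:([0-5]?\d)\s/g

// Hypothetical sample inputs
const spaced = 'Intro at 0:45 then the demo at 1:02:03 and done'
const adjacent = 'Jump to 0:45 1:30 for more'

// Each match includes the whitespace on both sides
const spacedMatches = spaced.match(reTime)
console.log(spacedMatches) // [ ' 0:45 ', ' 1:02:03 ' ]

// The space between the two timestamps is consumed by the first match,
// so the second one has no leading whitespace left to match
const adjacentMatches = adjacent.match(reTime)
console.log(adjacentMatches) // [ ' 0:45 ' ]

// A timestamp with no surrounding whitespace is not matched at all
const bareMatches = '0:45'.match(reTime)
console.log(bareMatches) // null
```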

Ok, let's get down to the function itself. We are going to leverage the replace method on the String prototype. String.prototype.replace can take a RegEx as the first argument and a function as the second. This callback function can receive many arguments, but the first is the matched text itself. This means we can use the original url/timestamp/hashtag in our replacement string. The idea is to replace our matches with the appropriate HTML. To keep things simple, we'll start with urls. This process is commonly called linkify. Get it? 🧠

function linkify(text) {
    return text.replace(reUrl, url => `<a href="${url}" target="_blank">${url}</a>`)
}
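One thing the snippet above leaves open is how the resulting HTML string gets onto the page. In React, a string containing tags is rendered as plain text unless it is passed through the dangerouslySetInnerHTML prop, and at that point any HTML the user typed would be rendered too. Here is a minimal sketch of one way to handle that, assuming a hypothetical escapeHtml helper and a linkifySafe wrapper (neither is part of the original tutorial) and adding a rel attribute as an extra precaution. It is not a hardened sanitizer: for example, an escaped entity sitting directly after a url can get pulled into the match.

```javascript
const reUrl = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi

// Hypothetical helper: neutralize characters that are special in HTML
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Escape first, then wrap urls, so the only tags in the output are our own anchors
function linkifySafe(text) {
  return escapeHtml(text).replace(
    reUrl,
    url => `<a href="${url}" target="_blank" rel="noopener noreferrer">${url}</a>`
  )
}

const html = linkifySafe('<b>hi</b> see https://example.com')
console.log(html)
// &lt;b&gt;hi&lt;/b&gt; see <a href="https://example.com" target="_blank" rel="noopener noreferrer">https://example.com</a>
```

In a component, the result could then be rendered with something like <div style={{ whiteSpace: 'pre-wrap' }} dangerouslySetInnerHTML={{ __html: html }} />.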

I used an arrow function and returned a template string to save on space. Setting target to _blank ensures that the link opens in a new tab or window. Template strings you should probably know about by now.

Dealing with the timestamps is a little more advanced. We are going to need a helper function and some additional logic to make them useful. Assume we have a video player, like YouTube, for this example. We want to display the timestamp link in HH:MM:SS format but we need to convert that value to seconds so we can set a search parameter and have a value that we can send to our player - The HTML video element has a property called currentTime which gets/sets the time of the video in...seconds! We also need the value of the url to our player's page on our site.

function HHMMSStoSeconds(str) {
  var p = str.split(':')
  var s = 0
  var m = 1

  while (p.length > 0) {
    // An empty part (the ":SS" form) parses as NaN, so fall back to 0
    s += m * (parseInt(p.pop(), 10) || 0)
    m *= 60
  }

  return s
}

function linkify(text) {
    const playerUrl = 'http://www.youtube.com/watch'
    return text.replace(reTime, time => {
        const seconds = HHMMSStoSeconds(time)
        return `<a href="${playerUrl}?time=${seconds}">${time}</a>`
    })
}
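To make the conversion concrete: 1:02:03 is 1 × 3600 + 2 × 60 + 3 = 3723 seconds. The same arithmetic can also be written as a left-to-right fold. This is an alternative sketch under a hypothetical name (toSeconds), not the helper used in the rest of the post. It trims the whitespace that reTime includes in its matches and treats an empty part as zero so the :SS form still produces a number.

```javascript
function toSeconds(str) {
  return str
    .trim()
    .split(':')
    // Each step shifts the running total up one unit (x60) and adds the next part
    .reduce((total, part) => total * 60 + (parseInt(part, 10) || 0), 0)
}

console.log(toSeconds('1:02:03')) // 3723
console.log(toSeconds(' 2:30 '))  // 150
console.log(toSeconds(':45'))     // 45
```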

As a side note I really like the string to seconds function. It's been a while since I used a while loop. 🤓

Now when a user clicks a timestamp link we can implement some tricky logic in our React component to seek the video to the time specified in the link.


class Player extends React.Component {

    componentDidMount() {
        const params = new URLSearchParams(window.location.search)
        const time = params.get('time')
        if(time) {
            this.video.currentTime = time
        }
    }

    render() {
        return <video ref={el => (this.video = el)} src={this.props.src} />
    }
}
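The time param comes straight from the URL, so it may be missing, non-numeric or negative. Here is a small sketch of a guard, using a hypothetical getStartTime helper (not part of the original code) that returns null when there is nothing valid to seek to.

```javascript
function getStartTime(search) {
  // URLSearchParams accepts the query string with or without the leading "?"
  const raw = new URLSearchParams(search).get('time')
  if (raw === null || raw.trim() === '') return null

  // Only accept finite, non-negative numbers of seconds
  const seconds = Number(raw)
  return Number.isFinite(seconds) && seconds >= 0 ? seconds : null
}

console.log(getStartTime('?time=90'))  // 90
console.log(getStartTime('?time=abc')) // null
console.log(getStartTime(''))          // null
```

Inside componentDidMount this could be used as const start = getStartTime(window.location.search), followed by setting this.video.currentTime only when start is not null.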

This may look weird because we are used to routing libraries, but it works. Learn about URLSearchParams. Using a ref is also key here. Refs are a feature of React that gives us access to the underlying DOM node and all the built-in APIs that go with it. React Refs and HTML video/audio DOM... are helpful.

Hashtags work in a very similar way to timestamps. It is up to the developer to decide how to implement them into the UI. YouTube runs a search for anything related to the hashtag term. The expression to match hashtags might look something like this.

const reHash = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g

This one is actually almost understandable. But we can break it down as follows.

(?: // start of non-capture group
\s  // match whitespace character
|   // logical OR
^   // beginning of string
)   // end non-capture group
?   // match 0 or 1 of preceding
#   // match # character
[]  // enclosed character set
A-Z // capital A through Z
a-z // lowercase a through z
0-9 // digits 0 through 9
\-  // \ is an escape character, matches a literal -
\.  // matches a literal .
\_  // matches a literal _
+   // requires 1 or more match of preceding token
(?:\s|$) // non-capture group: whitespace OR end of string
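Here is a short sketch of what the hashtag expression returns on a couple of made-up strings (the sample inputs are invented). It shows that the matches carry their surrounding whitespace, and that the optional leading group also lets a # in the middle of a word match.

```javascript
const reHash = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g

// Hypothetical sample inputs
const tagged = 'Love #regex and #js-tips'
const midWord = 'two#three #ok'

// Matches include the whitespace captured on either side
const taggedMatches = tagged.match(reHash)
console.log(taggedMatches) // [ ' #regex ', ' #js-tips' ]

// Because the leading group is optional, "#three" matches even though it follows a letter
const midWordMatches = midWord.match(reHash)
console.log(midWordMatches) // [ '#three ', '#ok' ]

// Stripping the # and trimming gives the bare search term
const terms = taggedMatches.map(h => h.replace('#', '').trim())
console.log(terms) // [ 'regex', 'js-tips' ]
```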

Now we can lump everything together into one big function. Of course everyone's needs are different, but the following would be something like YouTube. This time I am passing a video object. This is just one way to do it. In my implementation I don't see much sense in making timestamp links if the time is greater than the duration of the video. Check out the if/else block: returning the callback's parameter unchanged leaves that specific match as it was, as if we had ignored it. Worthwhile.

import HHMMSStoSeconds from './above-this'

const reUrl = /(\b(https?):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
const reTime = /\s(?:(?:([01]?\d):)?([0-5]?\d))?:([0-5]?\d)\s/g
const reHash = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g
const frontend = 'https://www.youtube.com'

export default function linkify(video) {
  return (
    video.description
      .replace(reUrl, url => `<a href="${url}" target="_blank">${url}</a>`)
      .replace(reTime, time => {
        const secs = HHMMSStoSeconds(time)
        if (secs > video.duration) {
          return time
        } else {
          return `<a href="${frontend}/watch?id=${video.id}&t=${secs}">${time}</a>`
        }
      })
      .replace(
        reHash,
        hash => `<a href="${frontend}/search?term=${hash.replace('#', '').trim()}">${hash}</a>`
      )
  )
}

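As a quick check of the combined function, here is a sketch that runs it on a made-up video object. The id, duration and description values are invented for illustration. The expressions are repeated, and the seconds helper is written in a compact form, only so the snippet runs on its own.

```javascript
const reUrl = /(\b(https?):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi
const reTime = /\s(?:(?:([01]?\d):)?([0-5]?\d))?:([0-5]?\d)\s/g
const reHash = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g
const frontend = 'https://www.youtube.com'

// Compact stand-in for the helper above; parseInt ignores the leading whitespace in a match
function HHMMSStoSeconds(str) {
  return str.split(':').reduce((total, part) => total * 60 + (parseInt(part, 10) || 0), 0)
}

function linkify(video) {
  return video.description
    .replace(reUrl, url => `<a href="${url}" target="_blank">${url}</a>`)
    .replace(reTime, time => {
      const secs = HHMMSStoSeconds(time)
      return secs > video.duration
        ? time
        : `<a href="${frontend}/watch?id=${video.id}&t=${secs}">${time}</a>`
    })
    .replace(reHash, hash => `<a href="${frontend}/search?term=${hash.replace('#', '').trim()}">${hash}</a>`)
}

// Hypothetical video record, 5 minutes (300 seconds) long
const video = {
  id: 'abc123',
  duration: 300,
  description: 'Demo https://example.com\nSkip to 1:30 now\nToo late 9:59 here\n#regex'
}

const html = linkify(video)
console.log(html)
// 1:30 (90s) is within the duration, so it becomes a link with t=90
// 9:59 (599s) is past the end of the video, so it stays plain text
```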

So if you actually made it this far, you for sure learned something. It took me a good part of a day to figure all this stuff out, and I had to pull from all kinds of different websites and searches. Why not put it all down in the same place? Naturally, there are probably more efficient or more thorough RegEx out there. But these seem to work well for my use case.

Parser Tutorial

Clone Component Series

My YouTube Channel

Library that does all this for you

Oldest comments (3)

Ben Sinclair • Edited

If you use const reHash = /(?:\s|^)?#[A-Za-z0-9\-\.\_]+(?:\s|$)/g then you'll match " #hello" as " #hello", with the space at the start. I see you're using trim() to fix this later in the code.
You could use this instead, which should cover all the bases using \B to match against non-word-boundary characters at the start and \b to match word-boundaries at the end: /\B#[A-Za-z0-9\-\.\_]+\b/g

This means you don't need to do the \s|^ trickery.

"#one two#three #four five #six_seven".match(/\B#[A-Za-z0-9\-\.\_]+\b/g)
// ["#one", "#four", "#six_seven"]

EDIT: come to think of it, you don't need to escape the characters inside [] either, even the - if it's the last character. And you can make it case-insensitive with the /i flag.

/\B#[a-z0-9._-]+\b/gi
Final answer.

Wait, no you can improve that by making sure it starts with a letter.

/\B#[a-z][a-z0-9._-]*\b/gi
Final final answer :)

benjaminadk

Thanks for the finer points. So in [], a character class or set, the only characters that must be escaped are \ (backslash), ^ and -. And the hyphen can be unescaped if it's last. I have to research more on word boundaries. I sort of hacked my hashtag solution because the preceding match would gobble up the space character needed by the next. Wow, trying to explain RegEx thoughts is ridiculous. But yah, it is crazy that one line of code can take this long to understand.

Ben Sinclair

You don't need to escape ^ in a character class unless it's the first character, where it negates the class. $ has no special meaning inside a class, so it never needs escaping there.