Matt Kenefick

Regex: Fix duplicate slashes without affecting protocol

Let’s say you want to fix a URL that looks like:

```
https://www.example.com/my/path//to-file.jpg
```

A plain string replace or a naive regex such as /+ would also “fix” the double slash that follows the protocol, turning https:// into https:/. A negative lookbehind avoids that: it refuses to start a match on a slash that comes directly after a colon.

```
(?<!:)/+
```
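To make the difference concrete, here is a minimal JavaScript sketch comparing a naive pattern with the lookbehind version. The variable names are illustrative, and it assumes a runtime whose regex engine supports lookbehind.

```javascript
const url = 'https://www.example.com/my/path//to-file.jpg';

// Naive pattern: collapses every run of slashes, including the one after "https:".
const naive = url.replace(/\/+/g, '/');

// Lookbehind pattern: a match may not start on a slash that directly follows a colon.
const careful = url.replace(/(?<!:)\/+/g, '/');

console.log(naive);   // https:/www.example.com/my/path/to-file.jpg
console.log(careful); // https://www.example.com/my/path/to-file.jpg
```

The protocol's double slash survives because its first slash is skipped (it follows the colon), and its second slash is matched on its own and replaced with a single slash, which changes nothing.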

### For PHP:

```
<?php
$url = 'https://www.example.com/my/path//to-file.jpg';
// '#' is the pattern delimiter, so the slashes inside need no escaping.
$str = preg_replace('#(?<!:)/+#', '/', $url);
echo $str;
// https://www.example.com/my/path/to-file.jpg
```



### For JavaScript:


```
let url = 'https://www.example.com/my/path//to-file.jpg';
// replaceAll returns a new string and leaves url itself unchanged.
url.replaceAll(/(?<!:)\/+/gm, '/');
// "https://www.example.com/my/path/to-file.jpg"
```


Top comments (2)

Josh Cheek • Edited

replaceAll is really new and, eg, not available on my version of node (14.16), but the normal replace works fine with the /g flag.
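As a rough sketch of that point (variable names are illustrative, and lookbehind support in the runtime is assumed), the g flag is what makes replace cover every match:

```javascript
const input = 'https://www.example.com/my/path//to-file.jpg';

// Without the g flag, replace() stops after the first match.
const firstOnly = input.replace(/(?<!:)\/+/, '/');

// With the g flag, replace() handles every match, much like replaceAll().
const everyMatch = input.replace(/(?<!:)\/+/g, '/');

console.log(firstOnly === input); // true
console.log(everyMatch);          // https://www.example.com/my/path/to-file.jpg
```

Without g the string comes back unchanged here, because the first match is the second slash of the protocol, which is replaced by itself.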


It is usually best to use libraries for things that have well specified structures. The libs will actually parse the string according to the spec and make sure everything is valid. I haven't done this in JS before, but looking at docs, it seems like it should be this (note that I don't have a Windows machine to test it on, I assume the path.posix should do it, but haven't verified):

```
const path = require("path")

function normalizePath(urlString) {
  const url = new URL(urlString)
  url.pathname = path.posix.normalize(url.pathname)
  return url.toString()
}

console.log(normalizePath("http://localhost:5002///abc:///://"))
```

Here, I've given it a pretty wonky looking path, but that path is valid. Yeah, you can apparently have colons in the path 🤷 (see pchar here).
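A quick sketch of why that matters for the regex from the post: the lookbehind cannot tell the protocol's colon from a colon inside the path. The variable names here are illustrative.

```javascript
const wonky = 'http://localhost:5002///abc:///://';

// Same pattern as in the post, applied to a path that contains colons.
const viaRegex = wonky.replace(/(?<!:)\/+/g, '/');

console.log(viaRegex); // http://localhost:5002/abc://://
```

The run after 5002 collapses as intended, but the runs after abc: and after the lone : each keep a doubled slash, because their first slash follows a colon and is skipped.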


All that said, you actually can parse a URI with a regex, but it's a bit of a chore. Eg Ruby's standard library does it:

```
$ ruby -r uri -e 'p URI::RFC3986_Parser::RFC3986_URI' | wc -c
    1067

$ ruby -r uri -e 'p URI::RFC3986_Parser::RFC3986_URI'
/\A(?<URI>(?<scheme>[A-Za-z][+\-.0-9A-Za-z]*):(?<hier-part>\/\/(?<authority>(?:(?<userinfo>(?:%\h\h|[!$&-.0-;=A-Z_a-z~])*)@)?(?<host>(?<IP-literal>\[(?:(?<IPv6address>(?:\h{1,4}:){6}(?<ls32>\h{1,4}:\h{1,4}|(?<IPv4address>(?<dec-octet>[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]|\d)\.\g<dec-octet>\.\g<dec-octet>\.\g<dec-octet>))|::(?:\h{1,4}:){5}\g<ls32>|\h{1,4}?::(?:\h{1,4}:){4}\g<ls32>|(?:(?:\h{1,4}:)?\h{1,4})?::(?:\h{1,4}:){3}\g<ls32>|(?:(?:\h{1,4}:){,2}\h{1,4})?::(?:\h{1,4}:){2}\g<ls32>|(?:(?:\h{1,4}:){,3}\h{1,4})?::\h{1,4}:\g<ls32>|(?:(?:\h{1,4}:){,4}\h{1,4})?::\g<ls32>|(?:(?:\h{1,4}:){,5}\h{1,4})?::\h{1,4}|(?:(?:\h{1,4}:){,6}\h{1,4})?::)|(?<IPvFuture>v\h+\.[!$&-.0-;=A-Z_a-z~]+))\])|\g<IPv4address>|(?<reg-name>(?:%\h\h|[!$&-.0-9;=A-Z_a-z~])+))?(?::(?<port>\d*))?)(?<path-abempty>(?:\/(?<segment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])*))*)|(?<path-absolute>\/(?:(?<segment-nz>(?:%\h\h|[!$&-.0-;=@-Z_a-z~])+)(?:\/\g<segment>)*)?)|(?<path-rootless>\g<segment-nz>(?:\/\g<segment>)*)|(?<path-empty>))(?:\?(?<query>[^#]*))?(?:\#(?<fragment>(?:%\h\h|[!$&-.0-;=@-Z_a-z~\/?])*))?)\z/
```
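For a much shorter, non-validating take on parsing with a regex, RFC 3986 Appendix B gives a pattern that only splits a URI into its components. The JavaScript sketch below uses it to collapse slashes in the path alone; the function name collapsePathSlashes is made up for illustration, and nothing here checks that the input is a valid URI.

```javascript
// RFC 3986, Appendix B: splits a URI reference into its parts without validating it.
// Groups: 1 = "scheme:", 3 = "//authority", 5 = path, 6 = "?query", 8 = "#fragment".
const URI_PARTS = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Hypothetical helper: collapse repeated slashes in the path component only.
function collapsePathSlashes(uri) {
  // Every group is optional, so this pattern matches any string.
  const m = URI_PARTS.exec(uri);
  const path = m[5].replace(/\/{2,}/g, '/');
  return (m[1] || '') + (m[3] || '') + path + (m[6] || '') + (m[8] || '');
}

console.log(collapsePathSlashes('http://localhost:5002///abc:///://'));
// http://localhost:5002/abc:/:/
console.log(collapsePathSlashes('https://www.example.com/my/path//to-file.jpg?next=//cdn'));
// https://www.example.com/my/path/to-file.jpg?next=//cdn
```

Because only the path group is rewritten, colons in the path and double slashes in the query string are left alone.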
GrahamTheDev

Simple but effective. Do you not need it to be url = url.replaceAll to work in JS, though?