Why URLs are Hard: Path Parameters and urlparse

#python #todayilearned #url

This is a cross-post from my blog Python ♥ HTTP. If you enjoy my content and want it sooner, you can follow me via RSS.

Welcome to the first installment of "Why URLs are Hard": a series of stories that I've accumulated from reading a lot about URLs.

We take URLs for granted and mostly think of them as very simple things because of how often we interact with clean and simple URLs like https://example.com. Little do you know, decades of ancient dark magic occurred before we ended up with the URLs we know and love today.

This story is about finding a mysterious API in Python's urlparse function and discovering a now almost entirely unused URL feature. Come along with me! :)

Comparing urlparse to RFC 3986

I was evaluating urlparse from the urllib.parse module and how it performed compared to other URL parser libraries.

The documentation mentions that URLs are parsed according to RFC 3986, which is a set of rules that describe how to segment a URL into different components. Let's take a quick look at that standard to see which parts of a URL it defines.

There's a cute little ASCII diagram showing off all the parts of a URL:

 foo://example.com:8042/over/there?name=ferret#nose
 \_/   \______________/\_________/ \_________/ \__/
  |           |            |            |        |
scheme     authority       path        query   fragment

... and then the authority section is further decomposed into:

authority = [ userinfo "@" ] host [ ":" port ]

One of the best parts of reading RFCs is thinking about how much effort people put into the adorable ASCII art :)

Okay, now that we know what to expect let's try out urlparse with the URL from the RFC:

>>> from urllib.parse import urlparse
>>> url = (
...     "foo://user:pass@example.com:8042"
...     "/over/there?name=ferret#nose"
... )
>>> parts = urlparse(url)
>>> parts
ParseResult(
    scheme='foo',
    netloc='user:pass@example.com:8042',
    path='/over/there',
    params='',
    query='name=ferret',
    fragment='nose'
)
>>> parts.hostname
'example.com'
>>> parts.port
8042
>>> parts.username
'user'
>>> parts.password
'pass'

Okay, so it looks like this is the mapping from ParseResult to RFC 3986:

parts.scheme -> scheme
parts.netloc -> authority
- parts.username:password -> userinfo
- parts.hostname -> host
- parts.port -> port
parts.path -> path
parts.params -> ???
parts.query -> query
parts.fragment -> fragment

Notice the ??? in the list? I was confused too. No matter what I put into my URL I couldn't get anything to show up in ParseResult.params.

The documentation for ParseResult.params is "Parameters for last path element", and it is barely mentioned anywhere else. Googling around is tough too, because "params" is Requests' way of adding to the query string for the requested URL, so most results are about that.

When googling "Path parameters" I found this article from 2008 which pointed to the last paragraph of RFC 3986 Section 3.3 which explains path parameters:

Aside from dot-segments in hierarchical paths,
a path segment is considered opaque by the
generic syntax.  URI producing applications
often use the reserved characters allowed in a
segment to delimit scheme-specific or dereference-
handler-specific subcomponents.  For example,
the semicolon (";") and equals ("=") reserved
characters are often used to delimit parameters
and parameter values applicable to that segment.

So ; and = have special meaning within the path, let's throw those into urlparse and see what happens:

>>> urlparse("http://example.com/a;z=y;x/b;c;d=e")
ParseResult(
    scheme='http',
    netloc='example.com',
    path='/a;z=y;x/b',
    params='c;d=e',
    query='',
    fragment=''
)

Huh, I didn't expect it to actually pull the values out of the path component. And it looks like it only pulled the params from the last segment; /a;z=y;x/ is untouched. I wonder how many bugs are lurking out there because of this quirk. :)
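For contrast, here is a minimal sketch of what splitting parameters off every segment could look like, following the ";" and "=" convention from the RFC paragraph quoted above. The split_segment helper is hypothetical (it is not part of urllib.parse), and it makes no attempt at percent-decoding or other edge cases:

```python
def split_segment(segment):
    """Split one path segment into (name, params) on ';' and '='.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    name, *raw_params = segment.split(";")
    params = {}
    for item in raw_params:
        key, sep, value = item.partition("=")
        # A bare parameter such as "x" has no "=", so record it with value None.
        params[key] = value if sep else None
    return name, params


path = "/a;z=y;x/b;c;d=e"
# Drop the empty string before the leading "/" and parse each segment.
segments = [split_segment(s) for s in path.split("/")[1:]]
print(segments)
# [('a', {'z': 'y', 'x': None}), ('b', {'c': None, 'd': 'e'})]
```

Here both segments carry parameters, while urlparse only reports the ones attached to the last segment.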

So if you're relying on URL parsing and directly inspecting the path component, check your implementation and amend it to append f";{result.params}" to the path whenever params is non-empty. Either that, or use a URL parser that doesn't have this quirk, like rfc3986.
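As a rough sketch of that amendment, the snippet below rejoins path and params. The full_path helper is a hypothetical name of mine, not something urllib.parse provides. It also shows urlsplit from the same module, which is documented as not splitting the params from the URL, so its path comes back whole:

```python
from urllib.parse import ParseResult, urlparse, urlsplit


def full_path(result: ParseResult) -> str:
    """Rejoin path and params the way they appeared in the URL.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    if result.params:
        return f"{result.path};{result.params}"
    return result.path


url = "http://example.com/a;z=y;x/b;c;d=e"

print(full_path(urlparse(url)))  # /a;z=y;x/b;c;d=e
print(urlsplit(url).path)        # /a;z=y;x/b;c;d=e
```

One caveat with the helper: if the last segment ends in a bare ";", urlparse reports empty params, so the trailing semicolon can't be restored this way. urlsplit keeps it, which is one reason it may be the simpler choice when you only need the path.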

I especially recommend using another library if you're making security decisions based on the URL. A write-up from 2011 details a security issue related to path parameters which an application using ParseResult.path alone would likely also be vulnerable to.

Hope you learned something and stay safe!
