Why IMDb IDs cannot be trusted and why you should not use padded IDs

#webdev #programming

We can all agree that IDs should be unambiguous, i.e. if I tell you to fetch a resource "1", I expect you to get the same resource every time. I also expect that I can store this ID as a reference and use it to fetch the same resource at a later date.

On the surface, IMDb is no different, e.g. if you search for 'A Beautiful Day In The Neighborhood (2019)', you will discover https://www.imdb.com/title/tt03224458/. tt03224458 is the unique title identifier. If you bookmark this URL, you will be able to find the same title on the website at a later date.

However, one thing that is unique about IMDb IDs is that they are padded with leading zeros, and the amount of padding is not fixed, e.g. tt3224458 and tt03224458 both fetch the same resource.

Presumably, this is done for historic or aesthetic reasons. Back when IMDb just started, it was just a bunch of text files with a bash script used to search the database. My guess is that ID padding was a common way at the time to align data in columns. Unfortunately, this has consequences in today's interconnected world.
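To make that equivalence concrete, here is a minimal Python sketch that compares two IDs by their numeric part rather than as strings. The names `imdb_number` and `same_title` are hypothetical helpers made up for this illustration, and the sketch deliberately ignores any rule about a minimum number of digits.

```python
import re

# "tt" followed by one or more ASCII digits (assumed shape of a title ID).
IMDB_TITLE_ID = re.compile(r"tt([0-9]+)")


def imdb_number(raw: str) -> int:
    """Return the numeric part of an IMDb title ID, ignoring zero padding."""
    match = IMDB_TITLE_ID.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not an IMDb title ID: {raw!r}")
    return int(match.group(1))


def same_title(left: str, right: str) -> bool:
    """Two IDs refer to the same title when their numeric parts are equal."""
    return imdb_number(left) == imdb_number(right)
```

A plain string comparison treats "tt3224458" and "tt03224458" as different IDs, while `same_title("tt3224458", "tt03224458")` treats them as one.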

Other services may refer to IMDb titles, and yet other services may refer to those references, and so on. No one in this chain is aware of the ID semantics, i.e. that there is no difference between tt0000001 and tt0000000000000000001 (and that tt1, tt01, [...], tt000001 are for some reason invalid). This includes services like TMDb and Wikidata. For example, Wikidata can be used to find the IDs of the same title on different websites:

SELECT
  ?ROTTEN_TOMATOES_ID
  ?METACRITIC_ID
  ?TMDB_ID
WHERE
{
  ?item wdt:P345 "tt3224458" .
  OPTIONAL {
    ?item wdt:P1258 ?ROTTEN_TOMATOES_ID .
  }
  OPTIONAL {
    ?item wdt:P1712 ?METACRITIC_ID .
  }
  OPTIONAL {
    ?item wdt:P4947 ?TMDB_ID .
  }
}

Result:

{
  "head": {
    "vars": [
      "ROTTEN_TOMATOES_ID",
      "METACRITIC_ID",
      "TMDB_ID"
    ]
  },
  "results": {
    "bindings": [
      {
        "TMDB_ID": {
          "type": "literal",
          "value": "501907"
        }
      }
    ]
  }
}

As you can guess, the same query returns no results when the ID is entered as tt03224458, or in any notation other than tt3224458.
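Until the ambiguity is fixed at the source, a consumer can defend itself by normalizing IDs before a lookup. The sketch below is one possible approach, not an official rule: it assumes that the stored form is "tt" followed by the number padded to at least seven digits, which matches tt3224458 and tt0000001 above. `canonical_imdb_id` and `build_query` are hypothetical names, and the query is a shortened version of the one shown earlier.

```python
import re

IMDB_TITLE_ID = re.compile(r"tt([0-9]+)")

# Shortened version of the Wikidata query above; doubled braces are literal.
QUERY_TEMPLATE = """SELECT ?TMDB_ID
WHERE
{{
  ?item wdt:P345 "{imdb_id}" .
  OPTIONAL {{
    ?item wdt:P4947 ?TMDB_ID .
  }}
}}"""


def canonical_imdb_id(raw: str) -> str:
    """Strip extra padding, then pad back to at least seven digits.

    The seven-digit minimum is an assumption of this sketch.
    """
    match = IMDB_TITLE_ID.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not an IMDb title ID: {raw!r}")
    return f"tt{int(match.group(1)):07d}"


def build_query(raw: str) -> str:
    """Build the SPARQL query using the normalized ID only."""
    return QUERY_TEMPLATE.format(imdb_id=canonical_imdb_id(raw))
```

With this in place, `build_query("tt03224458")` and `build_query("tt3224458")` produce the same query text, so both notations find the same item.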

The lesson here is that as your service grows in importance, other services will reference its resources and treat it as the authority, so you should design your IDs to be immutable and to have exactly one valid representation.

In terms of what IMDb could do to fix this, I see a two-step solution:

1. Accept IDs without zero padding as valid IDs, i.e. https://www.imdb.com/title/tt1/ should be a valid URL/ID.
2. Redirect padded notations to those without padding.
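The two steps above could be sketched roughly as follows. This is only an illustration of the proposal, not how IMDb routes requests: `redirect_target` is a hypothetical helper that takes a request path and returns the unpadded path to redirect to, or `None` when no redirect is needed.

```python
import re
from typing import Optional

# Title pages are assumed to live under /title/tt<digits>/.
TITLE_PATH = re.compile(r"/title/tt([0-9]+)/?")


def redirect_target(path: str) -> Optional[str]:
    """Return the canonical (unpadded) path, or None if no redirect applies."""
    match = TITLE_PATH.fullmatch(path)
    if match is None:
        return None  # not a title page, leave it alone
    # int() drops every leading zero, so tt0000001 becomes tt1.
    canonical = f"/title/tt{int(match.group(1))}/"
    return None if path == canonical else canonical
```

For example, `redirect_target("/title/tt03224458/")` returns "/title/tt3224458/", while the already canonical "/title/tt3224458/" returns `None` and is served directly.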

Over time, this would fix the ambiguity.