In a distributed system, we often need to store data that we do not own. We might use it as a unique identifier across domains or the system we do own needs to proxy it to yet another service.
+----+--------------+----------------------+
| id | name | backend |
+----+--------------+----------------------+
| 42 | Marley Spoon | elixir, ruby, python |
+----+--------------+----------------------+
{
"id": 42,
"name": "Marley Spoon",
"backend": ["elixir", "ruby", "python"]
}
+------------+--------------+----------------------+
| company_id | company_name | backend |
+------------+--------------+----------------------+
| 42 | Marley Spoon | elixir, ruby, python |
+------------+--------------+----------------------+
But here's the problem: as engineers, we look at _id
fields and immediately think of it as integers. However, the consuming service has no control over the data it receives and the data type is only assumed.
If you use that distributed ID field as a local foreign key: some external system controls the value and an unforeseen change might break our setup.
Identification
It has proven valuable to us to use the pattern *_identifier
to indicate that…
- it is some kind of a unique identifier
- some other system has control over it
If the *_identifier
value is to be stored, it should always be saved as a string type. Almost anything can be coerced into a string, and that way we guarantee that the origin system can choose whatever they want for their unique identifier.
+--------------------------------------+--------------+----
| company_identifier | company_name | …
+--------------------------------------+--------------+----
| 328129ae-df4e-4168-94d3-2572b4b343ef | Marley Spoon | …
+--------------------------------------+--------------+----
Payload
If the system exposing the data is controlled by your organisation, we can support this at the source.
{
"company_identifier": "328129ae-df4e-4168-94d3-2572b4b343ef",
"company_name": "Marley Spoon",
"backend": ["elixir", "ruby", "python"]
}
Summary
- Use
*_identifier
instead of*_id
fields for externally owned data - Prefer a string type over integer for
*_identifier
values - Avoid a 1:1-map of your persistence model to your external API
Top comments (1)
While I understand the motivation for this, I would disagree. First and foremost, we have the data size. Using a non-integer primary key will most likely increase your database size very fast and by a lot. Why? Because of foreign keys. A small table of 100 records could be serving dozens of foreign keys in tables that hold tens of millions of records. If we used a tinyint (SQL Server type for 1 byte), we could easily cover the 100 records, but if we were to use, say, a 5-letter primary key, we are now forcing foreign keys to consume 5 times more bytes. Multiply that by the number of records and you have a mega waste of storage space, which in turn puts pressure on the database server's RAM: Now less data can be stored in RAM, making queries slower.
The second one is performance around the clustered index: Indexing and ordering numbers is faster than indexing and ordering strings.
Finally, the "_id evokes number in people's mind" argument is something that is easily corrected: Don't assume a data type unless the company that provides the data has dictated a column-naming scheme that allows assumptions around the column's data type.
Where I work, for example, a suffix is mandated that most of the time directly relates to the data type. For example,
_dtm
for date/time values, orqty
for quantities (would be a real number). Personally, I hate this kind of conventions. Whoever came up with this naming convention has never coded a backend in their lives.