I have worked in the industry since 2014, first as a software developer and from 2018 on as a data engineer. Especially nowadays I have to work a lot with datetimes, and have come to realize that they are one of the hardest things in my job. I thought I'd write a series about datetimes from a data engineer's point of view. This is the first part of the series.
As a data engineer, one of my responsibilities is to build data pipelines. Multiple internal and external systems are producing data that we want to import and use. Since we cannot control the source systems and how they have designed their systems, we need to adjust to the decisions they have made. This means we need to, almost always, do some transformations to the data and the datetimes in it, before inserting the data into a database.
There is a standard, called ISO 8601, which standardizes presenting date and time. It uses the Gregorian calendar and 24-hour clock and the purpose of the standard is that the dates and times are unambiguous. For example, 01/11/19 can mean either November 1st, 2020, or January 11th, 2020, so it is good to have a standard. I'm happy that I haven't met datetimes that are not following the standard, since it would make all of it even harder. This post doesn't aim to discuss the complete standard, but rather the formats I've seen in my everyday work.
Although there is a standard, it doesn't mean that handling datetimes would be simple. Although ISO 8601 defines a standard, there are still multiple formats available to use. Basically in the standard Y stands for a year, M stands for month and D stands for day. This means that YYYY-MM-DD is a datetime with a four-digit year, two-digit month, and two-digit day of the month. An example of such date could be for example 2021-01-20.
In the standard h stands for hour, m stands for minute and s stands for second. Using those, time can be represented for example as hh:mm:ss. This is a simple case. There might be fractions of a second, too. When there are fractions of a second, they are marked with s. They are separated from seconds with comma or dot, so the format becomes hh:mm:ss.sss. This is not how the time is always presented, though. The separator : is optional, so it could be hhmmss as well. Here's a table to summarize what I just explained:
|D||Day of the month|
|s||Fractions of a second|
Usually, datetimes also include timezone information. ISO 8601 represents timezone as local time, UTC, or offset from UTC. The simplest case is UTC (Coordinated Universal Time). In this case it is represented as Z at the end of timeformat. An example of this is 23:59:59Z.
Then we have an offset from UTC. UTC is the same time as GMT (Greenwich Mean Time) which means it has a value of 0. The offset can be either positive or negative, meaning hours and minutes less or more than in zero timezone. The most common ways I've seen it is hh:mm or just hh. In this case, an example of time with offset is 23:59:59+02 or 23:59:59+02:30. Without : separator it could be 235959+0230.
The most tricky case is the one that doesn't have any timezone information. The time can then be assumed to be a local time, but it's a big assumption, and sometimes it is not clear what the local time is in the source system.
Finally, we can put the date and time information together. Usually separator T is used to separate date and time. In the older version of ISO 8601, it was permitted to omit T, but the decision was retracted later. I have seen both used, though. So to put everything together, these are few examples of how the source system can represent datetime:
|WITH SEPARATOR||WITHOUT SEPARATOR|
So this was the first part of the series. After you've identified the date format that incoming data uses, you can parse it using a programming language of your choice. In the next part I'm going to show how to parse and manipulate datetimes with Clojure.