In my previous position, I was entrusted with the fun task of handling the ETL processes for a client. For days on end my existence revolved around extracting, transforming and loading data from one source to another. The repetition of doing this every day, along with the monotony of mass data extraction, left me exhausted, somewhat frustrated and, quite frankly, bored out of my skull. I had to learn how to implement the ETL process from scratch, and in those early phases I had very few resources available to implement automated testing.
As with many things, after the first iteration of building and manually testing my ETL process, I thought and hoped it was perfect. It was not. I ran into exactly the problems a tester most commonly encounters: some records didn’t get loaded, some were malformed or truncated, some were duplicated, and others had invalid types or values or were transformed incorrectly, to name a few. Identifying these problems took forever, as the fault could lie either in the data or in the ETL process itself. A fix had to be built and tested again before deploying to production. This was a major interruption to data flows and required extensive rewrites, refactoring and redeployment of the entire ETL infrastructure for every change.
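Several of those failure modes can be caught automatically with simple batch checks. The sketch below is a minimal illustration in plain Python, not the process I actually used; the record fields (`id`, `name`) and the column-length rule are hypothetical:

```python
# Minimal data-quality checks on a loaded batch (hypothetical schema:
# each record has an integer "id" and a "name" of at most 10 characters).
def validate_batch(source_rows, loaded_rows, max_name_len=10):
    issues = []
    # Completeness: did every source record make it into the target?
    if len(loaded_rows) < len(source_rows):
        issues.append(f"missing records: {len(source_rows) - len(loaded_rows)}")
    # Duplicates: the same id loaded more than once.
    ids = [r["id"] for r in loaded_rows]
    if len(ids) != len(set(ids)):
        issues.append("duplicate ids found")
    for r in loaded_rows:
        # Type check: id must be an integer.
        if not isinstance(r["id"], int):
            issues.append(f"invalid id type: {r['id']!r}")
        # Truncation/malformation check: name must fit the column limit.
        if len(r["name"]) > max_name_len:
            issues.append(f"name too long (truncated on load?): {r['name']!r}")
    return issues

source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
loaded = [{"id": 1, "name": "alice"}, {"id": 1, "name": "alice"}]
print(validate_batch(source, loaded))  # → ['duplicate ids found']
```

Even checks this crude, run after every load, would have turned hours of manual reconciliation into a single report.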
The reason to run automated or manual tests on ETL processes is to ensure consistency and integrity, prevent regressions, and detect faults (find more details on ETL testing here). Even with an automated data pipeline and automated ETL tools you still have to perform testing and validation, so they do not give you complete end-to-end automation. In my experience there was a cycle in the testing process I had to follow manually: prepare the infrastructure environment and the data, run the ETL code, check the quality of the data pipeline, and then debug by hand. Only once my suspicions were resolved and the input matched the output was the ETL process complete and moved into production. This was done under two scopes:
1) Non-functional testing, which covers performance tuning, load, and fault tolerance in the face of “dirty data”, and
2) Functional testing, needed for data preparation and problem resolution. This includes unit or component tests, integration tests, and end-to-end tests, for example.
The ETL testing cycle can be time consuming. Each stage requires a different strategy or type of testing, and the choice also depends on client requirements or organizational standards. I was not able to use automated ETL testing, although I admittedly did not know enough about it at that point. To start with, it would probably have taken me another 18 months to figure out how to use it and get all the data ready. It simply would not have been cost or time effective.
ETL relevance today
Looking back, I can hardly believe how accustomed I am today to having real-time data at my fingertips. However, this is not yet the case with traditional ETL processing. We are still some way from delivering a similar “real-time” data-driven solution without ETL processes. What the ETL process represents in its essence is very relevant today and will continue to be. The more a company scales and the bigger the data it can acquire, the more the need expands for better ways to aggregate information and run the right transformations. The end goal of ETL is ultimately to enable a company to make the best data-driven business decisions, and ETL does that by extracting and analyzing the required information.
Traditional ETL processing and testing is a time consuming task, but the value of the data acquired is immense. The tools and paradigms used by the current market leaders of ETL, on the other hand, should be questioned, seeing as the way they enforce implementation is archaic in nature. This approach should be re-thought, re-evaluated and exposed to the same kind of rigorous overhaul of practices that, for example, blockchain technology brought to the way we think about ledgers. Maybe then I would be more willing to undertake such a mammoth task again.