Growing business without data hard. It’s also hard to build data stacks that provide the right data at the right time. ELT is de-facto the standard when it comes to agile data pipelines. Using this process will empower you to move quicker and be more confident at the same time. I’ll tell you how I led the transformation of Dext’s data ingestion to a world-class level, and how you can do the same.
Like most companies, we at Dext don’t make decisions out of the thin air. We rely on A/B testing, behaviour analysis, surveys and many more. To do that, we need data. A lot of data. The whole data journey at Dext started before I joined the company, but I know the story very well. Not to mention, I’ve seen similar evolutions before.
Like most companies, we started querying limited data from a production database replica. We had graph and charts in the admin panel for the simple FAQs. Only people with technical knowledge had permissions to run on-demand reports.
Like most companies, we wanted more and more. We reached a level where reports became so slow that people learned to avoid asking questions at all. We even started hitting database performance issues, which affected our service.
Like many startups out there, we rely on AWS. That’s why it was a no brainer to build our data platform using their services.
We used S3 as a data lake (where we dumped all raw data), Redshift as a warehouse (where we placed it all nice and structured), PySpark as an ETL engine (where the transformations happened), and Data Pipelines for the scheduling. As you can see it’s a standard setup. In fact, I bet many of you have a similar data flow.
At the time I joined the whole “Data lake” as we called it, the project was in a work in progress state, but highly anticipated by the business. Spark itself appeared to be a powerful tool that served us quite well for a couple of years. We realised it is too powerful for the traditional ETL process we had. All we wanted was to extract some CSVs, database records and JSON events, mix and match them, and dump all that into a warehouse. Spark is perfectly capable of doing that, but its true power lies in data streaming, machine learning and so much more. We were trying to kill a mosquito with a bazooka.
We also had one problem related to the whole ETL process concept. We stored the raw data in S3, and every time we wanted to debug something specific or perform a back-fill, we had to search in a large amount of raw data, which was slow and expensive.
Last, but not least, finding real Spark experts in a not so hot technology appeared to be much harder than we thought it would be. At Dext, we hire only world-class talents which appeared to be a problem when it comes to a not so hot technology. We were able to extend the team a bit, but not as much as we wanted.
To summarise our use-case, we needed to store data in an easy to access way, transform it, and make it available for analysis. Our main problem, in short, was that Spark’s too powerful and complicated for that job. Not to mention how slow it is to traverse a data lake, and hard to find colleagues.
Does that sound familiar to you? Then it’s time for a change!
One day while reading random Snowflake (the warehouse we recently switched to) documentation, I found the flatten function that helps us to work with semistructured data in an elegant and easy to use way. That made it all clear!
We have already been using Fivetran, which is probably the best Extract-Load (EL) SaaS to move data from various services to Snowflake. We could use them to load raw JSON events as well.
With a large catalogue of sources Fivetran provides, and the power of Snowflake we had almost everything in place. We just needed a simple way to run our transformation queries.
Thanks to Dylan, we had already been using dbt for a while and have accumulated some good experience with it. All dbt does is enable you to transform data in your warehouse by simply writing select statements. It’s true power lies in jinja macros and the ref function. Here I want to mention that while dbt is a truly awesome tool, I think the best part is it has an outstanding community and I really can’t express how amazed I am by the folks behind it.
Honestly, if you don’t have a data platform in place, you can safely stop reading. Just sign up for a free trial at Snowflake, Fivetran and dbt Cloud and let your imagination free!
We started with a small proof of concept project, which as expected, happened to be really a quick and easy one. We swiftly put an action plan, created a Trello board for the project and started executing it table by table.
The first step was to write the model and the tests (unit, integration and schema) for it. We built the model on a staging environment and looked at the results. While most of the cases we had a one-to-one match, there were occasions, where we found differences and most of the times, those were actually bugs in our Spark code.
Then Matt from the BI team tested if there’s something that breaks the business logic, or if we have missed something. He performed most of his tests against the live production data. Again, there were some differences here and there, but most of the times, those were because of a failed pipeline runs back in the time and data we missed to back-fill.
We left the production module for the production module running for a week or so and if everything was alright we proceeded to the last step. At this stage, we were quite confident with the state of the pipeline. We had to change a single line of code. Believe it or not, but none of our “internal clients” noticed the change. We didn’t have any service downtimes or data inconsistencies.
It looks like a lot of work, but Kami and I managed to go through the whole process in a few months. We had a completely new process, that was not just much easier and faster, but also, more reliable in every aspect.
Obviously, not everything is flowers and roses. The price we had to pay for all these services was increasing every week, for the first few months. Although, paying for services is much cheaper than paying for people, we were not happy with our Snowflake expenses. We spent a few more weeks on optimisations (this one is another interesting story), and now we pay even less than what we were paying in the beginning. The most significant win is not the price, but the fact that anyone with basic SQL knowledge can now understand our process and contribute to the project. If we can do it, you can do it. 😉
The low storage costs combined with the flexibility, power and popularity of SQL makes ELT the way to go for most data engineering processes nowadays. ELT allows data teams to incorporate modern software engineering practices - Agile practices, version control, working in teams, CI/CD and everything else you have used to.
Is ELT a good process for your use-case? What problems do you have with your data stack? Share your thoughts in the comments below, or ping me.