
Karel Vanden Bussche for Lighthouse


Why we don’t use Spark

Big Data & Spark

Most people working in big data know Spark (if you don't, check out their website) as the standard tool to Extract, Transform & Load (ETL) their heaps of data. Spark, the successor to Hadoop MapReduce, works a lot like Pandas, the data science package: you run operators over collections of data. These operators return new data collections, which lets you chain operators in a functional style while keeping scalability in mind.
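To make that operator-chaining style concrete, here is a minimal, hypothetical PySpark sketch; the dataset, column names and bucket paths are made up for illustration.

```python
# A minimal, hypothetical PySpark sketch of the chained-operator style.
# Dataset, column names and paths are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rate-example").getOrCreate()

bookings = spark.read.parquet("gs://example-bucket/bookings/")

avg_rates = (
    bookings
    .filter(F.col("status") == "confirmed")                # each operator returns a new DataFrame,
    .withColumn("rate", F.col("total") / F.col("nights"))  # so transformations chain functionally
    .groupBy("hotel_id", "stay_date")
    .agg(F.avg("rate").alias("avg_rate"))
)

avg_rates.write.mode("overwrite").parquet("gs://example-bucket/avg-rates/")
```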

For most data engineers, Spark is the go-to when massive scale is required, thanks to its multi-language support, its ease of distributed computing and its ability to handle both streaming and batch workloads. Its many integrations with different persistent storage systems, infrastructure definitions and analytics tools make it a great solution for most companies.

Even though it has all these benefits, it is still not the holy grail, especially if your business is built on crunching data 24/7.

At OTA Insight, our critical view on infrastructure led us down a different route, one focused on our needs as a company from a technical perspective, a people perspective and a long-term vision.

Humble beginnings

Early on you have only one focus: building a product that solves a need and that people want to pay for, as quickly as possible. This means that spending money on things that accelerate you towards this goal is a good thing.

In the context of this article, this means you don't want to spend time managing your own servers or fine-tuning your data pipeline's efficiency. You want to focus on making it work.

Specifically, we rely heavily on managed services from our cloud provider, Google Cloud Platform (GCP), hosting our data in managed databases like Bigtable and Spanner. For data transformations, we initially relied heavily on Dataproc, Google's managed service for running a Spark cluster.

Managing managed services

Our first implementation was a self-hosted Spark setup, paired with a Kafka service that held our job queue. This had clear downsides, and in hindsight we don't consider it managed at all. A lot of side development was needed to cover all the edge cases of the deployment and its scaling: things like networking, node failures and concurrency had to be investigated, mitigated and modelled, which would have put a heavy strain on our development efficiency. Secondly, the price of running a full Spark cluster at 100% uptime was quite high, and creating auto-scaling strategies for it was quite hard. Our second implementation kept the same Kafka event stream, but streamed the workload messages into managed Dataproc instances instead of the self-hosted Spark cluster.

The Kafka-Dataproc combination served us well for some time, until GCP released its own message queue: Google Cloud Pub/Sub. At the time, we investigated the value of switching. There is always an inherent switching cost, but what we had underestimated with Kafka was the substantial overhead of maintaining the system, especially when the ingested data volume grows rapidly. As an example: Kafka requires you to shard the data streams manually, while a managed service like Pub/Sub does the (re)sharding behind the scenes. Pub/Sub had some downsides of its own, e.g. it didn't allow for longer-term data retention, but that is easily worked around by storing the data on Cloud Storage after processing. Persisting the data there and keeping logs of the interesting messages made Kafka obsolete for our use case.
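As a rough sketch of that retention workaround, assuming the google-cloud-pubsub and google-cloud-storage client libraries: each message is archived to Cloud Storage right after processing, so the message bus itself no longer needs long-term retention. The project, subscription and bucket names are placeholders, and process() stands in for the real business logic.

```python
# Rough sketch: archive each processed Pub/Sub message to Cloud Storage so it
# can be replayed later. Names below are placeholders, not our real resources.
import json
from google.cloud import pubsub_v1, storage

bucket = storage.Client().bucket("example-archive-bucket")

def process(payload: bytes) -> None:
    record = json.loads(payload)          # placeholder for the actual processing step
    print("processed", record.get("id"))

def handle(message):
    process(message.data)
    # Persist the raw payload keyed by message id, removing the need for
    # long-term retention on the message bus itself.
    bucket.blob(f"messages/{message.message_id}.json").upload_from_string(message.data)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("example-project", "workloads-sub")
streaming_pull = subscriber.subscribe(subscription, callback=handle)
streaming_pull.result()  # block, pulling and archiving messages
```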

Now that we no longer ran a Kafka service, we found that Dataproc paired with Pub/Sub was also less effective than the alternatives. After researching the options for our types of workloads, we chose to go a different route. It is not that Dataproc was bad for our use cases, but it had some clear downsides, and further analysis showed us there were better options.

First, Dataproc at the time had scaling issues, as it was mainly focused on batch jobs while our main pipelines all ran on streaming data. The introduction of Spark Streaming alleviated this somewhat, though not fully for our case. Spark Streaming still works in a (micro-)batched way under the hood, which is required to honour the exactly-once delivery pattern. That causes problems for workloads with non-uniform running times. Our processors need fully real-time streaming, and they can do without exactly-once delivery because our services are idempotent.
Secondly, the product was not very stable at the time, meaning we had to monitor it quite closely and spend quite some time on alerts. Lastly, most of our orchestration and scheduling was done by custom-written components, which were hard to maintain and hard to update to newer versions.
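That idempotency is what lets us drop the exactly-once requirement: if every write is a deterministic upsert, a redelivered message simply overwrites the same row. The sketch below is illustrative only; the key scheme, the field names and the table.upsert() call are hypothetical, not our actual schema or storage API.

```python
# Illustrative only: deterministic keys plus upserts make redelivery harmless,
# so at-least-once delivery from the message bus is good enough.
import hashlib
import json

def row_key(event: dict) -> str:
    # The same event always maps to the same key.
    raw = f"{event['hotel_id']}|{event['stay_date']}|{event['source']}"
    return hashlib.sha256(raw.encode()).hexdigest()

def handle_event(payload: bytes, table) -> None:
    event = json.loads(payload)
    # Upsert: processing the same message twice leaves the table unchanged.
    table.upsert(row_key(event), event)  # `table` is a stand-in for any keyed store
```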

Building for the future

It was clear we needed something built specifically for our big-data SaaS requirements. Dataflow was our first idea, as the service is fully managed, highly scalable, fairly reliable and has a unified model for streaming and batch workloads. Sadly, the cost of the service was quite high. Secondly, at that moment in time the service only accepted Java implementations, of which we had little knowledge within the team. This would have been a major bottleneck for developing new types of jobs, as we would either need to hire the right people or invest the effort to dive deeper into Java. Finally, most of our data-point processing happens in our API, so many of the benefits did not weigh up against the disadvantages. Small spoiler: we didn't choose Dataflow as our main processor. We still use Dataflow within the company today, but only for fairly specific and limited jobs that require very high scalability.

None of the services we researched was an exact match; each lacked something that was a hard requirement for scaling the engineering effort at the pace the company was, and still is, growing. At this point we had reached product-market fit and were ready to invest in building the pipelines of the future. Our requirements were mainly to keep development efficiency high and the structure open enough for new flows to be added, while also keeping running costs low.

As our core business is software, keeping an eye on how many resources that software burns through is a necessity. Taking the cost of running your software on servers into account can make the difference between a profit and a loss, and this balance can change very quickly. We have processes in place to keep our bare-metal waste as low as possible without hindering new developments, which in turn gives us ways to optimise our bottom line. Being good custodians of resources helps us keep our profit margins high on the software we provide.

After investigating the pricing of different services and machine types, we had a fairly good idea of how to combine services to strike the right balance between maintainability and running costs. At this point, we decided to build the majority of our pipelines on a combination of Cloud Pub/Sub and Kubernetes containers. Sometimes, the best solution is the simplest.

The reasoning behind Kubernetes was quite simple. It had been around for a couple of years and already hosted most of our backend microservices as well as our frontend apps. As such, we had extensive knowledge of how to automate most of the manual management away from the engineers and into Kubernetes and our CI/CD. Secondly, because other services already ran on Kubernetes, this knowledge transferred quickly to the pipelines, giving us a unified landscape across our different workloads. Ease of scaling is Kubernetes' main selling point; pair it with the managed autoscaling that Google Kubernetes Engine provides and you have a match made in heaven.

[Chart: cost-effectiveness of the same machine type on Kubernetes versus a more managed service]

It might come as a surprise, but bare-metal Kubernetes containers are quite cheap on most cloud platforms, especially if your nodes can be preemptible. Because all our data sits in persistent storage or in message queues between pipeline steps, our workloads can be killed at any time and we still keep a consistent state. Combine the cost of Kubernetes with the very low cost of Pub/Sub as a message bus, and we have our winner.
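To sketch why preemption is safe in this setup (assuming a Pub/Sub-driven worker, with placeholder names and a stand-in do_work() step): the worker only acks a message after its durable write has succeeded, and on SIGTERM, which Kubernetes sends before evicting a pod, it stops pulling so anything in flight is simply redelivered to another pod.

```python
# Placeholder worker showing preemption-friendly shutdown with Pub/Sub.
import signal
import threading
from google.cloud import pubsub_v1

stop = threading.Event()

def do_work(payload: bytes) -> None:
    ...  # stand-in for the real processing; results go to persistent storage

def handle(message):
    do_work(message.data)
    message.ack()  # ack only after the durable write, so a killed pod just causes a redelivery

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("example-project", "workloads-sub")
streaming_pull = subscriber.subscribe(subscription, callback=handle)

# Kubernetes sends SIGTERM before evicting a pod (e.g. on node preemption).
signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())

stop.wait()              # block until asked to shut down
streaming_pull.cancel()  # stop pulling; un-acked messages are redelivered elsewhere
streaming_pull.result()  # block until the shutdown is complete
```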

Building around simplicity

Both Kubernetes and Pub/Sub are quite barebones, without a lot of bells and whistles to empower developers. As such, we needed a simple framework to build new pipelines fast. We dedicated some engineering effort to building this pipeline framework at the right level of abstraction, where a pipeline has an input, a processor and an output. With this simple framework, we've been able to build the entire OTA Insight platform at a rapid pace, without constraining ourselves to the boundaries of certain services or frameworks.
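To give an idea of what that abstraction looks like, here is a minimal sketch; the real framework is internal, so the class and method names below are purely illustrative.

```python
# Minimal, illustrative sketch of an input -> processor -> output pipeline.
from abc import ABC, abstractmethod
from typing import Iterable

class Input(ABC):
    @abstractmethod
    def messages(self) -> Iterable[bytes]:
        """Yield raw messages, e.g. from a Pub/Sub subscription."""

class Processor(ABC):
    @abstractmethod
    def process(self, payload: bytes) -> dict:
        """Turn one raw message into one output record."""

class Output(ABC):
    @abstractmethod
    def write(self, record: dict) -> None:
        """Persist a processed record, e.g. to Bigtable or Cloud Storage."""

class Pipeline:
    def __init__(self, source: Input, processor: Processor, sink: Output):
        self.source, self.processor, self.sink = source, processor, sink

    def run(self) -> None:
        # A new pipeline only has to supply these three pieces.
        for payload in self.source.messages():
            self.sink.write(self.processor.process(payload))
```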

Secondly, as most of our product-level aggregations are done in our Go APIs, which are optimised for speed and concurrency, we can replace Spark with our own business logic calculated on the fly. This helps us move fast within that business logic and keeps our ingestion simple. Together, the framework and the aggregations in our APIs create an environment where Spark becomes unnecessary and the complexity of business logic is spread evenly across teams.

Summary

During our growth path, from our initial Spark environment (Dataproc) to our own custom pipelines, we've learned a lot about costs, engineering effort, engineering experience, and the limits of growth and scale.
Spark is a great tool for many big data applications and deserves to be the most common name in data engineering, but we found it limiting in our day-to-day development as well as financially. It also did not fit entirely into the architecture we envisioned for our business.
Today we know and own our pipelines in a way no framework could ever provide. This has led to rapid growth in new pipelines, new integrations and more data ingestion than ever before, without lying awake at night wondering whether one more integration would be one too many.

All in all, we are glad we took the time to investigate the entire domain of services, and we encourage others to be critical in choosing their infrastructure and to align it with their business requirements, as it can make or break your software solution, either now or when scaling.

Want to know more? Come talk to us!
