loading...

Why we chose Apache Spark for ETL (Extract-Transform-Load)

sumitkumar1209 profile image Sumit Kumar Originally published at Medium on ・6 min read

Credit: Undraw

Even before I joined Postman, my colleagues here were dealing at a scale of handling 6 million users all around the world. The amount of data which needed to be processed to get some meaningful insight was huge. Here is a taste of its scale:

  1. 100+ million requests per day spread across 30+ micro-services
  2. Resting around 10 TB of data
  3. Ingesting around 1TB of monthly internal service logs
  4. 100k+ peak concurrent web socket connections

Back Story:

Some months were in for me at Postman. I got assigned to a project which needs to handle millions of rows in service logs. These spans over GBs of data.

Without a doubt, we could have processed these logs in vanilla code and maybe use some libraries too. But this would incur more cost for operating and maintaining the code base. Moreover, libraries which are not tested rigorously increases the surface area for bugs. All these added libraries and infrastructure results in increased human hours. Then we decided to look for something else.

Idea:

With prior experience in distributed systems, the team and I knew of its advantage and limitations. Keeping that in mind the next step for us was to look for solutions in distributed processing. And don’t forget the Map-Reduce functionality.

Postman believes in a philosophy that human hours are the most valuable resources. We always strive for a managed solution as much as possible. Maintained solution handles upgrades of software and hardware by itself (3rd party). We need to focus only on logic not anything else around it.

We were at AWS Community Day 2018, Bengaluru. If you want to check out the photos, visit here.

For the uninitiated What is Apache Spark?

The Debate:

With the above philosophy in mind we primarily wanted to optimize the following:

i. Time for Development

So which method is faster?

Developing any project in a vanilla code of any programming language

OR

using libraries available in the Apache Spark ecosystem.

An advantage of vanilla code is you are familiar with basic concepts. You could have skills and tricks to do something in a way which makes development faster for you. The caveat here is could you propagate the same learnings to your team members? You might want to have better control over what you want to achieve. Could you be sure that you can transfer the same knowledge and need for control to other developers? A blunt question for you to ask, do they actually need it?

One can argue about a similar set of knowledge in the ecosystem of tools. So then what makes Apache Spark ecosystem or any other tools ecosystem weigh much heavier than vanilla alternatives?

According to my viewpoint

  1. support of a community,
  2. better documentation of any methodology,
  3. system paradigms
  4. and the added advantage of potential chances of learning from other people mistakes

P.S.: With Apache Spark being an open source tool you also don’t lose your control.

ii. Modularisation

In my early years of programming one of my colleague beautifully put a thought in my mind. I can say this actually transformed the way, I write code. I won’t put any effort into explaining this, casually putting it here.

“Code is like poetry.”

I can not put more emphasis on how difficult it is to modularise any code. Also then maintain the same if you are not the only developer on the project. A single developer project enables you to put your thoughts (read opinions) on how you structure the code. It might be good or maybe not. The real test of skills happens when there are more than one contributors to the project. Your modularisation should be consumable without you explaining to each one of them.

With tools such as Apache Spark, the contributed code is very small. Why? Because all the nitty-gritty of core functionality is hidden away in library code. What you have is a very simple small liner of the ETL process.

For e.g. A small but complete ETL process could be summarised as

spark.read.parquet("/source/path) # Extract
    .filter(...).agg(...) # Transform
    .write.parquet("/destination/path") # Load

There is a lot of support for simple readers and writers from the Apache Spark community. This enables you to easily modularise your code and also stick to one paradigm. So what do you get finally? You get a beautiful structure code, which everyone can understand and contribute to.

P.S.: You can always extend the default readers and writers to perfectly match with your requirements.

iii. Maintainability

I have a very strong belief in the power of community. For me, community boils down to something like a league of superheroes fighting for a common goal. Here is a quote to put my thought process in very simple words.

Two is better than one if two act as one.

Why I believe in that is because if one fails, the other one will help him. There can be a debate using this proverb.

Too many cooks spoil the broth

  • Proverb

which means that if there are many people involved in doing the same thing, the final result will not be good. But this happens when things happen behind closed doors in a kitchen.

Open source community has solved this problem very beautifully and proved its mettle. Quality and quantity of open source projects and community are proofs for it. Apache Spark is one of them. This means that its quality, maintainability, and ease of use is much better. At least better than few chefs trying to build vanilla products behind closed doors. No shit, Sherlock!

P.S.: Active open source community means there are a lot of people already maintaining and actually doing the work for you.

iv. Time to Production

Time to development is one thing and actually deploying the code in production is another. Most of the times projects stay in “PoC” mode and never come out of it. I have seen some companies use the same “PoC” in production server. These “PoC”s don’t give much thought on whether they will be able to handle the current traffic and rate at which load increases.

Inherent nature of distributed systems such as Apache Spark makes supporting large scale a cakewalk. These systems are designed solely for the very same reason “to handle scale”.

While building a vanilla system most of the times it is designed to handle current load presuming it won’t increase. This might be true in an ideal world, but we don’t live in one.

P.S.: While I try to vertically scale most of the times but I have done horizontal scaling as well, which is buttery smooth in distributed systems.

v. After production caveats

Two words

Monitoring and Observability.

I won’t go into details of these terms as there is a lot of content around it already. For production deployments, you might hope to “do it and forget it” in an ideal world but again we don’t live in it. Deploying anything in production brings a lot of itsy-bitsy or larger problems. Systems in production should be continuously monitored and be designed for observability. Don’t forget the alerts too which could subside under monitoring. Apache Spark with its web UI and added support from AWS makes it a much better alternative than building custom solutions in vanilla code.

So do you actually want to reinvent the wheel?

P.S.: Probably you don’t. I am not judging.

Conclusion:

To conclude all my blabbering on top, here is a TL;DR version on why we chose to use Apache Spark for ETL

  1. AWS Support (Our primary requirement)
  2. Distributed System (To handle scalability)
  3. Open source (Control)
  4. Community power (Superpower)
  5. Documentation (Ease of on-boarding)

I usually write about Data. Find more posts from me on medium or on the dev.to community.


Posted on by:

sumitkumar1209 profile

Sumit Kumar

@sumitkumar1209

Creative Developer & Designer. I love everything related to Web design & development and Graphic design.

Discussion

pic
Editor guide