John Lafleur for Airbyte

How We Can Commoditize Data Integration Pipelines

Most engineers in their professional life will have to deal with data integrations. In the past few years, a few companies such as Fivetran and StitchData have emerged for batch-based integrations, and Segment for event-based ones. But none of these companies have solved the problem of data integrations, which becomes more and more complex with the growing number of B2B tools that companies use.

We don’t think they will ever be able to solve the data integration problem. You might think this is because they are cloud-based and closed source. But fundamentally, we think it boils down to the fact that they don’t aspire for data integrations to become a commodity.

However, if you ask engineers, most of them believe data integration pipelines will become a commodity within the next 5 years. And that is our vision at Airbyte.

But before we tell you how we intend to commoditize data integration pipelines, let’s review the limitations of current offers.

Why is data integration not yet commoditized?

1. Limited number of pre-built maintenance-free connectors

When you are closed-source, you need to build and maintain all the integration connectors by yourself. The issue is that this is A LOT of work. It took Fivetran 6 years to reach 150 connectors, each of which they must maintain every day. And when you consider there are 5,000+ tools in the martech industry alone, you understand they will never be able to cover the long tail by themselves.

So what happens? Since we started working on this project, we have interviewed 40 different companies, including many Fivetran and StitchData customers. A large majority of them had to build some integrations themselves (possibly with Airflow) to cover the connectors they needed that those tools did not support.

In the end, you still have data engineering teams spending a lot of time building and maintaining integration pipelines, when their expertise would be better leveraged elsewhere.
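
To make that concrete, a hand-rolled pipeline for a single missing connector often ends up as a small Airflow DAG like the sketch below, which the team then has to own and maintain. The vendor API endpoint, table name, and loader function are all hypothetical, just to illustrate the kind of code this pushes onto data teams.

```python
# Sketch of a hand-built pipeline for a connector no vendor supports.
# The API endpoint, table name, and loader are hypothetical placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def load_to_warehouse(table, rows):
    # Placeholder for whatever loader the team already wrote (COPY, INSERT, etc.).
    print(f"loading {len(rows)} rows into {table}")


def pull_invoices():
    """Fetch recent invoices from a vendor API and load them into the warehouse."""
    resp = requests.get(
        "https://api.example-vendor.com/v1/invoices",
        headers={"Authorization": "Bearer <token>"},
        timeout=30,
    )
    resp.raise_for_status()
    load_to_warehouse("raw.vendor_invoices", resp.json()["data"])


with DAG(
    dag_id="vendor_invoices_sync",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="pull_invoices", python_callable=pull_invoices)
```

Every schema change on the vendor side means editing and redeploying code like this by hand, which is exactly the maintenance burden described above.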

2. Pricing of cloud-based solutions indexed on volume

Another significant issue with the existing solutions is that their pricing is indexed on the volume of data transferred. Because of this, teams need to be careful about how they use the connectors. It is super frustrating to have a solution for your integration needs but to be unable to use it the way you want because of pricing considerations. This is the opposite of what a commodity is supposed to be.

3. Data security and privacy in the Enterprise world

There are 2 things you can be sure will keep growing over the next decades.

  • Companies will leverage more and more data.
  • Companies will need to pay more and more attention to data security and privacy, especially in the enterprise.

A lot of enterprises have already stopped using 3rd-party cloud-based solutions for security reasons. Those that still do will require a lengthy security and privacy compliance process that lasts at least 4 months. This cripples internal teams and keeps them from moving forward. A commodity should be easily accessible, and this is currently not the case.

What does commoditizing data integrations mean?

Here is the world we envision at Airbyte in 5 years:

  • The long tail of connectors should be largely addressed.
  • It should be super easy to build a new connector.
  • There should be built-in scheduling, orchestration, and monitoring for all used connectors.
  • There should be an auto-upgrading mechanism for connectors, so they are maintenance-free for data engineering teams.
  • Connectors should run directly in your own cloud, giving companies full control over their data. They become an extension of your own data infrastructure, with portability superpowers.
  • There shouldn’t be any cost indexed on the volume of data transferred through the connectors (apart from CPU and egress).

Only then can we consider data integration pipelines a solved problem and a commodity.

How can we achieve this vision?

1. Open-sourcing all data integration pipelines with an MIT license

As mentioned before, data integration pipelines entail a lot of maintenance work: every tool will see its schemas change once in a while. The only way to cover the long tail of connectors is to have a large community of maintainers. But unless you work for a company whose product is those integration pipelines, you only maintain what you use. That’s why the only way to get there is to open-source those connectors under an MIT license, for the greater good.

2. Making building new integrations trivial

If building a new integration with this open-source project weren’t much simpler than building it on the side by yourself, the project would have a harder time finding contributors, and the vision would be flawed.

That’s why we are focused on making it trivial to build new integrations. Fortunately, our team has 23 years of accumulated experience building data integration pipelines, processing more than 100TB of data every day through more than 1,000 integrations. So we know how to build a level of abstraction that will make things easier.
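
To give a sense of what that abstraction could look like, here is a rough, illustrative sketch of a connector interface where a contributor only declares how to check credentials, which streams exist, and how to read records, while the platform takes care of everything else. The class and method names are hypothetical, not the final Airbyte spec.

```python
# Illustrative sketch of a minimal source-connector interface.
# The names (Source, check, discover, read) are for illustration only.
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List


class Source(ABC):
    """What a contributor writes: credential check, stream discovery, record reads."""

    @abstractmethod
    def check(self, config: Dict[str, Any]) -> bool:
        """Return True if the provided credentials/config can reach the API."""

    @abstractmethod
    def discover(self, config: Dict[str, Any]) -> List[str]:
        """Return the list of streams (tables/objects) this source exposes."""

    @abstractmethod
    def read(self, config: Dict[str, Any], stream: str) -> Iterator[Dict[str, Any]]:
        """Yield records for one stream; the platform handles scheduling and loading."""


class StripeSource(Source):
    """Hypothetical example: only the API-specific logic lives in the connector."""

    def check(self, config):
        return bool(config.get("api_key"))

    def discover(self, config):
        return ["charges", "customers", "invoices"]

    def read(self, config, stream):
        # Real code would paginate through the Stripe API here; yielding
        # plain dicts keeps the connector free of warehouse concerns.
        yield {"stream": stream, "example": True}
```

The important design choice is that a connector never touches warehouses, schedules, or retries; it only speaks the source’s API.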

And it goes without saying that Airbyte will be providing scheduling and orchestration automatically for your new integrations. Indeed, these 2 things are essential to most teams.

3. Built-in scheduling, orchestration, monitoring, and upgrading

In addition to scheduling and orchestration, there are 2 other things we need to provide so connectors are well-maintained throughout the repository: monitoring and a great upgrading experience.

Our monitoring needs to give you detailed logs of any error during data replication, so that you can easily debug it yourself or report an issue to the community for other contributors to solve for you.

Given that schemas will keep changing across all these tools, teams will need to upgrade to the latest version of the repository fairly often to make sure they get the updated schemas.
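
To illustrate where that boundary sits, below is a hypothetical sketch (not Airbyte’s actual scheduler) of the platform-side wrapper around a sync: retries and structured error logs live in the platform, so neither connector authors nor users have to rebuild them.

```python
# Hypothetical sketch of the platform-side wrapper around one sync run:
# retries and detailed logs live here, not in each connector.
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("sync")


def run_sync(read_records, load_records, max_retries=3):
    """Run one scheduled sync attempt with retries and detailed error logs."""
    for attempt in range(1, max_retries + 1):
        try:
            count = 0
            for record in read_records():   # the connector's job
                load_records(record)         # the platform's job
                count += 1
            logger.info("sync succeeded: %d records replicated", count)
            return True
        except Exception:
            # Detailed logs with tracebacks are what let users debug locally
            # or file a reproducible issue for the community.
            logger.exception("sync attempt %d/%d failed", attempt, max_retries)
            time.sleep(2 ** attempt)
    return False


# Example usage with toy callables:
# run_sync(lambda: iter([{"id": 1}]), print)
```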

4. Expanding to all types of integrations

At the beginning, Airbyte will be focused on batch data replication from 3rd-party tools and databases to warehouses. But nothing prevents us from expanding, in the near future, to data syncing that uses warehouses as sources for other destinations. For instance, your marketing team might want to send consolidated data back to your ad platforms so they can better optimize their campaigns. Another use case could be syncing the consolidated data back to your CRM.
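
As a rough sketch of that reverse direction, with the warehouse acting as the source and a CRM as the destination; the SQL table, CRM endpoint, and field names below are made up for illustration:

```python
# Hypothetical sketch of a "warehouse as source" sync: read consolidated rows
# and push them back to a CRM API. Table name and endpoint are made up.
import requests


def sync_warehouse_to_crm(warehouse_conn, crm_token):
    """Push consolidated customer metrics from the warehouse back into a CRM."""
    cursor = warehouse_conn.cursor()
    cursor.execute("SELECT email, lifetime_value FROM analytics.customer_summary")
    for email, lifetime_value in cursor.fetchall():
        resp = requests.patch(
            "https://api.example-crm.com/v1/contacts",
            json={"email": email, "lifetime_value": lifetime_value},
            headers={"Authorization": f"Bearer {crm_token}"},
            timeout=30,
        )
        resp.raise_for_status()
```

The point is that the same connector machinery works in both directions; only the source and destination roles swap.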

And later on, we could address event-based data integrations, à la Segment. Indeed, the technology will be very close to the connectors we will have already built with the community. This would give companies full control over their data in an effortless way.

5. Enabling other data engineering work – transformation, etc.

Being open source enables us to go faster and deeper. Compare GitLab to GitHub, for instance: GitLab was able to cover a lot more of the value chain. We have that in mind with Airbyte as well. For instance, we are often asked what we will provide in terms of data transformation.

6. Fulfilling the enterprise requirements with privacy compliance and role management

Last but not least, Airbyte will need to address the requirements of enterprises, too. This includes:

  • security and privacy compliance features
  • data quality monitoring features
  • role and user access management, SSO

Without these, it will be harder for enterprises to embrace the open-source technology. This is the part that we think we will sell as a source-available enterprise edition.

What’s Airbyte’s timeline on this?

We’re just getting started on our vision today. We will focus the next year on nailing batch data replication to warehouses. We hope to have at least 50 connectors by the end of 2020, and to be on par with Fivetran by Q2 2021. But we can only achieve that with the help of the community.

Right now, you can download our project and self-host it. We built a UI to allow anyone to define their connections and start consolidating data in minutes, just like Fivetran and StitchData.

In order to accelerate the release, we built our own scheduler so you can get started quickly on a single host. We will very soon integrate with Airflow and Kubernetes so you can dispatch sync tasks across your cluster.

Today, our MVP supports BigQuery and Stripe (we wanted to launch fast and get community feedback as early as possible). We will add many more sources and destinations in the coming weeks.

Give it a spin: https://github.com/airbytehq/airbyte/. Let us know what you think, and don’t hesitate to star the project if you like our vision to commoditize data integration pipelines!
