John Lafleur for Airbyte

Posted on Oct 11, 2020 • Updated on Oct 13, 2020 • Originally published at airbyte.io

Solving Data Integration: The Pros and Cons of Open Source and Commercial Software

#opensource #database #datascience #devops

There was an awesome debate on DBT’s Slack last week discussing mainly about 2 things:

The state of open-source alternatives to Fivetran
Whether an open-source (OSS) approach is more relevant than a commercial software approach in addressing the data integration problem.

If you’re already on DBT’s Slack, here is the thread’s URL. Even Fivetran’s CEO was involved in the debate.

In this article, we want to discuss the second point and go over the different points mentioned by each party. The first point will come in another article. It’s more relevant to discuss whether an OSS approach makes sense before drilling down into the different alternatives.

We’ll go over the main challenges that companies face and see which approach fits best. We’ll call “commercial companies” the ones with a commercial software product, and “OSS companies” the ones with an open-source approach.

TL;DR

Here is a table summarizing the points mentioned in the debate.

To better understand this table, we invite you to read the list of challenges each approach faces below.

1. Having a large number of high-quality well-maintained pre-built connectors

This might be the most challenging part for the open-source approach, but there are actually choices that can make an OSS approach even stronger than a commercial one on that matter.

Commercial approach

In this case, a company supports a limited number of connectors (the most used ones) and actively maintains them. They know when there is a schema change and when they need to update the connector, and can be pretty responsive on that, if the organization is responsive and scales well.

However, the more connectors, the more difficult it is for a commercial company to keep the same level of maintenance across all connectors. In an ideal world, the organization will grow linearly with the number of connectors. But most often, there are inefficient processes, so every organization will reach a limit. The more efficient they are, the higher the limit.

Open-source approach

When you look at Singer.io, you would think that an OSS approach is at a serious disadvantage here. But Singer.io has actually made several significant poor choices here:

There is an absence of standardization / enforcement of protocol, and developers just add whatever they want in their implementation and messages.
Integrations live by themselves in their own repos. So each connector can be seen as an individual open-source project. This does not help when building a community.
There is no real ownership, nor direction for the project.

So can we actually achieve a large number of high-quality well-maintained connectors with an open-source approach? The answer is probably, but only if the following conditions are met:

One single repo for the whole project and all connectors. This will help consolidate the developments, and it will unify the community behind one single project and one vision. This means that the project and community can be opinionated about what a connector should be, and how it should be built.
Detailed logs for all users, with the ability to easily surface the issue on the GitHub project to the whole community.
A separation between extract-load and transformation. Users would be able to use the connector without any normalization, so they use their own transformation processes. By doing so, we can more easily identify where failures happen, and how to solve them. Plus, it would make it easier to support a lot more integrations faster.

A last note here: when users use a connector on prod and see it fail, they will be able to fix it without waiting for external support teams (as for commercial companies), and the whole community will be able to benefit from the fix.

2. Addressing unique corner cases and unique needs

Every company is opinionated about the way it wants to use a connector; each has specific needs.

Commercial approach

Commercial companies have a ROI-driven relationship with connectors. Any additional work on any edge case will need to have some ROI, or the company will have a hard time scaling. That also means that the commercial company will most likely have very high quality connectors for the most used ones. But it can’t afford to keep that level of quality across all integrations, and thus won’t be addressing their corner cases. With their finite resources, it will always be a trade-off. Commercial companies can’t address the long tail of integrations and can’t meet all unique needs.

Furthermore, in the case a client’s needs are not met, the client won’t be able to customize the existing connector to their needs. That is the limitation of this approach.

Open-source approach

It is actually not obvious that an OSS approach would easily address all unique corner cases. Contributors build the connectors with their use cases and needs in mind; they won’t be building a high quality connector to address most companies’ needs.

However, there are ways to avoid that:

Setting standards and enforce protocols, with one single platform / project, as mentioned in the first point.
Decoupling extract-load (EL) and normalization. A lot of the customization can be done directly in normalization, but a lot of the EL is common to everyone. In that case, everybody can benefit from the EL part and use their own transformation tools to address their own needs.
Enabling contributors to build their connectors using the language of their choice by running connectors as Docker containers

Being open source lets any engineer adapt any pre-built connectors to their needs, so every company will be able to address their own needs. The question is more about how much effort would be needed for each of their cases.

3. Pricing model indexed on usage

All existing commercial products have their pricing indexed in some way on the volume of data transferred and the number of connectors used. That means that teams cannot use those services without always thinking about costs.

On the other hand, open-source products will base their pricing on the features used and not the volume of data as it is self-hosted. So once adopted, teams do not have to worry about costs and can use all connectors freely.

4. Debugging autonomy

This is an easy one. When dealing with a commercial product, if you face an issue, the only thing you can do is contact their support, and hope that your issue will be prioritized in the list of issues the company’s engineering team must resolve. You won’t be able to know when the issue will be fixed. It can take hours at best, or months.

With open source, you don’t need to wait on anybody. You can fix the integration yourself. Having this ability is important if the integration is vital to your business.

5. Getting the internal resources

By resources, we mean either budget or the engineering time to work on the integrations. At Airbyte, we don’t like the word resources when talking about people. But, in this context. it is the only common one we could think of.

The truth is that in a lot of enterprises, it can be easier to justify hiring a new employee than getting even a modest budget approved for an external vendor. Depending on the open-source project, most or all of the job could be already done. That’s one of the reasons why most companies use open-source technologies rather than external vendors.

6. Easy to use by any teams (analysts, finance teams, etc.)

Once consolidated, the data is used by different teams: business intelligence / analysts, data scientists, finance teams, customer success, product or marketing team, etc. That is the beauty of products like Fivetran: anybody can use them and anyone can add more data.

One would think that ease of use is exclusive to commercial products, but that is actually not true. For instance, Gitlab and many other OSS projects have UIs that are usable by a non-technical audience. Airbyte has one and it is one of its primary features.

7. Security and privacy compliance

We’re talking about data, and most probably this involves customer data. All companies are subject to privacy compliance laws, such as GDPR, CCPA, HIPAA, etc. As a matter of fact, above a certain stage (about 100 employees) in a company, all external products need to go through a security compliance process that can take several months.

With open source, teams just use the technology directly without such processes. The adoption is frictionless, and the engineering need can be met overnight.

8. Moving data between internal databases

Commercial products mostly sit in the cloud. So if you have to replicate data from an internal database to another, it makes no sense to have the data move through the external vendor. In addition to the vendor’s costs, you need to add egress costs and the issue it poses in terms of privacy.

Open-source products are mostly self-hosted, so they don’t have this problem.

Is open source hard without a direct revenue channel?

One of the comments in the thread was that “open source is hard without a direct revenue channel to support it.” That is actually true for any company. If you have a hard time getting revenues, you can’t scale the team or your effort.

According to Bessemer, open-source companies grow faster in revenues than many cloud leaders.

Some other interesting quotes / points

“The general argument for open-source software that applies to all software: Having access to the source code means users have freedom--the freedom to change the software how they see fit and fix issues.”

“If you spent months or years using, say, Fivetran, and all of a sudden Fivetran goes out of business or significantly changes its business, you could run the software yourself, even if just for a period of time. I’ve been there when it’s happened with other software; it’s not pretty, and sometimes you only have days or weeks of heads up.”

“If it’s open source, I can run it without Fivetran, so for special circumstances, like compliance or high volume, I don’t have to deal with legal or cost limitations.”

“I can see the code running my integration; I can more quickly debug issues. I can also easily create integrations anyone can choose to use. Tools like Fivetran and Stitch only work a percentage of the time; open source allows me to fill in that gap without creating one-off scripts (or Fivetran lambda functions) or going back and forth for weeks or months with support on a bug that I could fix in a few hours.”

Conclusion

If the open-source project is well structured and well thought through, there will be no points where a commercial approach can be better. The issue is that, until now, there were no good open-source projects that could alleviate the dangers to this approach.

If you like how we think an open-source project can address all the possible concerns by building a single platform with a single repo, a defined data protocol, a ready-to-use UI, and decoupling of EL and normalization, you should definitely have a look at what we’re building at Airbyte. We want to become the open-source standard for data integrations.

Let us know if you think we missed anything. Our goal is to see things from all perspectives and to keep this article up to date.

DEV Community

Solving Data Integration: The Pros and Cons of Open Source and Commercial Software

TL;DR

1. Having a large number of high-quality well-maintained pre-built connectors

Commercial approach

Open-source approach

2. Addressing unique corner cases and unique needs

Commercial approach

Open-source approach

3. Pricing model indexed on usage

4. Debugging autonomy

5. Getting the internal resources

6. Easy to use by any teams (analysts, finance teams, etc.)

7. Security and privacy compliance

8. Moving data between internal databases

Is open source hard without a direct revenue channel?

Some other interesting quotes / points

Conclusion

Top comments (0)

Read next

Pulumi vs Terraform: An In-Depth Comparison

How to build a web application from scratch with no experience

What is an Abstract Syntax Tree in Programming?

Difference between WHERE vs HAVING Clause in SQL