Why You Should NOT Build Your Data Pipeline on Top of Singer

Singer.io is an open-source CLI tool that makes it easy to pipe data from one tool to another. At Airbyte, we spent time determining whether we could leverage Singer to programmatically send data from any of its supported data sources (taps) to any of its supported data destinations (targets).

For the sake of this article, let’s say we are trying to build a tool that can do the following:

Run any Singer tap or target
Provide a UI for configuring and running those taps and targets
Count the number of records synced in each run

In the context of these goals, being able to use Singer programmatically means writing a program that can, for any integration:

provide a UI with instructions on what information a user needs to input in order to configure that integration (e.g., host, password, etc.).
take those user-provided values and execute each integration.

We know that the requirements described here are not the use case that Singer sets out to solve, but nonetheless, we wanted to see if we could leverage Singer to bootstrap this use case. Sure enough, we ran into some “gotchas” along the way. These gotchas illustrate some of the core primitives that a programmatic data integration tool requires.

Integrations do not declare their configurations

The Singer protocol does not specify how an integration should define what inputs it requires. This means that, in order to use most Singer taps, you need to scour the entire implementation to figure out what properties it uses; depending on the complexity of the integration, this can be pretty painful.

Some integrations help out by specifying what the configuration should look like in a readme or in a sample config. Even these lead to headaches. They often just list the fields that need to be passed in without explaining what they mean, what format they take, or how to find them (good luck trying to find all the information you need to configure your Google Ads integration!). In other cases, they list only a subset of the fields, and you have to discover the rest by reading the integration’s code. For example, tap-salesforce didn’t mention is_sandbox in its docs. (UPDATE: someone has now added this field in the readme with this PR.)

These taps are great; we have happily used all of them. But because they do not specify what is required to configure them, they can’t be used programmatically. Specifically, our program needs to know that the Postgres tap requires the fields hostname and port. Without this specification, the program cannot figure out how to build a valid configuration for an integration. The missing specification is expensive to shim, because it requires engineering work for every single integration!
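To make the gap concrete, here is a minimal sketch of what a machine-readable configuration declaration could look like. Nothing like this is part of the Singer protocol: the POSTGRES_SPEC structure, its field descriptions, and the validate_config helper are all hypothetical, written only to illustrate what a program would need in order to render a form and check user input.

```python
# Hypothetical spec: Singer defines nothing like this. The hostname and port
# fields follow the Postgres example in the text; "password" is illustrative.
POSTGRES_SPEC = {
    "hostname": {"type": str, "required": True, "description": "Database server host"},
    "port": {"type": int, "required": True, "description": "Database server port"},
    "password": {"type": str, "required": False, "description": "Database password"},
}


def validate_config(spec, config):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, field in spec.items():
        if name not in config:
            if field["required"]:
                errors.append(f"missing required field: {name}")
            continue
        if not isinstance(config[name], field["type"]):
            errors.append(f"{name} must be of type {field['type'].__name__}")
    # Reject fields the spec does not know about, so typos surface early.
    for name in config:
        if name not in spec:
            errors.append(f"unknown field: {name}")
    return errors
```

With a declaration like this, a UI could generate its input form from the spec's keys and descriptions, and the same spec could reject a bad configuration before the tap is ever launched.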

No way to tell which Singer feature is compatible with which integration

Singer has excellent documentation around its core protocol. It also does a nice job defining the suite of special metadata that it supports. When you start actually using Singer, however, mapping these primitives onto your integrations is difficult. For example, “replication-method” sets whether all the data from the source should be replicated (“full_table”) or just the new or updated data (“incremental”). What is unclear is which taps actually support “incremental” or “full_table” or both.

Taps do not advertise, in a way that is programmatically consumable, which of these replication methods they support. Some of them mention it in their documentation, but ultimately that’s insufficient for the type of tool we want to build. So what happens when you request “incremental” from a source that only supports “full_table”? The behavior is undefined. Some taps will throw an error, some will just do a full refresh. Either way, from the point of view of the UI-based tool that we are trying to build, this isn’t really usable.

The problem only gets hairier for the more niche metadata (e.g., “view-key-properties”). You either need to read the source or just try it out and see if the configuration works. This problem is adjacent to the configuration problem described in the previous section and, similarly, requires a shim for every integration.
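As a rough illustration of the per-integration shim this implies, the sketch below keeps a hand-maintained table of which replication methods each tap supports and checks a request against it before a sync starts. The table, the tap names, and the check_replication_method helper are all hypothetical; Singer taps do not publish this information in any programmatically consumable form, which is exactly the problem.

```python
# Hypothetical capability declarations. Singer taps do not publish anything
# like this, and the tap names below are placeholders.
SUPPORTED_REPLICATION = {
    "tap-example-a": {"full_table", "incremental"},
    "tap-example-b": {"full_table"},
}


class UnsupportedReplicationMethod(Exception):
    """Raised when a tap is asked for a replication method it cannot honor."""


def check_replication_method(tap, requested):
    """Raise before the sync starts instead of relying on undefined behavior."""
    supported = SUPPORTED_REPLICATION.get(tap)
    if supported is None:
        raise KeyError(f"no capability declaration for {tap}")
    if requested not in supported:
        raise UnsupportedReplicationMethod(
            f"{tap} supports {sorted(supported)}, not {requested!r}"
        )
    return requested
```

The check itself is trivial; the cost is that someone has to read each tap's source or documentation to fill in the table, and keep it current.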

Singer’s own secret menu

If you’re from the West Coast, you might be familiar with how In-N-Out Burger popularized the “secret” menu in fast food chains. While charming at a drive-thru, secret menus can ruin your data integration.

The Singer protocol has some of its own secret menu items. For example, we were parsing each message emitted by a tap as JSON, using the schema declared in the Singer docs. We wanted to understand exactly which messages were being sent between taps and targets, so we would fail loudly if anything was sent that did not match the documented message types. Then we started getting errors on “ActivateVersionMessage.” After spelunking in the source code for a bit, we found that this message type has existed in Singer as an experimental feature since 2017. A handful of the official Singer taps use it, but there’s no guidance on what you’re supposed to do with it (I suspect it is a feature used internally at Stitch, the paid, managed solution from the creators of Singer). If you’re building something programmatic on top of Singer, your choice is to just filter it out or let it pass and hope that stuff…just works, I guess?

Handling this one case is not the end of the world, but it leaves you feeling uncertain what else is lurking in the protocol that might not play well with your system.
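The strict parsing described in this section can be sketched roughly as follows. This is not the code we ran; count_records and its ignore_types parameter are hypothetical names. The sketch assumes a tap writes one JSON message per line and that the undocumented message arrives with a "type" of "ACTIVATE_VERSION", which is our reading of the Singer source rather than something the spec documents.

```python
import json

# The three message types described in the Singer spec.
DOCUMENTED_TYPES = {"RECORD", "SCHEMA", "STATE"}


def count_records(lines, ignore_types=frozenset()):
    """Count RECORD messages, failing loudly on any undocumented type
    that has not been explicitly listed in ignore_types."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        message_type = message.get("type")
        if message_type in ignore_types:
            continue  # the "just filter it out" option
        if message_type not in DOCUMENTED_TYPES:
            raise ValueError(f"undocumented message type: {message_type}")
        if message_type == "RECORD":
            count += 1
    return count
```

Calling count_records(tap_output) with no ignore list reproduces the fail-loudly behavior, while count_records(tap_output, ignore_types={"ACTIVATE_VERSION"}) filters the message out. Neither choice tells you what the tap expected a target to do with it.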

Conclusion

So, to answer our original question: can we reasonably stretch Singer to meet our product requirements? The answer is no. Doing so would require writing custom shims for every single Singer tap and target. Since the goal with data integrations is always to scale to more integrations, any per-integration work is very expensive.

The Singer protocol is underspecified for this use case. This makes sense, because ultimately this is not the use case that the protocol is trying to solve. Achieving these requirements depends on integrations declaring much more information about how they are configured and which features they support. We are tackling this problem at Airbyte, so if you are looking for an OSS solution that makes it easy to move your data into a warehouse, instead of trying to roll your own on top of Singer, come check us out!

This article is meant to be the first in a pair of articles. The second will explore the engineering journey that we took to figure out where Singer should fit into our system.
