
Making Data Engineering Easier: Operational Analytics With Event Streaming and Reverse ETL

The past several years have seen something of a revolution in data technologies: tools like the cloud data warehouse have come of age, and principles borrowed from software engineering are rapidly being turned into data-focused SaaS.

There is a huge amount of information about what these tools allow you to do, and rightly so. In many ways the modern data stack fulfills promises that siloed SaaS tools never could. For example, pulling data out of silos to build a complete, single-source-of-truth customer profile is actually possible in the data warehouse or data lake.

Tooling has also emerged that helps companies "put data to work" once they have created some sort of value with it (most often in a data warehouse like Snowflake, Google BigQuery, or Amazon Redshift). This has been called operational analytics. Really, operational analytics is just a marketing term coined to describe the reverse ETL workflow of syncing data from your warehouse to the downstream business tools that marketing, sales, customer success, and other teams use to run the company.

Combining all of these tools enables helpful use cases that weren't possible before. The complete customer profile example is a good one to describe this architecture.

Building the profiles in the warehouse requires data integration from data sources across the stack. Running ETL/ELT pipelines and collecting behavioral data gives you the full set of user traits in the warehouse. The next logical step is what some call the last mile: sending the full set of characteristics (or a relevant subset) to the operational systems different teams use to run the business.

These could be marketing automation tools like Marketo or Braze, sales tools like Salesforce or HubSpot CRM, customer support tools like Zendesk, or business intelligence tools like Tableau or Looker; you get the point.

By delivering enriched data from the warehouse to all these tools, data teams effectively give business teams superpowers, equipping them with comprehensive customer information in the tools they use every day to do things like prioritize outreach, build better email lists, optimize marketing campaigns, craft better customer experiences, etc.

Reverse ETL makes data engineering easier

Because so much attention is placed on use cases for business users, like the one above, it's easy to overlook one of the biggest benefits of all this modern tooling (and reverse ETL tools in particular): they make data engineering easier.

Even before the modern data stack, it was technically possible to build a complete customer profile in an operational system like Salesforce, but it was a painful, expensive endeavor that pushed SaaS tools far beyond their intended uses. It also required multiple full-time roles (and often consultancies) to maintain the brittle Frankenstein of connectors and customizations. For most companies, the juice wasn't worth the squeeze.

Thankfully, the modern data stack is making data engineering easier, enabling data teams to focus on more valuable problems, as opposed to building and maintaining low-level data infrastructure just to move data.

In the next section of this post, I'll give an overview of a real-world data challenge we faced at RudderStack and how we used our own customer data pipelines (including our Reverse ETL solution) and our warehouse to solve it.

A classic data challenge: getting richer conversions into ad platforms

Instrumenting pixels and conversions from ad platforms is one of the more frustrating tasks for data teams, especially when you try to implement anything more complex than the most basic use cases. In what can only be described as perverse irony, those more complex use cases are actually what unlock the most value from digital advertising because they give the algorithms better data for optimization.

One of the most common use cases is sending some sort of signup or lead conversion to a platform like Google Ads. For RudderStack, those conversions are either form fills on the website or account creations in our app. Thankfully, we don't have to deal with all of the direct instrumentation using something like Google Tag Manager because RudderStack makes it easy to map existing event stream identify and track calls (like "User Signed Up") to conversions in Google Ads.

Sending all signups as conversions is helpful, but our paid team wanted to take it a step further: specifying which conversions represented qualified signups (i.e., MQLs). That sounds simple on the surface, but it's actually a tricky data engineering problem because our MQL definition is both behavioral (did the user perform certain behaviors) as well as firmographic/demographic (company size, industry, job title, etc.). To add even more spice to the challenge, our firmographic/demographic enrichment happens via Clearbit API calls, and the ultimate 'operational home' for the MQL flag on a user is Salesforce.

Think about building a custom solution just to try to get those enriched conversions into Google Ads. Yikes.

The good news is that Google provides a data point that opens a pathway to tying all of this together. We won't go into detail, but it is called the "gclid," which stands for "Google click ID." This value represents a unique click by a unique user on a particular ad. It's appended as a query parameter to the URL the user lands on when they click (e.g., https://www.example.com/pricing?gclid=abc123).

Using the gclid and some clever workflows in our stack, we are able to not only send MQL conversions to Google Ads, but also drive much more detailed reporting on our paid advertising performance.

We will cover the step-by-step details in an upcoming guide, but here's the basic workflow:

First, grab the gclid value from the URL property of a page call, and add it to the user profile table in the warehouse

One widespread method of capturing the gclid on a lead record is creating a hidden field on forms or signup flows, grabbing the value from the URL, and populating the hidden field so that on submission, the user record has a gclid value (in some sort of custom field). The main problem with that method is that many users don't fill out the form on the page they land on (they browse the site for more information), meaning the gclid parameter is lost on subsequent pageviews.

Mitigating that problem is fraught with peril. Many data engineers attempt the annoying practice of persisting the gclid value in the data layer, often in Google Tag Manager, so that it stays with the user as they browse the site. Not only is this brittle, but it becomes an enormous pain when you have to build similar workflows for other ad platforms. Suffice it to say, this is a huge data engineering headache no matter which way you slice it.

RudderStack, dbt, and the warehouse make this much, much easier. We already track anonymous pageviews for each unique user, and those page calls include the URL as a property in the event payload. When the user eventually converts, we have the associated track and identify calls in the warehouse, so we can tie all of the previously anonymous pageviews to that now-known user.
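To make that concrete, here's a minimal sketch of the stitching query, assuming RudderStack's default warehouse schema, where page calls land in a pages table and identify calls land in an identifies table, both keyed by anonymous_id (exact table and column names vary by setup):

```sql
-- Sketch: tie previously anonymous pageviews to now-known users.
-- Assumes RudderStack's default warehouse tables (pages, identifies);
-- exact table and column names vary by configuration.
with id_map as (
    -- one row per anonymous_id/user_id pair from identify calls
    select distinct anonymous_id, user_id
    from identifies
)
select
    m.user_id,
    p.context_page_url,
    p.received_at as page_viewed_at
from pages p
join id_map m
    on p.anonymous_id = m.anonymous_id
```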

We already run a dbt model that generates enriched user profiles, which include the MQL flag from Salesforce pulled via a RudderStack ETL (extract, transform, load) job. A few simple lines of SQL let us grab any previous pageviews associated with MQLs whose URL properties include a gclid value.
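Those few lines might look something like this (a sketch in Snowflake syntax; stitched_pageviews, user_profiles, and is_mql are illustrative names standing in for however the stitched pageviews and the Salesforce MQL flag are modeled in your warehouse):

```sql
-- Sketch: pull a gclid for each MQL from their historical pageviews.
select
    u.user_id,
    -- take the value between 'gclid=' and the next '&' in the URL
    max(split_part(split_part(p.context_page_url, 'gclid=', 2), '&', 1)) as gclid
from user_profiles u
join stitched_pageviews p
    on u.user_id = p.user_id
where u.is_mql
  and p.context_page_url ilike '%gclid=%'
group by u.user_id
```

Note that max() just picks one value per user; a production model might prefer last-touch logic that orders pageviews by timestamp instead.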

Easy as that, we have the gclid value added to MQL user profiles without all of the mess of trying to persist values, do direct instrumentation with Google Tag Manager, or build a bunch of custom pipelines.

Second, send the gclid value to Salesforce via reverse ETL and sync MQLs to Google Ads via direct integration

At this point, a majority of the operational analytics work is done. We just need to complete the loop by getting the enriched MQL conversions back to Google Ads.

There are a few ways to do this: you could send the enriched conversions directly to Google Ads, but we chose to use the direct integration between Salesforce and Google Ads so that our data engineering team didn't have to maintain additional pipelines.

So, the only thing left to do was get the gclid value into Salesforce. Amazingly, that was already taken care of by the reverse ETL job we were running to add other enriched data points (like lead score) to lead records in Salesforce. Revenue ops created a custom field, which we mapped in our RudderStack Reverse ETL job, and the data flow was complete.
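For a sense of what that sync involves on the warehouse side, a reverse ETL job boils down to selecting rows from a table or model and mapping columns to destination fields. A hypothetical version of our selection (illustrative names; the actual column-to-field mapping is configured in the reverse ETL tool):

```sql
-- Sketch: the kind of selection a reverse ETL sync reads from the
-- warehouse before mapping columns to Salesforce lead fields.
-- user_profiles and the column names are illustrative.
select
    email,       -- identifier used to match the Salesforce lead
    lead_score,  -- enriched data point already being synced
    gclid        -- newly added column, mapped to the custom field in Salesforce
from user_profiles
where email is not null
```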

On to more important problems

Our team wired up, tested, and deployed this solution in a few hours. I have battle scars from past wrangling to achieve similar data flows, so it was pretty mind-boggling to see how easily we accomplished this.

The best part, though, was that it gave our data team time to move on to the bigger, more important problem of collaborating with our analytics team to build better reporting on ad performance.
