Team RudderStack for RudderStack

Posted on • Originally published at rudderstack.com

RudderStack: How Pachyderm Pipelines Help Parse Customer Event Data

Astasia from RedPoint Ventures wrote a great post on new technologies supporting "reverse ETL" functionality in the customer data stack.

We're excited to be innovating in the area of reverse ETL tech (via our Warehouse Actions feature), and our product and engineering teams discuss these topics and industry trends often, so we thought it would be helpful to provide a bit more technical depth on a few of Astasia's points.

1) Data Movement Differs Between Event Streams and Tabular Data, Which is an Important Consideration for Reverse ETL

Differences in Moving Data

ETL/ELT solutions all accomplish a similar function, moving data, but there are several foundational differences to keep in mind when it comes to the data. One of those is the difference between event stream data and tabular data.

Everyone is familiar with this distinction on the ingestion side of the stack. Ingesting event streams is different from ingesting tabular data from SaaS applications (like Salesforce), not just in how the data is generated (i.e., pulled via an API vs. generated by an SDK) but also in the structure of the data itself.

This distinction has significant impacts, from real-time requirements to downstream reporting implications, and it has led to different vendors doing well in each category (i.e., Segment for event streaming and Fivetran for tabular data). Modern companies, however, have to leverage both.

At RudderStack, we believe there is an opportunity to do both well, together. In fact, a unified stack to do both can achieve some interesting things, like cross-pipeline identity stitching (i.e., joining Salesforce record IDs with anonymousIDs) and unified data governance. We're building these solutions at RudderStack, but that's a topic for a different post.

Event Stream vs. Tabular Data for Reverse ETL

When you hear the term reverse ETL, it's easy to think only of tabular data. The same distinction applies on the way out of the warehouse, though: you can (and should) distinguish between event stream data and tabular data.

Astasia touched on a few use cases for tabular data, but reverse ETL as an event stream is equally important. One use case we see quite often among our customers is sending events stored in logs (e.g., generated by your backend application and dumped into S3) into destinations like Google Analytics and Amplitude for analytics or platforms like Braze for marketing.
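For illustration, here's a minimal Python sketch of that pattern, assuming the backend logs are newline-delimited JSON objects sitting in S3. The bucket, key, and field names are hypothetical, and the resulting payloads would then be handed to whatever pipeline delivers them to the downstream destinations.

```python
import json
from typing import Any, Dict, List

import boto3  # AWS SDK for Python


def load_log_events(bucket: str, key: str) -> List[Dict[str, Any]]:
    """Read newline-delimited JSON log records from S3 and map each one
    to a track-style event payload ready to be piped downstream."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    events = []
    for line in body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        events.append(
            {
                "type": "track",
                # Hypothetical field names -- adjust to whatever your backend actually logs.
                "userId": record.get("user_id", "anonymous"),
                "event": record.get("action", "backend_event"),
                "properties": {k: v for k, v in record.items() if k not in ("user_id", "action")},
            }
        )
    return events


if __name__ == "__main__":
    for event in load_log_events("my-backend-logs", "events/2021-05-01.jsonl"):
        print(event)
```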

Many of our customers also perform more advanced processing like data mining or ML modeling of logs before sending them as events, then use RudderStack to pipe the data.

2) Table Sync can be Modeled as an Event Stream

An important point related to tabular vs. event stream data is that tabular data can be modeled as events, but not necessarily the other way around. On the ETL/ELT side, CDC technologies (or incremental pulls) have generated quite a bit of interest because there are advantages of representing that data as events versus doing a batch pull via API.

Some of those advantages include incremental syncs for real-time updates, maintaining a consistent point-in-time state, and routing the data to streaming technologies (we will cover this topic in more depth in a future post).

This is possible because tabular data is actually a subset of event data, just as batch processing is a subset of stream processing. A table can be derived from a stream of events, recreating its final state at any point in time (see event sourcing architecture). The inverse is not true, though.
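A toy example makes the asymmetry concrete: folding an ordered stream of change events reproduces the table at any point in time, but a table snapshot alone can't reproduce the events that created it. This is a generic event-sourcing sketch, not any vendor's implementation.

```python
from typing import Any, Dict, Iterable, Optional

Event = Dict[str, Any]


def rebuild_table(events: Iterable[Event], up_to: Optional[int] = None) -> Dict[str, Dict[str, Any]]:
    """Replay events (optionally only the first `up_to`) into a keyed table."""
    table: Dict[str, Dict[str, Any]] = {}
    for i, event in enumerate(events):
        if up_to is not None and i >= up_to:
            break
        row_id = event["row_id"]
        if event["op"] == "delete":
            table.pop(row_id, None)
        else:  # "insert" or "update"
            table.setdefault(row_id, {}).update(event["fields"])
    return table


events = [
    {"op": "insert", "row_id": "u1", "fields": {"email": "a@example.com", "plan": "free"}},
    {"op": "update", "row_id": "u1", "fields": {"plan": "pro"}},
    {"op": "insert", "row_id": "u2", "fields": {"email": "b@example.com", "plan": "free"}},
    {"op": "delete", "row_id": "u2", "fields": {}},
]

print(rebuild_table(events))           # final state: only u1, now on the "pro" plan
print(rebuild_table(events, up_to=2))  # point-in-time state after the first two events
```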

In fact, the reason the industry adopted the tabular/batch model was primarily technical---it is much more difficult to build and manage streaming data. However, this is changing with technologies like RudderStack.

At RudderStack, we have modeled table sync as an event stream in our reverse ETL solution.

At a high level, syncing a row from your warehouse to a 'row' in a cloud application is an "event" that specifies which data points are being mapped. Tools like Segment and RudderStack already accomplish that functionality with .identify calls in the event stream, so there is no inherent limitation of the data model for the use case.
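As a rough illustration (with hypothetical column names, not RudderStack's actual mapping logic), a row sync can be expressed as an identify-style event whose key becomes the userId and whose remaining columns become traits:

```python
from typing import Any, Dict


def row_to_identify_event(row: Dict[str, Any]) -> Dict[str, Any]:
    """Express one warehouse row as an identify-style event: the row's key becomes
    the userId and the remaining columns become traits to upsert downstream."""
    return {
        "type": "identify",
        "userId": str(row["user_id"]),
        "traits": {col: val for col, val in row.items() if col != "user_id"},
    }


# Hypothetical row pulled from a warehouse table of enriched users.
row = {"user_id": 42, "email": "jane@example.com", "lifetime_value": 1280.5, "churn_risk": "low"}
print(row_to_identify_event(row))
```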

So, while there are certainly different user experiences for streaming-based solutions like RudderStack vs. table sync solutions like Census, they are primarily variations in UI/UX.

3) The Limitations of Segment's Personas Product for Reverse ETL

Astasia made a great point about the distinction between reverse ETL and Segment's Personas product. Personas is a powerful product feature, but it isn't a reverse ETL solution. The reason is simple: Personas treats the user profile as a first-class object in data sync, so in practice every sync must conform to a contact/account structure. As the recent increase in reverse ETL startups has shown, though, companies need sync functionality that serves a much wider range of use cases.

Reverse ETL as an event stream directly from the warehouse is unhindered by those limitations. In fact, with our Warehouse Actions feature, our customers can turn warehouse tables into a flexible, configurable event stream. That includes updating contacts, accounts, and audiences, but it can also support sending cleansed internal events, derived proxy events (events represented by the absence of behavior), and use cases where the data needs to be delivered to other infrastructure via tools like Kafka, Redis, or HTTP endpoints.
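To make the idea concrete, here is a hedged sketch (not RudderStack's implementation) that diffs two snapshots of a keyed warehouse table and publishes each new or changed row as an event to Kafka using the kafka-python client; the topic and field names are made up.

```python
import json
from typing import Any, Dict, Iterator

from kafka import KafkaProducer  # kafka-python


def diff_to_events(previous: Dict[str, Dict[str, Any]], current: Dict[str, Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Compare two snapshots of a keyed table and yield one event per new or changed row."""
    for row_id, row in current.items():
        if previous.get(row_id) != row:
            yield {"type": "track", "event": "row_changed", "userId": row_id, "properties": row}


# Hypothetical snapshots of a warehouse table keyed by user id.
yesterday = {"u1": {"plan": "free"}, "u2": {"plan": "pro"}}
today = {"u1": {"plan": "pro"}, "u2": {"plan": "pro"}, "u3": {"plan": "free"}}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for event in diff_to_events(yesterday, today):
    producer.send("warehouse-row-changes", event)  # consumers see a stream of changes, not a table
producer.flush()
```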

4) Reverse ETL is Still Just Data Movement, and a Single Pipe Simplifies Your Stack, Security, and Data Governance

Our mission at RudderStack is to help data engineers become the heroes of their companies by providing every team with rich data. We want to make their jobs easier, and part of that mission is simplifying data management into one pipeline.

In the modern stack, the warehouse is king, and many destinations are also becoming sources (and vice versa!).

For example, sources of data often include:

  • Events coming from your client or server-side apps
  • Data from your SaaS tools
  • Data from your internal databases
  • Data from your warehouses and data lakes (and...lakehouses)
  • Data from internal event streams (like Kafka)

When all of those sources are also destinations, almost every combination of source and destination is a use case, which creates some important categories of tooling in the customer data stack:

  • App to warehouse/SaaS (event streaming)
  • SaaS to the warehouse ('traditional' ETL/ELT)
  • Warehouse to SaaS (new 'reverse ETL' category)
  • SaaS to SaaS (API to API category)

Increasingly, those categories within the stack need to support important use cases that are becoming standard but are still challenging for many companies to implement from a technical standpoint:

  • Enabling customer-facing ML use-cases by sending live events to a key-value store (like Redis) for real-time personalization (see the sketch after this list)
  • Enabling internal ML use-cases by pulling data from the warehouse, enriching it with ML models, then sending it to tools for internal teams (i.e., Salesforce, Marketo, etc.)
  • Streaming internal events from Kafka to SaaS applications
  • Feeding transformed data (features) to feature stores (like Tecton)
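As an example of the first use case above, here is a minimal sketch of pushing warehouse-computed user traits into Redis so an application can read them at request time. The trait names and keys are hypothetical, and the redis-py client is just one way to do it.

```python
import redis  # redis-py client

# Connect to the key-value store used for low-latency lookups at request time.
r = redis.Redis(host="localhost", port=6379, db=0)

# Hypothetical user traits computed in the warehouse (e.g., by a nightly ML scoring job).
computed_traits = {
    "user:42": {"churn_risk": "low", "recommended_plan": "pro", "ltv_bucket": "high"},
    "user:43": {"churn_risk": "high", "recommended_plan": "free", "ltv_bucket": "low"},
}

# Push each user's traits into a Redis hash so the application can personalize in real time.
for key, traits in computed_traits.items():
    r.hset(key, mapping=traits)

# At request time, the product looks up a user's traits with a single O(1) read.
print(r.hgetall("user:42"))
```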

When you step back and look at those categories and the use cases they must support, it becomes clear that a business could easily have to build or buy a significant number of technologies to enable all of the functionality. In fact, we hear from companies all of the time about the pain of managing multiple technologies and vendors (which means contracts) across data pipelines.

One commenter on Astasia's LinkedIn post said it this way:

"I think we have failed as technologists if we need to build different tools to load data into a warehouse and get data out of the warehouse. This is "focus on doing one thing well" and "you must create a new category" gone too far."

We agree. Customers tell us all of the time that when it comes to managing pipelines in the context of the modern data stack, "best of breed" is becoming problematic to manage---after all, it's the same customer data.

At RudderStack, we're building the complete customer data stack for simplified pipeline management, including the reverse ETL component.

Sign up for Free and Start Sending Data

Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app. Get started.

This post explores the customer data stack of our business partner Pachyderm, which offers a Kubernetes-based ETL (extract, transform, load) tool of the same name. We'll show how Pachyderm leverages real-time customer event data across different sources to gain deeper insights into user behavior in its product and optimize the UX to increase customer adoption.

What is Pachyderm?

Pachyderm was designed to make building and managing end-to-end ML/AI pipelines easier, regardless of their size and complexity. With Pachyderm, you can track your data lineage and bring together version control for your data with the tools, languages, and frameworks of your choice to build scalable data science pipelines.

With Pachyderm, there are three easy ways to get up and running with your data science pipelines. You can either deploy Pachyderm in your local environment, on your favorite cloud provider, or use the company's hosted and fully-managed SaaS platform: Pachyderm Hub.

How Pachyderm Uses Its Data

Being a modern data science platform, all the teams within Pachyderm make extensive use of data for all their operations and KPIs. The customer team at Pachyderm uses their customer event data --- collected, routed, and warehoused with RudderStack Event Stream --- to understand user behavior and product usage patterns better. They also use this data to drive effective inbound and outbound marketing campaigns.

Pachyderm uses product usage data --- again collected, routed, and warehoused with RudderStack Event Stream --- from within the Pachyderm Hub, its SaaS platform, for product analytics and optimizing the platform. They then leverage their event stream and product usage data along with non-event data from their cloud tools like Salesforce, HubSpot, Zendesk, Slack, and Google Analytics --- collected and warehoused with tools like Fivetran and Stitch.

With this unique combination of customer event data, product usage data, and data from its cloud tools, along with modern data storage and processing tools, Pachyderm has built a robust, powerful stack that supports all of its activation use-cases.

Pachyderm's Customer Data Stack

  • Data Collection and Synchronization: RudderStack SDKs, RudderStack Event Stream, RudderStack Warehouse Actions, Stitch, Fivetran
  • Data Warehouse: Google BigQuery
  • Data Processing and Business Intelligence: Sigma Computing
  • Cloud Toolset for Activation Use-cases: HubSpot, Google Analytics, Zendesk, Salesforce, Outreach.io

How Data Flows Through the Pachyderm Stack

Here's a visual representation of the data flow through Pachyderm's data stack:

Pachyderm pipelines (data flow diagram)

Data Collection

Pachyderm generates hundreds of gigabytes of data by tracking user interactions and by pulling data from cloud sources. All of this data resides in the customer data lake built on top of their data warehouse, Google BigQuery.

Data Streaming and Warehousing

Pachyderm teams mainly use customer event data to track user journeys and trigger various downstream use-cases (such as marketing campaigns). They use RudderStack Event Stream to route event data to downstream destinations such as HubSpot and Google Analytics, as well as to their data warehouse, Google BigQuery. RudderStack's seamless integration with these platforms allows them to stream near real-time events for their activation use-cases.

To track and capture event data on their web app, Pachyderm uses the RudderStack JavaScript SDK. For server-side capturing, it uses the RudderStack Go SDK. In addition to the SDKs, Pachyderm also uses the RudderStack HTTP API for seamless capture of customer events.
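For readers who want a feel for the server-side option, here's a small Python sketch of sending one event over an HTTP tracking endpoint. The /v1/track path and write-key Basic auth follow the general pattern of RudderStack's HTTP API, but treat the exact URL, auth scheme, and payload fields as assumptions to verify against the documentation.

```python
import requests

# Illustrative values -- substitute your own data plane URL and source write key.
DATA_PLANE_URL = "https://your-data-plane.example.com"
WRITE_KEY = "YOUR_SOURCE_WRITE_KEY"


def track(user_id: str, event: str, properties: dict) -> None:
    """Send a single server-side track event over HTTP.

    Assumption: a /v1/track endpoint that accepts Basic auth with the write key
    as the username and an empty password, mirroring the SDK payload shape.
    """
    payload = {"userId": user_id, "event": event, "properties": properties}
    resp = requests.post(f"{DATA_PLANE_URL}/v1/track", json=payload, auth=(WRITE_KEY, ""))
    resp.raise_for_status()


track("user-42", "workspace_created", {"plan": "free", "region": "us-east-1"})
```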

"RudderStack has given us better access to our data. Our data was siloed in cloud sources. Now we have it all in a warehouse, making it accessible to everyone."\
--- Dan Baker, Marketing Ops Manager, Pachyderm

Data from Its SaaS Platform and Cloud Tools

Apart from customer events, Pachyderm collects data from Pachyderm Hub, its SaaS platform. This data mainly includes the customers' workspace usage details and other metrics related to the platform usage and performance. Pachyderm clusters (which host both their open-source and enterprise offerings) also generate a large chunk of the workspace usage data. This data is processed by Webhook Relay and streamed directly into the company's data warehouse.

"With RudderStack, there are fewer moving pieces. Having RudderStack means that we don't have to have a bunch of manual processes or APIs that we have to build for and support ourselves."\
--- Dan Baker, Marketing Ops Manager, Pachyderm

Pachyderm also pulls different types of data from a variety of cloud sources. Some of these include marketing data from HubSpot, customer and CRM data from tools like Salesforce and Zendesk, messaging data from Slack, advertisement and campaign data from tools like Google Ads, and insights from Google Analytics. To collect this data, Pachyderm uses popular ETL tools like Fivetran and Stitch.

All of this information is useful for stitching together comprehensive customer profiles and understanding customers' in-product behavior.

Data Transformation and Business Intelligence

Once the data is in its data warehouse, Pachyderm uses Sigma --- a warehouse-focused analytics and BI tool --- for internal reporting and metrics tracking. Beyond general BI, it also uses Sigma to build queries, join tables, and aggregate the data stored from various sources to create a single source of truth for customer data.

This transformed data is materialized in BigQuery for storage. The team then uses RudderStack Warehouse Actions to send this transformed data from the warehouse to downstream destinations like HubSpot.

*"*We already have a bunch of uses for Warehouse Actions, and we're only at the tip of the iceberg. Our future plans involve getting more data insights out of the warehouse, and Warehouse Actions will continue to be our go-to solution for this."\
--- Dan Baker, Marketing Ops Manager, Pachyderm

Data Activation Use-Cases

The way Pachyderm uses its transformed data for activation use-cases is quite interesting. The customer event data, together with data coming from its apps and systems, is primarily used for lead qualification and for analyzing customer journeys and behavior.

Lead Qualification

When a user first signs up on Pachyderm, the first course of action suggested is to create a workspace. Pachyderm's customer team encourages this action with drip emails. Once the user has created a workspace, an event is sent from the application backend to their data warehouse.

The team then uses Sigma to determine the total number of workspaces created and the number created since the last run, then materializes this data in the data warehouse. This information is then sent back to HubSpot with RudderStack Warehouse Actions. Once in HubSpot --- the inbound lead system --- this data is synced with Salesforce --- the outbound lead system. After the behavioral data from the application has made its way into their CRM, they use Outreach.io to drive their personalized messaging and email campaigns, and (in this example) they stop sending drip emails to users who have created a workspace.
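A rough sketch of what that aggregation step could look like against BigQuery is below; the project, dataset, table, and column names are hypothetical, and this is not Pachyderm's actual query.

```python
from datetime import datetime, timezone

from google.cloud import bigquery  # google-cloud-bigquery client

# Hypothetical project, dataset, table, and column names.
SQL = """
SELECT
  COUNT(*) AS total_workspaces,
  COUNTIF(created_at > @last_run) AS workspaces_since_last_run
FROM `my_project.product_events.workspace_created`
"""


def count_workspaces(last_run: datetime) -> dict:
    """Aggregate workspace-creation events in BigQuery; the materialized result
    can then be synced to HubSpot by a downstream reverse-ETL job."""
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("last_run", "TIMESTAMP", last_run)]
    )
    row = next(iter(client.query(SQL, job_config=job_config).result()))
    return {"total": row.total_workspaces, "since_last_run": row.workspaces_since_last_run}


print(count_workspaces(datetime(2021, 5, 1, tzinfo=timezone.utc)))
```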

"Getting aggregated event data into HubSpot and Salesforce is key for our Sales team to be able to effectively communicate with our customers."\
--- Dan Baker, Marketing Ops Manager, Pachyderm

Customer Journey and Behavior Analysis

With event data captured across various customer touchpoints, the Pachyderm teams can now understand their customers' product journeys and pinpoint any issues. They also get deeper insights into how users are using the platform and can optimize those specific aspects of the product to improve the customer experience.

Simplified Data Access and No Data Silos

Before incorporating RudderStack into their customer data stack, the team at Pachyderm had to do all of their plumbing in HubSpot to move data around the data warehouse. This, in turn, created a lot of data silos. They looked at other customer data platforms to solve these problems, but the cost was unjustifiable given what they were getting in return.

With RudderStack now a core component of their customer data stack, Pachyderm can get all the event data into a centralized location --- the data warehouse --- to be used efficiently for analytics. In addition, RudderStack Warehouse Actions sends insights from the analysis to downstream tools for lead qualification and personalized marketing.

Sign up for Free and Start Sending Data

Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app. Get started.
