What is data integration?

#dataintegration #datapipelines #datascience #dataengineering

TLDR

Data integration is synchronizing information between 1 system (e.g. Salesforce) and another (e.g. Snowflake)

Source: Giphy

Outline

What is it?
Why do you need it?
How do you implement it?
Who is it for?
Future and next steps

What is it?

Data integration is the process of loading data from an external source and then exporting it to an internal destination. This process can also be reversed (e.g. reverse ETL).

Scenario

You regularly collect gold from scattered kingdoms that have pledged loyalty to your throne. You collect their gold payments through Stripe.

Source: Giphy

Your treasury (aka finance team) wants to better understand which demographics of kingdoms pay well and on time.

They need the gold payment data from Stripe in your Snowflake data warehouse so that they can combine it with the kingdom census data. Once combined, they can analyze the data and report it to the high council.

Why do you need it?

Use cases for integrating external data with internal systems include (but aren’t limited to):

Combine data from multiple sources to build comprehensive data models for business use cases
Analyze data from external SaaS tools
Combine data for personalization
Forecasting

Example use cases for loading data from your internal systems (e.g. PostgreSQL) and then exporting it to an external system (e.g. HubSpot) include (but aren’t limited to):

Getting internal application data into marketing tools (e.g. Mailchimp, Google Ads, etc.) for outreach campaigns, ads, etc.
Syncing user data with CRMs (e.g. Salesforce)
Exporting internal data to task management (e.g. Airtable) software for ops

Scenario

You’re having a big tournament to celebrate your next heir to the throne. You put out lots of ad posters all over the land. Many knights sign up to compete in your tournament.

Source: Giphy

You build a data integration between all the ad service providers (e.g. Google Ads, Facebook Ads, etc.), ingesting data from those APIs and storing it in your BigQuery data warehouse.

Your team uses the ad campaign performance data, knight sign ups, tournament results , and spending from spectators to calculate the ROI of ad campaigns. These models will help your team better promote the next tournament more efficiently.

How do you implement it?

There are 3 ways you can accomplish data integration:

Software-as-a-service (SAAS)
- Stitch
- Fivetran
- Hightouch (reverse ETL)
- Census (reverse ETL)
Open-source software
- Mage
- Airbyte
Write the code from scratch or use open-source libraries
- Singer
- Meltano
- Mage
- Write custom code

Source: Giphy

Some benefits of using SAAS or open-source software is that you get some out-of-the-box features such as (not a comprehensive list):

When a 3rd party API gets updated, maintainers will help update it
When there is a duplicate record, you can automatically ignore it or update an existing record
Automatically track progress and status of synchronizations
Retry failed synchronizations
And more…

Who is it for?

Typically, data engineering owns data integrations. Data engineers can choose how it’s implemented (e.g. buy SAAS or implement open-source software).

Marketing, sales, operations, etc. can help influence which 3rd party sources need data integration or which 3rd party APIs need data synced to it from internal data warehouses.

Depending on which sources are required by specific business use cases, different solutions are required.

Source: Giphy

Companies use SAAS or a self-hosted open-source software to integrate data from common sources. When an uncommon source doesn’t exist out-of-the-box from a SAAS provider, teams typically write the code themselves to handle the synchronization.

In addition, when a team needs to export data from their internal systems to 3rd party APIs (e.g. Salesforce) and has very customized needs, they also typically write custom code to handle their specific use case.

Future and next steps

Once data is integrated into a common destination, you can begin to combine it with data from other sources, analyze it, then predict on it.

In order to accomplish that, you’ll first need to clean the data, wrangle it, and transform the data all through a data pipeline. Managing this process in a testable, repeatable, and observable way is critical.

Source: Giphy

In this data integration series, upcoming content will include:

Singer Spec: the data engineering community standard for writing data integrations
How to write your own data integration
How to build an end-to-end data pipeline to sync data