- What is it?
- Why do you need it?
- How do you implement it?
- Who is it for?
- Future and next steps
Data integration is the process of extracting data from an external source and loading it into an internal destination, such as a data warehouse. The process can also run in the opposite direction, exporting internal data to an external tool (commonly called reverse ETL).
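The two directions can be sketched as the same three steps with the source and destination swapped. This is a minimal illustration, not any specific tool's API; all function and variable names here are invented:

```python
# Minimal sketch of the two directions of data integration.
# Names are illustrative, not from any specific library.

def etl(extract, transform, load):
    """Classic ETL: pull from an external source, reshape, push internally."""
    raw_records = extract()            # e.g. read from a SaaS API
    clean_records = transform(raw_records)
    load(clean_records)                # e.g. write to a data warehouse

def reverse_etl(extract, transform, load):
    """Reverse ETL: same steps, but the source is internal and the
    destination is an external 3rd party tool."""
    internal_records = extract()       # e.g. query the data warehouse
    payload = transform(internal_records)
    load(payload)                      # e.g. push to a marketing tool's API

# Toy usage: move records between two in-memory "systems".
source = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
destination = []
etl(
    extract=lambda: source,
    transform=lambda rows: [r for r in rows if r["amount"] > 0],
    load=destination.extend,
)
```

In practice the `extract` and `load` steps are API clients and warehouse writers, but the shape of the pipeline stays the same.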
You regularly collect gold from scattered kingdoms that have pledged loyalty to your throne. You collect their gold payments through Stripe.
Your treasury (aka finance team) wants to better understand which demographics of kingdoms pay well and on time.
They need the gold payment data from Stripe in your Snowflake data warehouse so that they can combine it with the kingdom census data. Once combined, they can analyze the data and report it to the high council.
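Once both datasets are in the warehouse, the treasury's analysis boils down to a join between payments and census records. A toy sketch in plain Python (field names are made up for illustration; in the warehouse this would be a SQL `JOIN`):

```python
# Hypothetical rows after loading Stripe payments and kingdom census data.
# All field names are invented for illustration.
payments = [
    {"kingdom_id": "north", "amount": 500, "days_late": 0},
    {"kingdom_id": "south", "amount": 300, "days_late": 12},
]
census = {
    "north": {"population": 90_000, "region": "mountains"},
    "south": {"population": 40_000, "region": "coast"},
}

# Join each payment to its kingdom's demographics.
report = [
    {**p, **census[p["kingdom_id"]]}
    for p in payments
    if p["kingdom_id"] in census
]

# Which demographics pay on time?
on_time = [r for r in report if r["days_late"] == 0]
```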
Use cases for integrating external data with internal systems include (but aren’t limited to):
- Combining data from multiple sources to build comprehensive data models for business use cases
- Analyzing data from external SaaS tools
- Combining data for personalization
- Getting internal application data into marketing tools (e.g. Mailchimp, Google Ads) for outreach campaigns, ads, etc.
- Syncing user data with CRMs (e.g. Salesforce)
- Exporting internal data to task management software (e.g. Airtable) for ops
You’re having a big tournament to celebrate your next heir to the throne. You put out lots of ad posters all over the land. Many knights sign up to compete in your tournament.
Your team uses the ad campaign performance data, knight sign-ups, tournament results, and spectator spending to calculate the ROI of ad campaigns. These models will help your team promote the next tournament more efficiently.
There are 3 ways you can accomplish data integration:
- Software-as-a-service (SaaS)
- Open-source software
- Write the code from scratch or use open-source libraries
Some benefits of using SaaS or open-source software are the out-of-the-box features you get, such as (not a comprehensive list):
- When a 3rd party API gets updated, the maintainers update the integration for you
- When there is a duplicate record, you can automatically ignore it or update an existing record
- Automatically track progress and status of synchronizations
- Retry failed synchronizations
- And more…
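Two of those features, retrying failed synchronizations and handling duplicate records, can be illustrated in a few lines. This is a hedged sketch of the general idea, not how any particular tool implements it; all names are invented:

```python
import time

def sync_with_retry(sync_once, max_attempts=3, base_delay=0.0):
    """Retry a synchronization a few times before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return sync_once()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * attempt)  # simple linear backoff

def upsert(table, records, key="id"):
    """Update existing rows that match on `key`, insert new ones.

    This is the "ignore or update duplicates" behavior: a record with an
    existing key updates that row instead of creating a duplicate.
    """
    index = {row[key]: row for row in table}
    for record in records:
        index[record[key]] = {**index.get(record[key], {}), **record}
    table[:] = index.values()

# Toy usage: a sync that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_sync():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient API error")
    return "ok"

result = sync_with_retry(flaky_sync)

# Toy usage: an incoming batch with one duplicate and one new record.
table = [{"id": 1, "amount": 100}]
upsert(table, [{"id": 1, "amount": 150}, {"id": 2, "amount": 75}])
```

Production-grade tools add exponential backoff, idempotency keys, and persisted sync state on top of this, which is a large part of what you pay for (in money or maintenance effort).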
Typically, data engineering owns data integrations. Data engineers choose how they're implemented (e.g. buying a SaaS product or deploying open-source software).
Marketing, sales, operations, etc. can help influence which 3rd party sources need data integration or which 3rd party APIs need data synced to them from internal data warehouses.
Depending on which sources are required by specific business use cases, different solutions are required.
Companies use SaaS or self-hosted open-source software to integrate data from common sources. When a connector for an uncommon source isn't available out-of-the-box from a SaaS provider, teams typically write the code to handle the synchronization themselves.
In addition, when a team needs to export data from their internal systems to 3rd party APIs (e.g. Salesforce) and has very customized needs, they also typically write custom code to handle their specific use case.
Once data is integrated into a common destination, you can begin to combine it with data from other sources, analyze it, and make predictions from it.
To accomplish that, you'll first need to clean, wrangle, and transform the data through a data pipeline. Managing this process in a testable, repeatable, and observable way is critical.
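One common way to keep a pipeline testable and repeatable is to make each step a small pure function and chain them. A minimal sketch, with invented step names and a crude stand-in for observability:

```python
# Each step is a pure function: easy to unit-test in isolation,
# and the whole chain is repeatable given the same input.

def clean(rows):
    """Drop rows with missing amounts."""
    return [r for r in rows if r.get("amount") is not None]

def transform(rows):
    """Convert amounts from cents to gold pieces and tag the source."""
    return [{**r, "amount": r["amount"] / 100, "source": "stripe"} for r in rows]

def run_pipeline(rows, steps=(clean, transform)):
    for step in steps:
        rows = step(rows)
        # Crude observability: real pipelines emit metrics/logs per step.
        print(f"{step.__name__}: {len(rows)} rows")
    return rows

raw = [{"amount": 12345}, {"amount": None}, {"amount": 500}]
result = run_pipeline(raw)
```

Orchestration tools build on this same idea, adding scheduling, retries, and per-step monitoring around a chain of transformations.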
In this data integration series, upcoming content will include:
- Singer Spec: the data engineering community standard for writing data integrations
- How to write your own data integration
- How to build an end-to-end data pipeline to sync data