Continuing the Big Data topic, I want to share with you this post about Google Cloud Data Fusion.
Cloud Data Fusion is based on the Cask™ Data Application Platform (CDAP).
CDAP was created by a company named Cask, which was bought by Google last year. CDAP was incorporated into Google Cloud and renamed Google Cloud Data Fusion.
Data Fusion is a fully managed CDAP on steroids 🧬.
From the Google Cloud page:
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
One important detail about Data Fusion is that it relies on Cloud Dataproc (Spark) under the hood, handling the cluster lifecycle (creation and deletion) for you.
The game-changer in Data Fusion is its amazing graphical interface, which gives users an easy way to build anything from a simple transformation pipeline to a complex one. Best of all: without writing a single line of code.
This post will detail the first two options: Wrangler & Integrate.
At the end of this post, you can watch a video tutorial where I show a step-by-step using Data Fusion.
Wrangler is the central place to prepare your raw data. You can upload files or connect to a variety of external sources, such as databases, Kafka, S3, Cloud Storage, BigQuery, and Spanner.
Right after you choose your source, you are redirected to the parsing mode.
As you can see at the top, there's a tab called Insights. There you can see some useful charts about your data:
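To give a feel for what the Insights tab charts, here's a rough stdlib-Python sketch of the kind of per-column statistics it visualizes. This is not Data Fusion code, and the column names and sample rows are made up for illustration:

```python
import csv
import io
from collections import Counter

# Toy CSV standing in for an uploaded file (hypothetical columns).
raw = io.StringIO(
    "name,state,age\n"
    "Ana,SP,34\n"
    "Bruno,RJ,\n"
    "Carla,SP,28\n"
)
rows = list(csv.DictReader(raw))

# For each column: count of non-empty values and the most frequent
# value, roughly what the Insights charts show at a glance.
stats = {}
for column in rows[0].keys():
    values = [r[column] for r in rows if r[column]]
    stats[column] = (len(values), Counter(values).most_common(1))

print(stats)
```

Running this shows, for example, that `state` has 3 non-empty values with `SP` appearing twice, and `age` has one missing value, exactly the sort of completeness and distribution facts the Insights tab surfaces graphically.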
In the Studio, you have all the tools you need to create your data pipeline: the source, the transformations, and the sink, each with a diversity of choices. Pick your tool!
On the main page of the Studio, the left panel holds the Source, Transform, Analytics, Sink, "Conditions and Actions", and "Error Handlers And Alerts" sections. The gray area is where you design your pipeline.
A simple but complete pipeline:
All available tools in the Studio:
You can also install more sources and sinks from the Hub:
After designing your pipeline in the Studio, you need to deploy it. Just click the "Deploy" button, and you'll see your pipeline on the next page. There you can run the job, watch the logs, configure a schedule, and see a summary of executions.
At the top center, click "Summary" to check some facts about your jobs.
This step-by-step was recorded and edited by me (sorry for any issues).
You'll be able to see a complete design and execution of a pipeline. The steps of this pipeline are:
- Ingest a CSV
- Parse the CSV
- Prepare some columns of the CSV
- Use the Join transformation tool
- Connect to a PostgreSQL (Cloud SQL) instance
- Get information about the states of Brazil
- Join it with the corresponding column of the CSV
- Output the result to a new table in BigQuery
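The steps above can be sketched in plain Python to show the logic the pipeline performs. This is only an illustrative stand-in, not anything Data Fusion generates: sqlite3 takes the place of the PostgreSQL (Cloud SQL) instance, the final list stands in for the BigQuery table, and the cities and state codes are example data:

```python
import csv
import io
import sqlite3

# Steps 1-3: ingest and parse a toy CSV, preparing the state-code
# column (trim whitespace, uppercase) as you would in Wrangler.
raw = io.StringIO("city,uf\nSao Paulo, sp \nRio de Janeiro, rj \n")
records = [
    {"city": r["city"], "uf": r["uf"].strip().upper()}
    for r in csv.DictReader(raw)
]

# Steps 4-6: a states table in an in-memory database, standing in
# for the Cloud SQL instance holding info about the states of Brazil.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE states (uf TEXT PRIMARY KEY, name TEXT)")
db.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [("SP", "Sao Paulo"), ("RJ", "Rio de Janeiro")],
)

# Steps 7-8: join each CSV row with the matching state; in the real
# pipeline, the joined result lands in a new BigQuery table.
output = []
for rec in records:
    (state_name,) = db.execute(
        "SELECT name FROM states WHERE uf = ?", (rec["uf"],)
    ).fetchone()
    output.append({**rec, "state_name": state_name})

print(output)
```

In Data Fusion, every one of these steps is a box you drag onto the canvas and configure; the point of the sketch is only to make the data flow concrete.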
As you can see, Data Fusion is an amazing tool for data pipelines. It is powerful because it can handle as much data as you have, taking advantage of Google Cloud features, and it makes designing workflows impressively easy. Another great feature is the possibility of creating your own connector: since Data Fusion is based on CDAP, you can develop a connector and deploy it to the Hub.
Data Fusion has been GA since last week, and a lot of big customers are already using it. This is just the beginning for this incredible tool.
As usual, please share this post and consider giving me feedback!
Thank you so much!