Rajesh Natarajan

Cloud Data Fusion, a game-changer for GCP

Continuing the Big Data topic, I want to share with you this post about Google Cloud Data Fusion.

The foundation

Cloud Data Fusion is based on the Cask™ Data Application Platform (CDAP).
CDAP was created by a company named Cask, which Google acquired last year. Google then incorporated CDAP into Google Cloud under the name Google Cloud Data Fusion.

Data Fusion
Data Fusion is a fully managed CDAP on steroids 🧬.

From the Google Cloud page:

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.

One important detail about Data Fusion is that it relies on Cloud Dataproc (Spark) and handles the cluster lifecycle (creation and deletion) for you.
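
To make the "create and delete for you" idea concrete, here is a minimal sketch of the create, run, delete pattern in plain Python. It does not call Dataproc or Data Fusion at all: `ephemeral_cluster`, `run_pipeline`, and the names passed to them are hypothetical and exist only for illustration.

```python
from contextlib import contextmanager

events = []  # records the order of operations, for illustration only


@contextmanager
def ephemeral_cluster(name):
    """Hypothetical stand-in for a cluster that lives only for one run."""
    events.append(f"create {name}")
    try:
        yield name
    finally:
        # Runs even if the pipeline fails, so the cluster never lingers.
        events.append(f"delete {name}")


def run_pipeline(pipeline, cluster):
    """Hypothetical stand-in for executing a pipeline on the cluster."""
    events.append(f"run {pipeline} on {cluster}")


with ephemeral_cluster("tmp-cluster") as cluster:
    run_pipeline("my_pipeline", cluster)

print(events)
```

The point of the pattern is that the cluster only exists while the job runs, which is the part Data Fusion takes off your hands.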

The game-changer in Data Fusion is its graphical interface, which gives you an easy way to build anything from a simple transformation pipeline to a complex one. Best of all: without writing a line of code.

This post details the first two options in the Data Fusion UI: Wrangler and Integrate.
At the end of this post, you can watch a video tutorial where I walk through Data Fusion step by step.

Wrangler
Wrangler is the central place to prepare your raw data. You can upload files or connect to a variety of external sources such as databases, Kafka, S3, Cloud Storage, BigQuery, and Spanner.

Right after you choose your source, you are redirected to the parsing mode.
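
For readers who think in code, the parsing step is conceptually similar to the sketch below: raw text is split into records, then columns are given types. This is only a plain-Python analogy, not how Wrangler is implemented; the sample data and the `parse_csv` and `set_types` helpers are made up for illustration.

```python
import csv
import io

# Hypothetical raw input, as it might look right after choosing a source.
RAW = "id,name,amount\n1,Ana,10.5\n2,Bruno,\n3,Carla,7\n"


def parse_csv(raw_text):
    """Split raw text into one dict per row, using the first line as header."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def set_types(rows, types, default=None):
    """Cast each named column; empty strings become `default`."""
    typed = []
    for row in rows:
        new_row = dict(row)
        for column, cast in types.items():
            value = new_row[column]
            new_row[column] = cast(value) if value != "" else default
        typed.append(new_row)
    return typed


records = set_types(parse_csv(RAW), {"id": int, "amount": float})
print(records)
```

In Wrangler you get the same kind of result by clicking through the UI instead of writing these functions.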

At the top of the parsing view, there's a tab called Insights, where you can see some useful graphs about your data.

Studio
In the Studio, you have all the tools you need to create your data pipeline: sources, transformations, and sinks, each with a wide range of choices. Pick your tool!

On the main page of the Studio, the left side lists Source, Transform, Analytics, Sink, "Conditions and Actions", and "Error Handlers and Alerts". The gray area is the canvas where you design your pipeline.
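
If it helps to picture what the canvas represents, the source, transform, and sink stages can be thought of as chained steps, roughly like this plain-Python sketch. It is an analogy only: the functions and sample rows are hypothetical, and Data Fusion runs the real thing on Dataproc rather than in a loop like this.

```python
def source(rows):
    """Source stage: emit records one at a time."""
    for row in rows:
        yield row


def transform(records):
    """Transform stage: drop records without an amount, add a derived field."""
    for record in records:
        if record["amount"] is None:
            continue
        yield {**record, "amount_cents": round(record["amount"] * 100)}


def sink(records, target):
    """Sink stage: write records to the target and return how many were written."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


# Hypothetical input; in Data Fusion this would come from a source plugin.
INPUT_ROWS = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 7.0},
]

output = []
written = sink(transform(source(INPUT_ROWS)), output)
print(written, output)
```

Each box you drag onto the canvas plays the role of one of these functions, and the arrows between boxes play the role of the function calls.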

[Image: a simple but complete pipeline]
[Image: all available tools in the Studio]
You can also install more sources and sinks from the Hub.
Pipeline
After designing your pipeline in the Studio, you need to deploy it. Just click the "Deploy" button, and you will see your pipeline on the next page. On this page, you can run the job, watch the logs, configure a schedule, and see a summary of executions.
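
Besides the "Run" button, a deployed batch pipeline can also be started over the CDAP REST API. The sketch below only builds the request and does not send it. The URL path follows the CDAP lifecycle API as I understand it, so please check it against the current documentation; the endpoint, pipeline name, and token are placeholders, and `build_start_request` is a helper I made up for this example.

```python
import urllib.request
from urllib.parse import quote


def build_start_request(api_endpoint, pipeline_name, token, namespace="default"):
    """Build (but do not send) a POST request that starts a batch pipeline.

    Assumption: batch pipelines run as the workflow `DataPipelineWorkflow`.
    """
    url = (
        f"{api_endpoint.rstrip('/')}/v3/namespaces/{quote(namespace, safe='')}"
        f"/apps/{quote(pipeline_name, safe='')}"
        "/workflows/DataPipelineWorkflow/start"
    )
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": f"Bearer {token}"}
    )


# Placeholder values; replace them with your instance's API endpoint,
# your deployed pipeline's name, and a real access token.
request = build_start_request("https://example.invalid/api", "my_pipeline", "TOKEN")
print(request.get_method(), request.full_url)
```

To actually start the run you would pass the request to `urllib.request.urlopen`, which is left out here so the sketch has no side effects.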

At the top center of the page, click "Summary" to check some statistics about your jobs.

Conclusion
As you can see, Data Fusion is an amazing tool for data pipelines. It is powerful because it can handle as much data as you have by taking advantage of Google Cloud features, and it makes designing the workflow impressively easy. Another great feature is the ability to create your own connector: because Data Fusion is based on CDAP, you can develop a connector and deploy it to the Hub.
Data Fusion has been GA since last week and already has a lot of big customers using it. This is just the beginning for this incredible tool.

As usual, please share this post and consider giving me feedback!

Thank you so much!
