Continuing with the Big Data topic, I want to share this post about Google Cloud Data Fusion with you.
The foundation
Cloud Data Fusion is based on the Cask™ Data Application Platform (CDAP).
CDAP was created by a company named Cask, which Google acquired last year. CDAP was then incorporated into Google Cloud as Google Cloud Data Fusion.
Data Fusion
Data Fusion is a fully managed CDAP on steroids 🧬.
From the Google Cloud page:
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines. With a graphical interface and a broad open-source library of preconfigured connectors and transformations, Cloud Data Fusion shifts an organization’s focus away from code and integration to insights and action.
One important thing to know about Data Fusion is that it relies on Cloud Dataproc (Spark), creating and deleting the cluster for you.
The game-changer in Data Fusion is its amazing graphical interface, which gives users an easy way to create anything from a simple transformation pipeline to very complex ones. The best part: without a single line of code.
This post will detail the first two options: Wrangler and Integrate.
At the end of this post, you can watch a video tutorial where I walk through Data Fusion step by step.
Wrangler
Wrangler is the central place to prepare your raw data. You can upload files or connect to a variety of external sources such as databases, Kafka, S3, Cloud Storage, BigQuery, and Spanner.
Right after you choose your source, you are redirected to the parsing mode.
As you can see at the top, there's a tab called Insights, where you can see some useful graphs about your data:
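Under the hood, every step you apply in Wrangler is recorded as a directive, and the whole recipe ends up in the "directives" property of the Wrangler plugin when you add the preparation to a pipeline. Below is a minimal sketch of what such a recipe might look like for a raw CSV column; the directive syntax is based on the CDAP Wrangler documentation as I remember it, and the column names are hypothetical:

```python
# A minimal sketch of a Wrangler recipe as it could appear in the
# "directives" property of the Wrangler transform plugin.
# Directive syntax is quoted from memory of the CDAP Wrangler docs;
# the column names (body, uf, state) are hypothetical examples.
wrangler_directives = "\n".join([
    "parse-as-csv :body ',' true",  # split the raw line into columns, first row is the header
    "drop :body",                   # drop the original raw column
    "rename :uf :state",            # give the column a friendlier name
    "uppercase :state",             # normalize the key used later in a join
])

print(wrangler_directives)
```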
Studio
In the Studio, you have all the tools you need to create your data pipeline: the sources, the transformations, and the sinks, each with a wide diversity of choices. Pick your tool!
The main page of the Studio: on the left you have Source, Transform, Analytics, Sink, "Conditions and Actions", and "Error Handlers And Alerts"; the gray area is where you design your pipeline.
A simple but complete pipeline:
All available tools in the Studio:
You can also install more sources and sinks from the Hub:
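Everything you draw in the Studio is saved as a JSON specification that you can export from the pipeline's actions menu, keep under version control, or deploy through the CDAP REST API. Here is a rough, heavily trimmed sketch of the shape of that document; the stage names and plugin properties are hypothetical placeholders, not the exact export format:

```python
import json

# Rough sketch of an exported Data Fusion/CDAP batch pipeline spec.
# The general shape (artifact + config.stages + config.connections) follows
# what the Studio exports; the stage names and plugin properties here are
# hypothetical and heavily trimmed.
pipeline_spec = {
    "name": "csv-to-bigquery",
    "artifact": {
        "name": "cdap-data-pipeline",  # batch pipelines; streaming ones use a different artifact
        "scope": "SYSTEM",
    },
    "config": {
        "stages": [
            {"name": "Source",   "plugin": {"type": "batchsource", "properties": {"...": "..."}}},
            {"name": "Wrangler", "plugin": {"type": "transform",   "properties": {"...": "..."}}},
            {"name": "Sink",     "plugin": {"type": "batchsink",   "properties": {"...": "..."}}},
        ],
        # connections are the arrows you draw on the canvas
        "connections": [
            {"from": "Source", "to": "Wrangler"},
            {"from": "Wrangler", "to": "Sink"},
        ],
    },
}

print(json.dumps(pipeline_spec, indent=2))
```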
Pipeline
After designing your pipeline in the Studio, you need to deploy it. Just click the "Deploy" button and you'll see your pipeline on the next page. There, you'll be able to run the job, watch the logs, configure a schedule, and see a summary of executions.
At the top center, click "Summary" to check some facts about your jobs.
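Everything you do on this page also has a programmatic equivalent: a Data Fusion instance exposes the standard CDAP REST API, authenticated with a regular Google OAuth access token. Below is a minimal sketch of starting a deployed batch pipeline and checking its runs from Python; the endpoint URL and pipeline name are hypothetical placeholders, and it assumes the pipeline was deployed in the default namespace:

```python
import google.auth
import google.auth.transport.requests
import requests

# CDAP endpoint of your Data Fusion instance (shown in the instance details)
# and the name of a deployed pipeline -- both are hypothetical placeholders.
CDAP_ENDPOINT = "https://my-instance-my-project-dot-usc1.datafusion.googleusercontent.com/api"
PIPELINE = "csv-to-bigquery"

# Data Fusion accepts a normal Google OAuth2 access token as a Bearer token.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())
headers = {"Authorization": f"Bearer {credentials.token}"}

# Batch pipelines built in the Studio run as the DataPipelineWorkflow program.
base = f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE}/workflows/DataPipelineWorkflow"

# Start a run of the deployed pipeline.
requests.post(f"{base}/start", headers=headers).raise_for_status()

# List recent runs and their status (RUNNING, COMPLETED, FAILED, ...).
for run in requests.get(f"{base}/runs", headers=headers).json()[:5]:
    print(run["runid"], run["status"])
```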
Step-by-step
This step-by-step was recorded and edited by me (sorry for any issues).
You'll be able to see a complete design and execution of a pipeline. The steps of this pipeline are:
- Ingest a CSV
- Parse the CSV
- Prepare some columns of the CSV
- Use the Join Transformation tool
- Connect to a PostgreSQL (Cloud SQL) instance
- Get information about the States of Brazil
- Join it with the corresponding column of the CSV
- Output the result to a new table in BigQuery (a quick way to verify that output is sketched below)
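When the pipeline from the video finishes, the joined result lands in a BigQuery table. Here is a small sketch of how you could verify that output with the BigQuery client library; the dataset, table, and column names are hypothetical and depend on how you configure the BigQuery sink:

```python
from google.cloud import bigquery

# Dataset, table, and column names are hypothetical placeholders; use whatever
# you configured in the BigQuery sink of your pipeline.
client = bigquery.Client()

query = """
    SELECT state, COUNT(*) AS rows_per_state
    FROM `my_dataset.states_joined`
    GROUP BY state
    ORDER BY rows_per_state DESC
    LIMIT 10
"""

# Run the query and print how many joined rows each Brazilian state got.
for row in client.query(query).result():
    print(row.state, row.rows_per_state)
```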
Conclusion
As you can see, Data Fusion is an amazing tool for data pipelines. It is powerful because it can handle as much data as you have, taking advantage of Google Cloud features, and it makes designing the workflow impressively easy. Another great feature is the possibility of creating your own connector: since it is based on CDAP, you can develop a connector and deploy it to the Hub.
Data Fusion has been GA since last week and already has a lot of big customers using it. This is just the beginning for this incredible tool.
As usual, please share this post and consider giving me feedback!
Thank you so much!
Top comments (7)
It's a great tool but very expensive if we want to create a few pipelines once and let them run daily.
Is there any way to reduce the pricing? As I understand it, the Data Fusion instance must run 24/7 to be able to execute a scheduled pipeline on a daily basis.
Hi there!
Thanks for reading my post :)
To answer you: yes, it is expensive. The focus of this product is big/giant companies.
But here's a tip: go to the GCP Marketplace and install CDAP with the "Click to Deploy" option.
The open-source, packaged version available there can do almost everything you have in Data Fusion. The best part: the cost is only for the server running CDAP and for your Dataproc cluster.
Thank you!
Hey Giuliano,
Thanks for this insightful article.
As I was saying, the price unfortunately prohibits small and medium companies from using it just for daily usage; they would rather use a third-party solution like Segment or others (which of course has fewer features). So it's a pity that GCP doesn't offer a special package for such an audience and use case.
Using CDAP from the Marketplace is indeed a possibility, but it's not serverless.
I was wondering if a trick like this could be done: save and export the pipeline so the instance can be switched off, then daily create an instance, import the saved pipeline, execute it, and shut the instance down afterwards?
So far, I haven't found a way to do it, unfortunately.
Keep me informed if by any chance you do.
Hi!
I'm currently searching for a serverless solution for ETL transformations, and I was considering GCP Dataflow, but the pricing is restrictive for us.
Our basic requirement is to read a JSON file from an API that returns 4,000 objects, transform the objects, and call an API at the destination for data import.
It's not possible to switch off a Dataflow instance as you asked, right?
Regards
Hi,
Dataflow is really not the tool for such a load; it's meant for much higher volumes.
Google Cloud Functions could probably be a cheap option, depending on your data transformation.
PS: My question was about Google Cloud Data Fusion, which is anyway not appropriate for your use case.
Hi!
Thanks for the reply. GCP Data Fusion is definitely not the right fit for my data integration requirements.
I meant to write Data Fusion instead of Dataflow, sorry for that; I'm reviewing so many tools that I mixed up the names.
Regards
Hi Giuliano,
Thanks for the informative blog post and the tip about the GCP Marketplace. I've managed to create the server for CDAP, but do you have any info about how to provision the Dataproc cluster alongside the server running CDAP? It seems that without running the plugin on a Dataproc cluster, the process of authenticating access to BigQuery and other Google Cloud sources is more complicated.
Thanks!