DEV Community

bronifty

Do We Need Spark? Or Can It Be Replaced With DBT? (Google BigQuery + DataProc Rant)

Hadoop is the C in CQRS: it is a function which takes raw data & produces a single result (an 'aggregate') such as sum, min, max, mean or mode. The results form 'aggregate' tables; agg tables are joined at their common edges (their shared keys) to define a 'fact' table, which is queried from 1 endpoint
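To make the raw data -> aggregate tables -> fact table idea concrete, here is a minimal sketch using Python's built-in sqlite3 in place of Hadoop. The table and column names (raw_orders, agg_total, agg_max, fact_orders, customer) are made up for illustration and do not come from any real pipeline.

```python
import sqlite3


def build_fact_table(rows):
    """Aggregate raw (customer, amount) rows and join the aggregates into one table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

    # Two 'aggregate' tables, each holding a single result per customer.
    conn.execute(
        "CREATE TABLE agg_total AS "
        "SELECT customer, SUM(amount) AS total FROM raw_orders GROUP BY customer"
    )
    conn.execute(
        "CREATE TABLE agg_max AS "
        "SELECT customer, MAX(amount) AS biggest FROM raw_orders GROUP BY customer"
    )

    # The 'fact' table joins the aggregates on their common key.
    conn.execute(
        "CREATE TABLE fact_orders AS "
        "SELECT t.customer, t.total, m.biggest "
        "FROM agg_total t JOIN agg_max m ON t.customer = m.customer"
    )

    # One query endpoint over the finished fact table.
    result = conn.execute(
        "SELECT customer, total, biggest FROM fact_orders ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


RAW = [("ann", 10.0), ("ann", 5.0), ("bob", 7.0)]
FACT = build_fact_table(RAW)
```

At real scale the same three steps run as distributed jobs rather than in one in-memory database, but the shape of the work is the same.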
all of this behind-the-scenes work is, in effect, a reverse proxy to your data
EMR (AWS), HDInsight (Azure), Dataproc (GCP), Cloudera (3rd party) and Hortonworks (3rd party, since merged into Cloudera) are all Hadoop distributions or managed Hadoop services; HDFS is the Hadoop file system underneath them, typically on-prem. 'Spark' replaces Hadoop's disk-based MapReduce with in-memory processing on a very large cluster of master/worker nodes, a server-client architecture like k8s but different
Do you need all these servers running all the time for these jobs, or can you just throw the workers inside something like a long-running serverless function such as AWS Fargate, which is a 'serverless' container (pretty neat stuff)? GCP has an equivalent there: Cloud Run...
Or can BigQuery do all the things? Excuse me my eye is watering. BigQuery is insane. I'm obsessed.
At some point I'd like to do some quickie tutorial comparing Business Intelligence Development Studio (BIDS) with SSIS, SSAS + SSRS to BigQuery with Omni and BigLake + Looker Studio (formerly Data Studio). But the MSFT stack is infamously difficult to work with and is not open source.
If someone knows BIDS well and/or MSFT and wants to throw some shade at me please do. But until I have better info I'm just saying it's difficult for a non-enterprise hobby dev to maneuver around the platform licensing, the lock-in and all the hoops to jump through.
BigQuery + Omni & BigLake is like a db that queries heterogeneous sources as one (a data fabric or mesh); Looker Studio is the view layer on top (frontend); SSIS is for transformation. DBT SaaS is something I'd like to look at for that: a task runner for SQL templated with Jinja, the Python templating engine
DBT is like vite (the task orchestrator) and the Jinja templates are like esbuild (the compiler), in that DBT (the SaaS, at least) has a scheduler (among other things) and Jinja takes multiple values and transforms them, whilst maintaining state, into a reduced singular truth
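A rough sketch of that compile-then-run split, with loud caveats: this is NOT dbt and NOT Jinja. It uses string.Template from the Python standard library as a stand-in for the templating step and sqlite3 as a stand-in for the warehouse. MODEL, compile_model, run_model and the todos table are all hypothetical names made up for illustration.

```python
import sqlite3
from string import Template

# Hypothetical 'model': SQL with placeholders, loosely the way a dbt model mixes
# SQL with template expressions. string.Template is a swapped-in stand-in for Jinja.
MODEL = Template(
    "SELECT status, COUNT(*) AS n FROM $source_table "
    "WHERE created >= '$start_date' GROUP BY status ORDER BY status"
)


def compile_model(template, **params):
    """Step 1, 'compile': render the template into plain SQL text.

    Plain string substitution is for illustration only; it is not safe
    for untrusted input.
    """
    return template.substitute(**params)


def run_model(conn, sql):
    """Step 2, 'run': hand the compiled SQL to the database."""
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE todos (status TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO todos VALUES (?, ?)",
    [("done", "2023-01-05"), ("open", "2023-01-06"), ("done", "2022-12-30")],
)

COMPILED_SQL = compile_model(MODEL, source_table="todos", start_date="2023-01-01")
RESULT = run_model(conn, COMPILED_SQL)
```

The point is only the division of labour: the template layer produces SQL text, and the database does the actual work.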
a fun tut might be creating data in a todo app saved in mysql then streaming it from mysql to bigquery, which is used as an analytic source for a dashboard spa on another page but in the same site as the todo app - like on the profile tab (let's analyze your todos)...
i'm sure we can figure out a way to integrate the Tanstack Table in there as a reporting layer for the BigQuery data.
While I'm comparing things let me reiterate java is dog water. In node we have npm - node package manager - & import/require for dependencies. In java you have a jar file (like a node module dependency) but what manages this? maven? ant? gradle? they have nothing to do with each other!

we have npm yarn pnpm they are all at the root related to npm. java has no such thing. it is a complete and total overengineered chaotic disaster. a dumpster fire. and yet all these hadoop frameworks are built on it... we must re-engineer all of this or find an alternative
I know Google likes JVMs but they also have Golang, which is better.
Jason is showing us how to run a serverless Dataproc Spark job in BigQuery (BQ is like an SSMS interface to run an SSIS job) and the java function required for this involves passing whole jar filepaths as dependencies in the parameters which is the worst of all possible worlds...
Deno is a pain because you have to refer specifically to a dependency which is centralized, but it's safe. NPM is unsafe, but at least you can import and refer to a shorthand. Java has NEITHER! You must manage these dependencies on your own filesystem. Ridiculous
Needless to say writing artisanal bespoke code in a java function that uses nested pythonic templates to handle io in dataframes (the equivalent of a sql table) is definitely like the jQuery imperative to a React/Angular/Vue declarative standardized framework.
DBT handles all that ornamentation for you. So you can just write SQL. And keep variable state in Python. I'm not sure if you even have to mess with Python much; I have to look into it further.
feast your eyes upon this state of the artistry called BigLake, a metadata extension to the data fabric providing command and query capability to google data engineering products like BigQuery.

https://youtu.be/iQilh-PHDvs?t=215

in it you will see Jason write a serverless function to run a Spark job against Dataproc from inside BigQuery. Dataproc is the Hadoop stack: essentially a filesystem (eg GCS or BigQuery native) and metadata (Hive), with io tools like Python wrapped in a Java function, distributed on GCP...
notice he writes the function in Java to communicate with the underlying Spark/Hadoop infra; the db connection is over JDBC; data is loaded into memory via Python; data is then transformed via a SQL query; finally the data is moved to new memory in Python & dumped back to the file sys via Java
It is absolutely state of the art because of the Apache Iceberg integration (Iceberg came out of NFLX) imho.
I don't think the java hadoop spark in dataproc part of it is state of the art, however. I think that part can be done much more easily and in a standardized fashion. But honestly it's not my area...
So I am really just shooting from the hip here. But much much much more research forthcoming on the topic. We will get to the bottom of this. Until next time adventures in anti-java drama land. Node ftw. We're moving to Fastify. We will do all the transformations with javascript
just kidding
Actually nope I'm not done here. What's the use case of Spark over just using Python Pandas dataframes directly? MPP (massively parallel processing). The same use case as Hadoop. Java has real multithreading and Python, thanks to the GIL, effectively does not for CPU-bound work...
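Here is a toy sketch of the split-the-work-then-combine shape that MPP engines rely on. It is plain Python, not Spark, and the helper names (partition, partial_stats, combine, parallel_mean) are made up for illustration. It uses a thread pool only to show the structure: in CPython the GIL keeps these threads from running CPU-bound code in parallel, which is the limitation the paragraph above is pointing at. An engine like Spark spreads the same steps across many processes and machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split the data into n_parts interleaved slices, one per worker."""
    return [values[i::n_parts] for i in range(n_parts)]


def partial_stats(chunk):
    """Each worker reduces only its own slice to a small (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(partials):
    """The driver merges the small partial results into the final mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def parallel_mean(values, n_parts=4):
    """Mean of a non-empty list, computed as map (per slice) then reduce."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, partition(values, n_parts)))
    return combine(partials)


MEAN = parallel_mean(list(range(1, 101)))
```

Only the small (sum, count) pairs travel back to the driver, never the raw rows, which is why this shape scales past what fits in one machine's memory.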

You know what Java does NOT give you by default? Overlapping async io. Like Node... Or C. (It exists in java.nio, but it is not the default programming model.)
Python is much better for CREATING magic ML models than Javascript because it's more powerful with data. But Javascript is much better if you already know the algorithm & want to hand over a library to process. Lab scientist in Python vs sys operator in JS
Boto3 is commonly used for AWS sys admin automation tasks and many people prefer Python for things like Pulumi IaC. But JS is powerful too. And it CAN be the singularity in scripting languages at least as a meme and for actual certain use cases. It will be a good experiment.

P.S. Check this multicloud data export and ingestion maneuver with simple SQL commands https://www.youtube.com/watch?v=hNRd6GsXxVE

P.P.S. My mind is exploding. Will delet. A Python notebook is an MD notepad with a REPL. Spark is driven from a Python lib (PySpark) which executes SQL. BigQuery can call a Python serverless function with SQL.
You may be saying ok what’s the big deal. I’m going to tell you right now what is the big deal. SQL is declarative. Now you’re getting a filtered loop execution just by saying select this. You don’t need to ornament the length, break, condition & iterator. In fact…
When the select is selecting a serverless function [to run per item returned by the table, which will be passed automatically as an argument] it is executing a command inside the query. An implicit command. The command is implicit…
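A small local sketch of both points, using Python's built-in sqlite3 rather than BigQuery. The notify function is a hypothetical stand-in for a remote serverless function, and the todos table is made up. sqlite3's create_function registers a Python function so that a SELECT can call it once per returned row.

```python
import sqlite3

CALLS = []


def notify(title):
    """Hypothetical side-effecting function, standing in for a serverless call."""
    CALLS.append(title)
    return f"notified:{title}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE todos (title TEXT, done INTEGER)")
conn.executemany(
    "INSERT INTO todos VALUES (?, ?)",
    [("write post", 0), ("ship it", 1), ("rest", 0)],
)

# Imperative version: you spell out the loop, the condition and the iterator.
imperative = []
for title, done in conn.execute("SELECT title, done FROM todos ORDER BY rowid"):
    if done == 0:
        imperative.append(title)

# Declarative version: the filter is the WHERE clause, and the function call is
# implicit, run once per selected row with the column passed as its argument.
conn.create_function("notify", 1, notify)
declarative = conn.execute(
    "SELECT notify(title) FROM todos WHERE done = 0 ORDER BY rowid"
).fetchall()
```

The query never mentions a loop, yet notify runs for every open todo. That is the implicit command hiding inside the select.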

https://youtu.be/d5QAKGdmcK4?t=1271
