For those of us who manage large data sets across massive organizations with diverse needs, running our own servers and applications has proven to be beyond difficult.
Thus the rush to the cloud.
Google and Amazon are two of the largest cloud service providers. Many companies rely on their products and often mix and match offerings from both.
In this article, we want to discuss the concept of cloud data warehousing.
Specifically, we will look at Redshift and BigQuery, built by Amazon and Google respectively, and discuss which one is the better fit for your organization. So let's begin.
Before we get too far into the specifics, what is a data warehouse?
Simply put, a data warehouse is the lifeblood of any Business Intelligence or analytics team. It allows these teams to store and analyze useful data from domains across an organization (like finance, operations, and HR).
It also often makes data simpler to interact with, because the data is remodeled on its way from the original source system into the data warehouse.
Overall, the main aim of a data warehouse is to bridge the gap between the complex business processes and the analysts.
Up until recently, data warehouses were developed on-premises. This means the systems were limited to the server they lived on and often had to scale vertically, by moving to a bigger machine.
This could take a lot of time and extra human hours.
Now with the cloud, many data warehouse providers allow you as a company to scale up and down as needed.
In addition, many of these cloud providers have developed data systems that are data warehouse specific. This has made them faster and more efficient for analytical workloads than standard relational databases.
There are many options for how a team can approach using cloud data warehouses. Today we are going to focus on BigQuery and Redshift.
Basically, Amazon vs. Google.
So let's take a look.
BigQuery is a serverless, enterprise-level data warehouse built by Google on top of its Dremel query engine.
This application can execute complex queries in a matter of seconds on what used to be unmanageable amounts of data.
BigQuery supports SQL and is accessible through command-line tools as well as a web user interface. It is a scalable service that allows users to focus on analysis instead of handling the infrastructure.
Personally, I really enjoy the online web UI that BigQuery has. There is no need to set up any connectors or download any third-party tools to interact with the data.
Redshift is a column-oriented cloud-based data warehouse system built by Amazon. Some say it was so they could stop relying on Oracle after the CEO of Oracle boasted about Amazon needing Oracle to stay in business.
A Redshift cluster comprises multiple machines, each of which stores a fraction of the data.
These machines work in parallel, each on its own share of the data, so queries can run efficiently. A leader node manages data distribution and query execution across the compute nodes. This design is known as Massively Parallel Processing (MPP).
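To make the leader and compute node idea a little more concrete, here is a toy sketch in Python. It is not how Redshift is implemented: the hash function, the three "nodes", and the sample orders are all assumptions made up for illustration, and the per-node work runs one after another here rather than truly in parallel.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Pick a node from a stable hash of the distribution key (illustrative only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, dist_key, num_nodes):
    """Split rows into one slice per compute node, based on the distribution key."""
    slices = [[] for _ in range(num_nodes)]
    for row in rows:
        slices[node_for(row[dist_key], num_nodes)].append(row)
    return slices


def partial_totals(node_rows, group_col, value_col):
    """Work done on one compute node: aggregate only the rows stored locally."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row[group_col]] += row[value_col]
    return dict(totals)


def leader_merge(partials):
    """Work done by the leader: combine the partial results from every node."""
    merged = defaultdict(float)
    for partial in partials:
        for group, total in partial.items():
            merged[group] += total
    return dict(merged)


# Hypothetical sample data.
orders = [
    {"customer_id": "c1", "region": "east", "amount": 10.0},
    {"customer_id": "c2", "region": "west", "amount": 5.0},
    {"customer_id": "c3", "region": "east", "amount": 7.5},
    {"customer_id": "c4", "region": "west", "amount": 2.5},
]

slices = distribute(orders, "customer_id", num_nodes=3)
partials = [partial_totals(s, "region", "amount") for s in slices]
result = leader_merge(partials)
```

However the rows end up spread across the slices, the merged totals match what a single machine would compute. The choice of distribution key only changes how evenly the work is shared.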
Overall, both BigQuery and Redshift were designed with analytics in mind. Concepts like MPP and columnar storage were design decisions made to ensure that running analytical queries is efficient.
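A small sketch of why columnar storage suits analytical queries, using plain Python lists as a stand-in for real storage. The table and column names are hypothetical, and neither warehouse stores data as Python objects. The point is only that an aggregate over one column needs to read that column and nothing else.

```python
# Hypothetical sample table, stored row by row.
rows = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 5.0},
    {"order_id": 3, "region": "east", "amount": 7.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited just to read one field.
row_total = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is read.
column_total = sum(columns["amount"])
```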
Although both have on-demand and flat-rate options, BigQuery and Redshift differ significantly in terms of pricing. BigQuery charges for storage, queries, and streaming inserts, while Redshift charges for each node in a cluster.
At the time of writing, Redshift costs about $306 per terabyte per month for storage, with unlimited processing included, while BigQuery costs just $20 per terabyte per month for storage plus $5 per terabyte processed.
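To see what those numbers mean for a given workload, here is a rough back-of-the-envelope calculator. It uses only the figures quoted above and ignores everything else (streaming inserts, flat-rate plans, reserved nodes, discounts), so treat it as a sketch rather than a pricing tool. The 10 TB stored and 100 TB scanned workload is a made-up example.

```python
# Figures quoted in this article; real prices change over time.
REDSHIFT_PER_TB_MONTH = 306.0         # storage, processing included
BIGQUERY_STORAGE_PER_TB_MONTH = 20.0  # storage only
BIGQUERY_QUERY_PER_TB = 5.0           # per terabyte processed by queries


def redshift_monthly_cost(stored_tb):
    """Monthly cost when paying per terabyte stored, with queries included."""
    return stored_tb * REDSHIFT_PER_TB_MONTH


def bigquery_monthly_cost(stored_tb, scanned_tb):
    """Monthly cost when paying for storage plus terabytes scanned by queries."""
    return (stored_tb * BIGQUERY_STORAGE_PER_TB_MONTH
            + scanned_tb * BIGQUERY_QUERY_PER_TB)


def break_even_scanned_tb(stored_tb):
    """Terabytes scanned per month at which both bills come out equal."""
    gap = REDSHIFT_PER_TB_MONTH - BIGQUERY_STORAGE_PER_TB_MONTH
    return stored_tb * gap / BIGQUERY_QUERY_PER_TB


# Hypothetical workload: 10 TB stored, 100 TB scanned per month.
redshift_bill = redshift_monthly_cost(10)
bigquery_bill = bigquery_monthly_cost(10, 100)
break_even = break_even_scanned_tb(10)
```

With these figures, a team that stores 10 TB would need to scan several hundred terabytes a month before the BigQuery bill caught up, so query volume matters more to the comparison than storage volume does.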
Redshift uses Amazon IAM for identity while BigQuery uses Google Cloud IAM. BigQuery encrypts data by default, while in the case of Redshift you have to enable the option manually.
BigQuery abstracts away the details of the underlying database, configuration, and hardware. Redshift requires a much more in-depth understanding of how Redshift itself works, including concepts like distribution keys and MPP.
Amazon Redshift allows you to load data into it from just about anywhere. However, it connects most naturally with Amazon S3, and similar things can be said about Google BigQuery and Google Cloud Storage.
Both systems support data insertion in streaming fashion and both support data serialization in formats like JSON, CSV, and Avro.
There are also many other ways to load these systems. For example, Airflow and Luigi can orchestrate loads into either one, and AWS Glue is another option on the Amazon side.
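As a small illustration of the serialization formats mentioned above, the sketch below writes the same hypothetical records as newline-delimited JSON and as CSV using only the Python standard library. Avro is left out because it needs a third-party package. Actually uploading the files and starting a load is outside the scope of this sketch.

```python
import csv
import io
import json

# Hypothetical records to be loaded into a warehouse table.
records = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 5.0},
]


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


def to_csv(records):
    """Serialize records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


ndjson_payload = to_ndjson(records)
csv_payload = to_csv(records)
```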
Many traditional data warehouse and BI professionals are used to MERGE, UPDATE, and INSERT, a set of DML (data manipulation language) statements often found in Oracle, SQL Server, MySQL, and pretty much every other standard database.
However, Redshift and BigQuery don't necessarily support all of these statements in the same way.
For example, Redshift can perform upserts, but not through a straightforward MERGE statement.
To quote the AWS documentation at the time of writing:
Amazon Redshift doesn't support a single merge statement (update or insert, also known as an upsert) to insert and update data from a single data source.
However, you can effectively perform a merge operation. To do so, load your data into a staging table and then join the staging table with your target table for an UPDATE statement and an INSERT statement. For instructions, see Updating and Inserting New Data.
So this is really a workaround. Truthfully, this still makes me a little sad.
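The staging-table pattern from the quote can be sketched as follows. To keep the example runnable anywhere, it uses Python's built-in sqlite3 module as a stand-in for Redshift, so the SQL dialect differs in places (Redshift would typically use an UPDATE with a FROM clause, for instance). The customers and customers_staging tables are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE customers_staging (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
    INSERT INTO customers_staging VALUES (2, 'Grace', 'Arlington'), (3, 'Edsger', 'Austin');
""")


def upsert_from_staging(conn):
    """Merge staging rows into the target with an UPDATE followed by an INSERT."""
    with conn:  # run both steps in a single transaction
        # Step 1: update target rows that also appear in staging.
        conn.execute("""
            UPDATE customers
            SET name = (SELECT s.name FROM customers_staging s
                        WHERE s.id = customers.id),
                city = (SELECT s.city FROM customers_staging s
                        WHERE s.id = customers.id)
            WHERE id IN (SELECT id FROM customers_staging)
        """)
        # Step 2: insert staging rows that are not in the target yet.
        conn.execute("""
            INSERT INTO customers (id, name, city)
            SELECT s.id, s.name, s.city
            FROM customers_staging s
            WHERE s.id NOT IN (SELECT id FROM customers)
        """)
        # Clear staging so the next load starts from an empty table.
        conn.execute("DELETE FROM customers_staging")


upsert_from_staging(conn)
final_rows = conn.execute(
    "SELECT id, name, city FROM customers ORDER BY id"
).fetchall()
```

Running both statements inside one transaction matters here: without it, a reader could see the target table after the update but before the insert.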
BigQuery, on the other hand, does support the MERGE statement.
Up until about 2017 it was really designed as an append-only system, but BigQuery has since gained UPDATE, INSERT, and MERGE statements.
This provides a much more straightforward approach for the engineers developing their data warehouses on top of BigQuery.
It also continues the theme that BigQuery is just easier to use.
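For comparison, a single MERGE statement can do the same job in BigQuery. The helper below only builds the SQL text and does not talk to BigQuery. The dataset, table, and column names are hypothetical, and the string formatting is for illustration only, so it should not be fed untrusted input.

```python
def build_merge_sql(target, staging, key, columns):
    """Build a MERGE statement that upserts staging rows into the target table."""
    all_columns = [key] + list(columns)
    update_clause = ", ".join(f"{c} = S.{c}" for c in columns)
    insert_columns = ", ".join(all_columns)
    insert_values = ", ".join(f"S.{c}" for c in all_columns)
    return (
        f"MERGE {target} T\n"
        f"USING {staging} S\n"
        f"ON T.{key} = S.{key}\n"
        "WHEN MATCHED THEN\n"
        f"  UPDATE SET {update_clause}\n"
        "WHEN NOT MATCHED THEN\n"
        f"  INSERT ({insert_columns}) VALUES ({insert_values})"
    )


# Hypothetical dataset and table names.
merge_sql = build_merge_sql(
    target="analytics.customers",
    staging="analytics.customers_staging",
    key="id",
    columns=["name", "city"],
)
```

The resulting statement could then be pasted into the BigQuery web UI or submitted through a client library. It replaces the separate UPDATE and INSERT steps with one statement.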
Both Redshift and BigQuery are excellent data warehouses that help businesses gain useful insights. Before you make your selection, it is important to consider your table column lengths, business requirements, and technical talent.
Redshift offers a lot more flexibility in how you manage your resources. However, to operate a cluster, you need to understand many of the nuances behind Redshift. This can force your engineers to spend a lot of time fine-tuning your data warehouse.
On the other hand, BigQuery doesn't expect you to manage resources, and therefore it abstracts away all underlying configuration, hardware, and database details. Owing to its SQL-like interface, it is very user-friendly and easy to learn.
Although there are some differences, both come with their own strong points. So weigh the points above against your organization's requirements to make the best choice.