
Team RudderStack

Originally published at rudderstack.com

Redshift vs Snowflake vs BigQuery: Choosing a Warehouse

In our previous blog post, we described our experience setting up an open-source analytics stack using RudderStack, Apache Superset, and Redshift. When choosing a data warehouse for such a stack, it is common to pit the options against one another: Redshift vs. BigQuery, Redshift vs. Snowflake, and so on.

In this post, we will go into the reasoning behind choosing AWS Redshift over other popular data warehousing solutions such as Google BigQuery and Snowflake.

Background

AWS Redshift and Google BigQuery are among the market leaders in cloud-based data warehousing today. Because both use columnar storage, they offer improved analytical performance over traditional RDBMS. Both platforms provide mature, intuitive management consoles and a rich set of programmatic interfaces. Snowflake, on the other hand, has emerged as a modern data warehouse solution offering ease of access, instant scalability, and strong analytics and security features.

For this comparison, we used a data set that consists of approximately 20 million event records, triggered directly or indirectly by the users of a mobile casino game. The mobile application contains multiple games - both free and paid - that mimic the casino slot machines.

Data for comparison

We used the data from the following three tables:

  • revenue - details for purchase events
  • spin_result - details for events triggered by automatic or manual player spin action
  • tracks - parent table containing common details of all events

Evaluation Methodology

We compared Redshift, BigQuery, and Snowflake on two major aspects: performance and cost. For the performance evaluation, we ran a set of 11 queries - used extensively by the product managers of the gaming application - on each of the data warehouses. Because the pricing models differ across the three warehouses, we also attempted to define a common, usage-based standard for cost evaluation.

Some other considerations that guided this performance comparison exercise were:

  • Data loading time was not considered as part of this exercise.
  • Performance comparisons were done using two Redshift cluster configurations (one with a single dc2.large node and another with two dc2.large nodes). Larger configurations were deemed too costly for this setup, as discussed later.
  • All queries were executed from a Python program. We used the psycopg2 module for Redshift and the google.cloud module for BigQuery; for Snowflake, we used SnowSQL, the Snowflake command-line interface.
  • Multiple readings were taken to balance one-off factors such as network disruptions.
  • No platform optimizations, vendor-provided or otherwise, were implemented during the exercise.
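The per-query statistics reported later (first run, then average/median/standard deviation of the remaining runs) can be collected with a small harness along these lines. This is a minimal sketch: `run_query` is a hypothetical callable standing in for whichever client actually executes the query (a psycopg2 cursor, the google.cloud client, or a SnowSQL invocation), and the connection code is omitted.

```python
import statistics
import time


def benchmark(run_query, n_runs=5):
    """Time a query n_runs times. Report the first (cold) run separately,
    then summarize the remaining runs with average, median, and stddev."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_query()  # hypothetical stand-in for the warehouse client call
        timings.append(time.perf_counter() - start)
    rest = timings[1:]
    return {
        "first": timings[0],
        "avg": statistics.mean(rest),
        "median": statistics.median(rest),
        "stddev": statistics.stdev(rest),
    }
```

For Redshift via psycopg2, for example, one could call `benchmark(lambda: cursor.execute(sql))`; reporting the first run separately makes any caching effect visible.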

Cost Comparison: Redshift vs BigQuery vs Snowflake

When comparing the cost of storing the data and running the queries, the key takeaways for each data warehouse are:

AWS Redshift

  • For dc2.large with 160 GB of storage per node, the cost is $0.25/node/hour, that is, $180/node/month, with no additional charge for processing queries.
  • As our cluster uses two dc2.large nodes, the total cost comes to $360/month.

Google BigQuery

  • BigQuery charges $20/TB/month for storage and $5 per TB of data scanned by queries.
  • The total cost of running the 11 queries (only once) came up to roughly around $0.56.
  • The size of the dataset being queried is 39.30 GB, that is, 0.0393 TB. Therefore, the storage cost per month came up to around $0.79.
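The BigQuery figures above follow directly from its pricing model (prices as at the time of the exercise); a quick sanity check:

```python
# BigQuery on-demand pricing at the time of the exercise
STORAGE_PRICE = 20.0   # $ per TB per month
QUERY_PRICE = 5.0      # $ per TB scanned

dataset_tb = 39.30 / 1000              # 39.30 GB expressed in TB
storage_cost = dataset_tb * STORAGE_PRICE
print(round(storage_cost, 2))          # 0.79 ($/month)

# Working backwards, the $0.56 query bill implies the 11 queries
# scanned roughly this much data in total:
tb_scanned = 0.56 / QUERY_PRICE
print(round(tb_scanned, 3))            # 0.112 (TB)
```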

Snowflake

Snowflake offers several editions that can be chosen according to usage; for this exercise, we used the Business Critical edition. For storage, it offers two pricing options:

  • On-demand: Snowflake charges $40/TB per month for storage. Although our total cost of storage comes to $40 * 0.0393 TB = $1.57/month, there is a $25 minimum monthly charge for on-demand customers.
  • Pre-purchased: Snowflake offers a pre-purchased plan of $23/TB/month for data storage.

For querying, Snowflake credits pay for compute usage, as well as for any cloud-services consumption that exceeds 10% of the daily usage of the compute resources. The total number of credits used for executing the 11 queries was 1.866761388. Given that the cost per credit on the Business Critical edition is $4, the total cost of querying comes to $4 × 1.866761388 ≈ $7.5.

Key Takeaways

  • If the queries are run just once a month, Redshift ($360) is approximately 267 times more expensive than BigQuery ($1.35).
  • Storing the data and running the queries once cost far less on BigQuery ($1.35) than on Snowflake ($7.5).
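The headline ratio in the first takeaway can be reproduced from the figures in the cost section above:

```python
# Monthly cost figures from the cost section
redshift_monthly = 2 * 0.25 * 24 * 30   # two dc2.large nodes at $0.25/node/hour
bigquery_monthly = 0.79 + 0.56          # storage plus one pass of the 11 queries

print(redshift_monthly)                              # 360.0
print(round(redshift_monthly / bigquery_monthly))    # 267
```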

Performance Comparison: Redshift vs BigQuery vs Snowflake

The following table shows the performance comparison when it comes to Redshift vs BigQuery vs Snowflake, as per our exercise:

Each query was executed five times on each platform. The first-run time is reported separately; the remaining four runs are summarized by their average, median, and standard deviation. All times are in seconds.

**Redshift (1 × dc2.large)**

| Query | 1st run | Avg of next 4 | Median of next 4 | Std dev of next 4 |
|-------|---------|---------------|------------------|-------------------|
| Q1 | 17.66056 | 0.2816 | 0.2500 | 0.6006 |
| Q2 | 20.70354 | 0.2637 | 0.2534 | 0.0295 |
| Q3 | 17.46567 | 0.2683 | 0.2482 | 0.0415 |
| Q4 | 29.12982 | 0.5341 | 0.5384 | 0.0365 |
| Q5 | 7.10846 | 0.2470 | 0.2471 | 0.0048 |
| Q6 | 2.50200 | 0.26749 | 0.2518 | 0.0369 |
| Q7 | 2.47661 | 0.2643 | 0.2501 | 0.0395 |
| Q8 | 2.52171 | 0.256 | 0.2416 | 0.0285 |
| Q9 | 11.16986 | 0.2612 | 0.2556 | 0.0124 |
| Q10 | 28.49286 | 0.2741 | 0.2750 | 0.0340 |
| Q11 | 8.57887 | 0.8989 | 0.8791 | 0.0368 |

**Redshift (2 × dc2.large)**

| Query | 1st run | Avg of next 4 | Median of next 4 | Std dev of next 4 |
|-------|---------|---------------|------------------|-------------------|
| Q1 | 4.39874 | 0.5031 | 0.2509 | 0.4390 |
| Q2 | 3.93964 | 0.5189 | 0.2827 | 0.4307 |
| Q3 | 3.91162 | 0.2465 | 0.2462 | 0.0015 |
| Q4 | 4.70708 | 0.7002 | 0.6641 | 0.1830 |
| Q5 | 3.58540 | 0.2472 | 0.2473 | 0.0010 |
| Q6 | 3.47770 | 0.4200 | 0.3334 | 0.2030 |
| Q7 | 3.42400 | 0.2928 | 0.2860 | 0.2030 |
| Q8 | 3.59205 | 0.2606 | 0.2501 | 0.0204 |
| Q9 | 4.63495 | 0.2547 | 0.2497 | 0.0116 |
| Q10 | 10.89786 | 0.5594 | 0.2518 | 0.5352 |
| Q11 | 14.84757 | 0.6693 | 0.5452 | 0.2285 |

**Snowflake**

| Query | 1st run | Avg of next 4 | Median of next 4 | Std dev of next 4 |
|-------|---------|---------------|------------------|-------------------|
| Q1 | 4.352 | 1.5266 | 0.387 | 1.591 |
| Q2 | 1.733 | 0.6088 | 0.333 | 0.5621 |
| Q3 | 1.187 | 0.5088 | 0.349 | 0.339 |
| Q4 | 1.784 | 0.6556 | 0.399 | 0.5649 |
| Q5 | 1.847 | 1.0298 | 1.267 | 0.5730 |
| Q6 | 4.598 | 1.4046 | 0.354 | 1.6529 |
| Q7 | 0.365 | 0.3472 | 0.365 | 0.0287 |
| Q8 | 0.853 | 1.0672 | 1.264 | 0.423 |
| Q9 | 9.727 | 2.7792 | 1.346 | 3.4952 |
| Q10 | 7.381 | 2.1494 | 1.240 | 2.646 |
| Q11 | 2.992 | 0.89 | 0.358 | 1.051 |

**BigQuery**

| Query | 1st run | Avg of next 4 | Median of next 4 | Std dev of next 4 |
|-------|---------|---------------|------------------|-------------------|
| Q1 | 2.08194 | 0.6006 | 0.4390 | 0.0519 |
| Q2 | 0.68288 | 0.5473 | 0.5465 | 0.0422 |
| Q3 | 0.60482 | 0.5309 | 0.5229 | 0.0481 |
| Q4 | 0.63746 | 0.5203 | 0.5173 | 0.0282 |
| Q5 | 0.70232 | 0.8076 | 0.7428 | 0.2236 |
| Q6 | 0.55107 | 0.6096 | 0.6158 | 0.1063 |
| Q7 | 0.63842 | 0.5221 | 0.6105 | 0.1585 |
| Q8 | 0.58993 | 0.5653 | 0.5460 | 0.0448 |
| Q9 | 0.64536 | 0.5552 | 0.5297 | 0.0473 |
| Q10 | 0.54803 | 0.5445 | 0.5492 | 0.0204 |
| Q11 | 0.61591 | 0.6673 | 0.6370 | 0.1444 |

Key Takeaways

  • In terms of query execution time, the first-run readings on Redshift with one node are much higher than on both Snowflake and Redshift with two nodes.
  • Redshift appears to cache data between runs, while BigQuery does not by default.

Conclusion

Based on the performance and cost evaluation of the three warehouses, using Redshift as our preferred data warehouse solution makes sense. Given the large number of real-time or near-real-time dashboards accessed by many users, Redshift gives us the best ROI on both the performance and cost fronts.

For a detailed appendix that captures the actual times recorded during query execution, see the original post.

Editor's note: This article was originally published in December 2019. It has been completely rewritten for accuracy.
