High level overview of Amazon Redshift

#datascience #serverless #cloud #aws

Amazon Redshift is a database, but also an analytics engine.
Redshift is based on PostgreSQL technology, but unlike PostgreSQL, it's not used for online transaction processing (OLTP).
It's actually an OLAP type of database, which means online analytical processing, and it's used for analytics and data warehousing.

AWS has advertised up to 10x better performance than other data warehouses, and Redshift scales to petabytes of data.

So the idea is that you load all your data into Redshift, and then you can analyze it very quickly from within Redshift.

Redshift gets this performance because it stores data in a columnar format instead of a row-based one, and it has a parallel query engine.
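To make the columnar idea concrete, here is a minimal toy sketch in Python. It is not Redshift's actual storage format; the `rows` and `columns` structures and both function names are made up for illustration. It only shows why an aggregate over one column has less data to touch when each column is stored on its own.

```python
# Toy illustration of row-based vs. columnar layout (not Redshift's real format).

# Row-based: each record is stored together, so a scan walks through whole records.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

# Columnar: each column is stored contiguously, so a query reads only what it needs.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 80.0, 50.0],
}


def total_amount_row_store(rows):
    """Sum one field, visiting every whole record along the way."""
    return sum(record["amount"] for record in rows)


def total_amount_column_store(columns):
    """Sum one field by reading only that single column."""
    return sum(columns["amount"])


row_total = total_amount_row_store(rows)
col_total = total_amount_column_store(columns)
print(row_total, col_total)
```

Both functions return the same total. The difference is that the columnar version never looks at `order_id` or `region`, which is what analytical queries over a few columns of a wide table benefit from.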

You pay as you go for the instances you provision in your Redshift cluster, and to run your queries you can use plain SQL statements.

Business intelligence tools such as Amazon QuickSight or Tableau integrate with Redshift.

And so if you had to compare Redshift and Athena: with Redshift, you first have to load the data.
For example, you load all the data from Amazon S3 into Redshift, and once it's loaded, Redshift is going to have much faster queries.
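Loading from S3 is typically done with Redshift's `COPY` command. The snippet below is a small sketch that only builds the SQL text; the table name, bucket, and IAM role ARN are placeholders I made up, and it assumes the files are in Parquet format. You would run the resulting statement through whichever SQL client you use against your cluster.

```python
def build_copy_statement(table, s3_uri, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3.

    Plain string formatting is used for illustration only; do not build
    SQL this way from untrusted input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# All three values below are hypothetical placeholders.
copy_sql = build_copy_statement(
    table="sales",
    s3_uri="s3://example-bucket/sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(copy_sql)
```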

Also, Redshift can do much faster joins and much faster aggregations because Redshift has something that Athena does not have.

Redshift does not use conventional indexes the way PostgreSQL does. Instead, you define sort keys and distribution keys on your tables, and Redshift uses them to get this very high performance for a data warehouse.

So if it's just an ad hoc query on data in Amazon S3, then Athena is going to be a great fit, but if it's intensive data warehousing with many complicated queries involving joins, aggregations, and so on, then Redshift is going to be a better candidate.

So your Redshift cluster has two kinds of nodes:
- Leader node: it does query planning and results aggregation.
- Compute nodes: they actually perform the queries and send the results back to the leader.

And because it's a Redshift cluster, you have to provision the node size in advance.
And if you want to save on cost, you can use reserved instances.

So your Redshift cluster has a leader node and some compute nodes. You submit a query in SQL to the leader node, and the query runs on the compute nodes in the backend.
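The leader and compute node split can be pictured as a scatter-gather pattern. Here is a rough simulation in Python, using threads in place of real nodes; the data slices and the `compute_node` and `leader_node` functions are hypothetical names for illustration, not Redshift internals. It mimics a `SUM(amount) ... GROUP BY region` query.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data slices, one per compute node: (region, amount) pairs.
node_slices = [
    [("EU", 120.0), ("US", 80.0)],
    [("EU", 50.0), ("APAC", 30.0)],
    [("US", 20.0)],
]


def compute_node(data_slice):
    """Compute node: aggregate only its own slice and return a partial result."""
    partial = {}
    for region, amount in data_slice:
        partial[region] = partial.get(region, 0.0) + amount
    return partial


def leader_node(slices):
    """Leader node: fan the work out in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(compute_node, slices))
    merged = {}
    for partial in partials:
        for region, amount in partial.items():
            merged[region] = merged.get(region, 0.0) + amount
    return merged


totals = leader_node(node_slices)
print(totals)
```

Each compute node only ever sees its own slice, and the leader only has to combine small partial results, which is the basic reason this layout parallelizes well.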

A cool feature of Redshift is Redshift Spectrum.
The idea is that you have data in Amazon S3 and you want to analyze it using Redshift, but you don't want to load it into Redshift first. On top of that, you want to use a lot more processing power. So you use Redshift Spectrum. You must have a Redshift cluster already available to start the query, and once you start it, the query is submitted to thousands of Redshift Spectrum nodes that run it against your data in S3.
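In practice, Spectrum is used by declaring an external schema and external table that point at S3, then querying them like any other table. The sketch below just holds the SQL as Python strings; the schema, database, table, bucket, and IAM role names are all made-up placeholders, and it assumes Parquet files and an AWS Glue Data Catalog database.

```python
# Hypothetical placeholder: replace with a role that can read your bucket.
iam_role_arn = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"

# Register an external schema backed by the data catalog.
create_schema_sql = f"""
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_db'
IAM_ROLE '{iam_role_arn}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# Describe the files in S3 as a table; the data itself stays in S3.
create_table_sql = """
CREATE EXTERNAL TABLE spectrum_demo.sales (
    order_id INTEGER,
    region VARCHAR(16),
    amount DOUBLE PRECISION
)
STORED AS PARQUET
LOCATION 's3://example-bucket/sales/';
"""

# Query it from the Redshift cluster like a regular table.
query_sql = """
SELECT region, SUM(amount) AS total_amount
FROM spectrum_demo.sales
GROUP BY region;
"""

for statement in (create_schema_sql, create_table_sql, query_sql):
    print(statement.strip())
```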

So this is it for an overview of Redshift.
