Welcome to the R.A.G, a guide about Amazon's Redshift Database written for the Analyst's out there in the world who use it.
Redshift works by using Clusters, Nodes, Slices, Blocks, Caching, and Workload Management.
Let's first run through what these all are at a top level and then I'll run through how they all work together.
You hear 'Cluster' a lot in regards to Redshift but it's really not that important for you to stress over. 1 Cluster = 1 Redshift. A large majority of companies will just have one Redshift Data Warehouse, ergo just one cluster. A Cluster just refers to the cluster of nodes within it. That's about all an analyst needs to know. :)
Every Redshift Cluster has two types of nodes. It has only one Leader Node, and anywhere between 1 to 128 Compute Nodes, though in a single node cluster that singular lone node is both the Leader and the Compute.
The Leader Node is the manager. All requests and actions go through the Leader Node who takes a look at what's being asked of it and works out the best way to execute the task across it's Compute Nodes.
Compute Nodes, as mentioned in the previous article, are discrete units of both processing and storage, and so bits of tables will be stored on these nodes.
This is, in part, how Redshift can work so fast, with processing and storage split out amongst several nodes, each node is doing a smaller part of the whole job.
Slices are where the actual data is stored on the nodes. Compute Nodes can have 2,16, or 32 slices.
Data is stored on the slices from the bottom of the slice up. So if you build a table with at least one sort key, it will also store that data on the slice in that sorted order. Slices can contain thousands of bits of thousands of tables.
Slices are broken down into blocks. Each block is 1MB in size. You could imagine that one cell in a table is a Block, however most data that you would find in a cell would only make up a few bytes in size and so Redshift will attempt to fit as much data as it can inside a 1 MB Block.
This is also one of the things that can make Redshift so quick. Redshift stores, in a meta-data store, the minimum and maximum value of the information inside a Block, which allows the Compute Node to skip blocks of data that it doesn't need without actually looking at the underlying data.
This is where compression, something we will over later, really can help. The more data you can fit into a Block can, with the right sort key, tremendously speed up how quickly you'll get your results.
Redshift has two kinds of caching. Execution and Results.
When you first run a query the Leader Node will construct an execution plan, which is it's best guess of how to run this query efficiently based on what it knows about where and how the data is stored. It will then store that execution plan so that if you re-run the query again in the future, the Leader Node doesn't have to make a new plan, instead will just use the one stored in cache, thus saving time. This can mean that your first run of a query shouldn't be used as a benchmark.
Redshift will also store the results of a query in short term memory, so if you re-run the same query or someone else runs the same query then Redshift will provide the results stored in the cache rather than running the query. This can be really useful when querying result sets in the billions and then you accidentally close your session.
Workload Management or WLM is how Redshift manages it's resources.
WLM decides how many people can run a query at the same time and can determine how much CPU or RAM you can use up in your query.
This is all managed through user groups and queues.
Depending on how your Redshift Cluster is set up, this may impact you. It's not uncommon to have queues dedicated to super user tasks or for systems that feed into or feed from Redshift and so you may end up with a conservative amount of resources left over for you and your fellow analysts.
What this can mean is that WLM prevents you from really experiencing how fast Redshift can be as your query may have to wait until another query has finished running OR it may be killed by WLM for consuming too much of the allotted resources.
However it's not all doom and gloom. Redshift out of the box has a 'small queries queue' which is it's own separate queue that is just for 'select * from table limit 50' queries, meaning you don't have to wait for a colleagues large query, or an ETL load before you can get your results.
Redshift can also be configured to manage workload on the fly, not needing DBA's to set any hard rules, instead relying on Redshifts own Machine Learning to predict incoming query run times, and assigns them to the optimal queue for the faster processing.
Fundamentally, your results may vary.
Lets work through some hypothetical scenario's with my imaginary, above pictured, Redshift Cluster with four two-slice nodes. In my imaginary Redshift cluster is a table containing cars makes and models.
Say I tell Redshift to store all the white cars on Node#1, red cars on Node#2, yellow on Node#3, and black on Node#4. If I run a query that wants all the red and all the yellow cars then that query is going to go to the Leader Node, which knows which color cars are on which nodes and so it only communicates to Node#2 and Node#3. Bam! it all came back super quick, and Node#1 and Node#4 where never disturbed.
However If spread out my cars by color, and then ran a query looking for all Audi's then the Leader Node is forced to ask every node to 'have a look' to see if they have any Audi's. As you can imagine this wouldn't work so fast.
Lets continue with the cars being spread by color and lets also say that we have sorted the car data on the nodes by the year they were first manufactured which means the blocks in the slices are ordered from the bottom up. If I run a query looking for all cars manufactured after 2000 that were not black, that request will again go to the Leader Node, who knows to exclude the fourth node. That request then get passed to the Computer Nodes who know from the background meta-data tables that cars manufactured after 2000 don't start until 60% up the slice, and so immediately jump to the top end of the slice and start reading values to return. Again an example of how this architecture can speed things up.
However, if each row also had a decommissioned date, and we wanted all the cars that where decommissioned between 1980 and 1995 were not black, we still get that speed increase from the Leader Node knowing to avoid the fourth Node, however the Compute Nodes are forced to scan every block in their slices for values between 1980 and 1995 because we sorted it instead by Manufacture Date, now you can define multiple sort keys, but I'll cover that off in a later article.
Take Away: Redshift works by utilizing an architecture which when used properly can yield impressive speeds across billion fold data set. However you, the analyst, may not find this to be the case every time. If you have to run a query which runs counter to how the table is spread across nodes or not how the data is sorted on the slice then the results may be slow to come back. You may be limited, or outright have your query cancelled, by WLM.
It's important to stress this as all the flashy advertising and company success stories mask a reality where people who actually use Redshift may get frustrated at it's less than advertised speeds. The data can't be configured to suit everyone's queries, more often than not a data set is going to get configured in a way that benefits the most likely query scenario, which will be a guess and will leave a lot of other queries running a bit slower.
header image drawn by me