Redshift is a solution for big data warehousing from Amazon. Once you’ve created a data warehouse, called a cluster, you can use SQL tools to perform analysis fast. Really fast.
The data is stored in a relational database and is queried using SQL as a query language. Queries are optimised and enhanced before being distributed among nodes. If you’re familiar with SQL you won’t need to learn anything new as the interface is based on PostgreSQL and can connect to many BI tools.
When a query is executed, it is distributed among nodes. The nodes get the results in parallel and present a single consolidated result to the leader node. Clusters can be scaled up and down as required meaning costs can be kept under control depending on demand.
Distribution keys are used for performing efficient joins between tables. These keys ensure joins are always on the same node, minimising performance losses.
Choosing the right distribution key style will have a huge impact on query performance. Distribution Keys can't be changed once a table is created.
Data is distributed based on a column you choose. IDs are good candidates for a distribution key as they are often used in WHERE and JOIN clauses.
The rows of a table are distributed in a 'round-robin' fashion, regardless of the values in the columns. As a result, you will not get good performance when filtering or modifying a large data set.
The entire contents of the table is distributed to each node in the cluster. This is fine for small tables but should be used with caution.
The data is stored column-wise, as opposed to row-wise, supporting high compression and in-memory operations. Columnar data stores are effectively an index for every column. Data of the same type with similar values are organised next to each other. Unlike other database solutions, the query doesn’t have to scan through every row and produces results much faster.
This post first appeared on helenanderson.co.nz
Photo by Bryan Schneider on Pexels