Redshift Deep Dive

#aws #redshift #dataengineering #dataanalytics

Some best practices on maintaining good performance on your Redshift cluster(s)

Data Ingestion

Small files instead of large files using the COPY command

Split larger input files into smaller files. This helps use all compute nodes in a cluster rather than just the first one if a single large file is used. ¹
Make the number of files as a multiple of the number of slices in your cluster.

Use INSERT INTO SELECT from external S3 tables

Aggregate the incoming data. Reducing the number of write operations on your cluster helps speed up performance because of less throughput I/O required.
Select a subset of columns and / or rows instead of SELECT *
Manipulate incoming column data with SQL
Load data in alternative file formats such as AWS ION, Grok, Parquet, ORC etc.

Other best practices for Ingest

Save cluster resources for querying and reporting rather than on ELT. Remember that Redshift is designed for OLAP workloads and not OLTP. Performing a lot of write operations reduces its performance, and hence it is always better to have data aggregated so as to reduce the number of writes.
Filtering / aggregating incoming data can improve performance over COPY command.

Redshift is designed for OLAP workloads

Small write operations (~1 - 10 rows) have a similar cost to a larger write operation (~100K rows).
the DELETE command uses logical deletion and since the smallest immutable blocks in Redshift use 1MB of data, it is recommended to use it wisely. ²
Add a vacuum statement (VACUUM) after each delete operation to remove ghost rows. This is normally done automatically by Redshift ³
AUTO VACUUM DELETE will reclaim space and AUTO TABLE SORT will sort needed portions of the table.
If you are aware of your workload times for typical ELT / ETL operations, it is recommended to run VACUUM during off-peak periods as VACUUM is a resource intensive task.
VACUUM BOOST is another option as well, although it is again recommended to run it during off-peak periods because it is resource intensive.⁴
Run ANALYZE, typically after an ingest operation is complete on just the columns that WHERE predicates are filtered on.
AWS has a Redshift utility script to run checks on what tables can be regularly analyzed and / or vacuumed based on certain set of thresholds. This can be run as a part of a monthly / periodical cluster maintenance.

Note: the Python script linked above is an official AWS utility script but AWS does not guarantee the impact against your database, so use it well.

Workload Management (WLM)

Short Query Acceleration (SQA)

Redshift automatically identifies (using proprietary machine learning) short-running queries and runs them in a separate "short query" queue, if queuing occurs. This functionality is ON by default and does not cost extra $$.

Concurrency Scaling

Turning on Concurrency Scaling in Redshift helps reduce query queuing. When a redshift cluster queue is full, newer queries are routed to a transient( spun-up temporary) Redshift cluster. This process takes just a few seconds.
It does cost $$ but for every 24 hours of a redshift cluster's usage, a 1 hour credit for Concurrency Scaling is accrued, which basically means that it should be free for most Redshift customers.
Use WLM to limit ingestion / ELT concurrency to 2 - 3 transient clusters.
To maximize query throughput, use WLM to throttle the number of concurrent queries to 15 or less.
Limitations:
- Doesn't support queries on tables that use interleaved sort keys.
- Doesn't support queries on temporary tables.
- Doesn't support queries that access external resources that are protected by restrictive network or VPC configurations.
- Doesn't support queries that contain Python user defined functions (UDFs).
- Doesn't support queries that access system tables, PostgreSQL catalog tables, or no- backup tables.
- Doesn't support COPY or UNLOAD queries that access an external resource that has restrictive resource policies. External resources can include Amazon S3 buckets or DynamoDB tables. Policies can restrict access to a specific source VPC, source VPC endpoint, or source IP address. In some cases, you might need to remove policies on an external resource. Doing this means that COPY or UNLOAD queries accessing the resource are sent to the concurrency-scaling cluster.
- Concurrency scaling for write operations is not supported for DDL operations, such as CREATE TABLE or ALTER TABLE.
- Doesn't support ANALYZE for the COPY command.
- Doesn't support write operations on a target table where DISTSTYLE is set to ALL.
- Doesn't support COPY from Amazon Redshift Spectrum or Amazon EMR.
- Doesn't support write operations on tables with identity columns.
- Redshift supports concurrency scaling for write operations on only Amazon Redshift RA3 nodes, specifically ra3.16xlarge, ra3.4xlarge, and ra3.xlplus. Concurrency scaling for write operations isn't supported on other node types.

Query Monitoring Rules (QMR)

QMR is an extension of WLM.
It allows the automatic handling of poorly written queries (yes, we’ve all written those!)
It helps assign rules to a query that meets a particular criteria. For example, a rule can be setup to abort a long running query, or skipped.
QMR helps wasteful use of a cluster and also log resource-intensive queries that could be candidates for query tuning for performance improvements.
QMR + Redshift Advisor can serve as a good resource for optimizing usage of your Redshift cluster for optimum performance.
Use QMR rather than WLM to set query timeouts.
It is recommended to the save the superuser queue for administration tasks and canceling queries.

Sources

[1]: Redshift Best Practices

[2]: Efficiently deleting data from Redshift - Stack Overflow.

[3]: Reclaiming storage space in Redshift - AWS Docs

[4]: Vacuum - AWS Docs

[5]: Concurrency Scaling - AWS Docs