Just in case you thought it was all over, re:Invent 2020 continued the second part of the show this week, albeit in 2021. It's been that kind of year for most of us.
re:Invent Part II was 3 days of feature-packed talks, finishing up on Friday 15th. I am a data engineering manager with a keen interest in serverless. This article summarizes the talks that I watched on data services on AWS. My team and I are heavy users of Redshift, so we were very happy with the number of Redshift talks and references in re:Invent.
Design best practices for architecting a data-to-insights solution on AWS
Adnan Hasan does a very good job in this level 200 talk in showcasing the updated AWS data ecosystem and how different architectures can be put together to create scalable systems. If you think you've seen it all before, you haven't. This is an updated view of their products with new features leveraged to augment previous patterns.
Adnan introduces and expands on the principle of "in-place analytics with minimal data movement", presenting 4 designs that could be applied to meet this principle. Interesting for me was to see Redshift being used as a data virtualization tool, allowing users to read data from both S3 and operational databases.
He then continues with best practices for an analytics and BI pipeline and how personas can be used to break up your system into distinct parts within corresponding security contexts.
Adnan closes with a few pointers on cost optimization. In summary it was a very interesting talk and a good place to start before progressing into the lower level talks below.
The lake house approach to data warehousing with Amazon Redshift
Vinay Shukla dives deeper in this level 300 talk on where Redshift fits in the lake house paradigm that has become fashionable lately.
He expands on the subject of personas used in Adnan's talk, suggesting that you should use them to drive your choice of analytics engine. This is a topic close to my heart and one that the cloud pay-as-you-go model makes a reality. Vinay explores 4 common data use cases and suggests patterns that could be applied to each. He also covers security and shows how different access levels can be applied in practice to separate personas using AWS Lake Formation.
Vinay closes with a slide each on how recently released Redshift features can be used to support a lake house architecture. These are Redshift Spectrum and Lake Formation integration, data lake export, bloom filters, cost controls for concurrency scaling and Redshift Spectrum, support for reading Apache Hudi and Delta Lake tables, federated query, materialized views on external tables and Redshift ML.
Getting the most out of Amazon Redshift automation
Paul Lappas gives a level 200 talk on the DW automation features that AWS has released for Redshift. The talk focuses on background automation features that can improve the performance of your cluster while reducing the operational overhead on your IT team. Paul shows how numerous features can help automate performance tuning.
The talk builds around the new Automatic Table Optimization (ATO) features of automatic distribution and sort keys. Redshift can now detect the most suitable columns to use as distribution and sort keys for your table. By enabling this feature, you are handing this decision over to AWS and taking it out of the hands and minds of your database developers. Redshift profiles the queries that are run against the tables to choose the best columns. All decisions are logged in a system table, so there is an audit trail you can monitor. Combining these two features with the existing automatic vacuum and table sort provides an extremely powerful data-driven solution to what can be a time-consuming decision for data developers.
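To make this concrete, here is a minimal sketch of how a team might opt a table in to automatic distribution and sort keys. It only builds the SQL strings; running them against a cluster is left to whatever client you already use. The table name `analytics.page_views` and the helper `ato_statements` are hypothetical, and I am assuming `svl_auto_worker_action` is the system view where the automated actions show up, so check the Redshift documentation for your cluster version before relying on it.

```python
import re
from typing import List

# Accept only plain (optionally schema-qualified) identifiers, so the table
# name can be interpolated into DDL without quoting surprises.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")

# Assumed audit view for automated actions; SELECT * avoids guessing columns.
AUDIT_QUERY = "SELECT * FROM svl_auto_worker_action;"


def ato_statements(table: str) -> List[str]:
    """Return the DDL that hands distribution and sort key choice to Redshift."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return [
        f"ALTER TABLE {table} ALTER DISTSTYLE AUTO;",
        f"ALTER TABLE {table} ALTER SORTKEY AUTO;",
    ]


if __name__ == "__main__":
    # Hypothetical table name, for illustration only.
    for statement in ato_statements("analytics.page_views"):
        print(statement)
    print(AUDIT_QUERY)
```

Generating the statements rather than hard-coding them makes it easy to roll the setting out table by table and watch the audit view as you go.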
Paul also spends time on how Redshift Advisor can be used to improve data ingestion, query tuning, table design and cost. A common reference architecture is shared and he shows how automation can make operational analytics and BI reporting easier and faster within the context of this architecture.
Paul finishes with a hands-on demo of the new ATO features and how they can improve query performance automatically for your end customers. An important point was made that this feature will primarily benefit customers who run ad hoc queries on the cluster. ETL workloads are more well defined, whereas customer queries can consist of multiple flexible access patterns. By allowing Redshift to automatically decide the distribution and sort keys of your tables, your most common patterns can be better served with this data-driven approach.
As a long time Redshift user, I can vouch for the improvements that AWS has released over the years to reduce the operational overhead on data teams. For example, we had built our own jobs to analyse and vacuum tables in the past. With the automation of these features from AWS, we can now deprecate these jobs and hand this operation over fully to AWS. This reduces the amount of code and support we need to carry in the team.
Deep dive on best practices for Amazon Redshift
Harshida Patel gives a deep dive level 400 talk on best practices for Redshift. She sets the foundation for her talk by reviewing the fundamental architecture of Redshift, discussing how to connect to the leader node and how the leader node coordinates activities across the compute nodes. Harshida then shows how Spectrum works and is aligned to the compute nodes of each cluster. From there we see how the architecture of the RA3 nodes differs from the traditional node types and then how AQUA will accelerate read performance of RA3 nodes even further.
With the baseline set, Harshida gives an overview of the new features including data sharing, Redshift Advisor and Automatic Table Optimization.
One thing I noticed here is that the storage available per RA3.4xl and RA3.16xl node is now 128TB. This was 64TB until very recently, so it is an interesting upgrade.
Harshida's recommendations are then grouped into the 4 fundamental areas of Table design, Data ingestion, Workload management and Cluster resize. She dives into the concepts and terminology relevant to each area with the best practices then summarized onto a single slide for each.
This is a great talk, bringing a consolidated focus on how to boost the performance of your cluster. Harshida gives clear, actionable recommendations that you can make to your cluster. I would definitely recommend this talk if you are operating a Redshift cluster in production.
How to build a test framework for your data lake
Marie Yap and Hemant Borole team up for this level 300 talk on the why and how of building a framework to test your data lake. Marie goes through the different steps in data testing, namely schema validation and data quality, and gives approaches for both. She then looks at the best point in your data pipeline to execute functional and performance tests versus load and verification tests.
Marie finishes her talk with a look at rightsizing your Glue jobs for executing tests and how this will affect the performance and cost of running such tests.
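The two kinds of check Marie describes, schema validation and data quality, can be sketched in a few lines of plain Python. This is only an illustration of the idea and not the Deequ API or the framework from the talk; the schema, the column names and both helper functions are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA: Dict[str, type] = {
    "order_id": int,
    "customer_id": int,
    "amount": float,
}


def schema_errors(rows: List[Row], schema: Dict[str, type]) -> List[str]:
    """Schema validation: each row has exactly the expected columns and types."""
    errors = []
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            errors.append(f"row {i}: columns {sorted(row)} != {sorted(schema)}")
            continue
        for column, expected in schema.items():
            value = row[column]
            # Nulls are a data quality concern, not a type error, so skip them here.
            if value is not None and not isinstance(value, expected):
                errors.append(f"row {i}: {column} is {type(value).__name__}")
    return errors


def completeness(rows: List[Row], column: str) -> float:
    """Data quality: fraction of rows where the column is not null."""
    if not rows:
        return 0.0
    return sum(row.get(column) is not None for row in rows) / len(rows)
```

In a real pipeline you would typically fail the run when `schema_errors` returns anything, and compare `completeness` against a threshold agreed with the data owners.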
Hemant then walks through a very detailed implementation of test pipelines for EMR, Redshift and Glue. The actual tests are stored in DynamoDB and the framework utilizes a number of AWS Step Functions state machines to coordinate the tests. Tests are executed using the open source Deequ framework by AWS Glue.
Hemant then provides a link to a GitHub repo where the Step Functions ASL code is available for adaption and utilization in your own systems. It looks like a great resource to get started.
I loved this talk as it's something my team is currently investigating. The level of detail that Marie and Hemant go to is great if you are working with this problem. If you are interested in finding out more about utilising Deequ in AWS for data testing, check out this article on the AWS blog https://aws.amazon.com/blogs/big-data/building-a-serverless-data-quality-and-analysis-framework-with-deequ-and-aws-glue/.
Understanding state and application workflows with AWS Step Functions
Not strictly data related, but if you're interested in Step Functions for orchestrating your data processes, this level 400 talk could be of interest. Rob Sutter does a great job summarizing improvements to the Amazon States Language (ASL), which is used to define individual Step Functions state machines. He also shows what happens in each state change of a single step, how to handle errors with retries and catchers, and closes with how to run steps in parallel up to a set concurrency.
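As a rough illustration of retries, catchers and bounded parallelism, here is a small state machine definition built as a Python dictionary and serialized to ASL JSON. It is a sketch under my own assumptions and not code from the talk: the state names, the `$.tables` input path and the placeholder Lambda ARN are all hypothetical, and I am assuming a Map state with `MaxConcurrency` is the kind of bounded parallelism Rob refers to.

```python
import json
from typing import Any, Dict


def build_definition(max_concurrency: int = 2) -> Dict[str, Any]:
    """Build a hypothetical state machine that loads tables in parallel."""
    load_table = {
        "Type": "Task",
        # Placeholder ARN: replace with a real function before deploying.
        "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:load-table",
        # Retrier: try again with exponential backoff when the task fails.
        "Retry": [{
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        "End": True,
    }
    return {
        "Comment": "Hypothetical example: bounded parallel loads with error handling",
        "StartAt": "LoadTables",
        "States": {
            "LoadTables": {
                "Type": "Map",
                "ItemsPath": "$.tables",
                # Run at most this many iterations at the same time.
                "MaxConcurrency": max_concurrency,
                "Iterator": {
                    "StartAt": "LoadTable",
                    "States": {"LoadTable": load_table},
                },
                # Catcher: anything still failing after retries goes here.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Fail",
                "Error": "LoadFailed",
                "Cause": "A table load failed after all retries",
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Building the definition in code and dumping it to JSON keeps the concurrency limit as a single parameter that is easy to tune per environment.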
If you're looking for a summary of re:Invent Part I, Helen Anderson does a great job of it here https://dev.to/aws-heroes/the-aws-re-invent-sessions-i-m-looking-forward-to-387m.
And if you want to learn more about Redshift, these two talks from December are a great place to start exploring other new Redshift features such as the Snowflake-melting data sharing, Multi-AZ (finally) and others.
New use cases for Amazon Redshift
What’s new with Amazon Redshift
There is a wealth of resources available on the re:Invent site that will be immensely useful to anyone developing on AWS. I find the 30-minute format just right for me, and having easy access to the slides is also great. If you're looking for more granular control of playback speed, check out Jeremy Daly's Chrome extension. It's very useful when you need something between 1x and 2x.
The data sessions linked in this article provide a great starting place with Redshift and also an update on recently released features. Personally, I am very excited to see the level of investment in the service. The upgrade to RA3 has made possible some fundamental changes to the legacy Redshift architecture and I hope we see more as 2021 progresses.