What's New in Data: re:Invent Andy Jassy Keynote

#datascience #analytics #database #aws

It's a different experience this year. The chat with my teammates is a mixture of discussion about new features and pictures of good times in Vegas from previous re:Invent conferences.

Andy Jassy has finished the first keynote of 2020 and I was not disappointed. Lots of great new features that we have use cases for.

Here are my favourite data-related features announced during the Andy Jassy re:Invent keynote.

Glue Elastic Views

Most data teams and customers I work with have data in multiple places. You might have a CRM system, an accounts system, document management and so on. Bringing all this data together and keeping it up to date in a 'single customer view' for analytics workloads is something data engineers spend a lot of time thinking about.

I've used Materialised Views heavily in the past to convert transactional data models into views more suitable for reporting queries.

Glue Elastic Views looks like a great fit when you have data in several different types of database and want Change Data Capture (CDC) and materialised-view style functionality across them.
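To make the idea concrete, here is a minimal sketch in plain Python of what CDC plus a materialised view amounts to: change events from two source systems are replayed into one combined customer view. This is not the Glue Elastic Views API. The event shape and the names `ChangeEvent` and `apply_change` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ChangeEvent:
    """One captured change from a source system (hypothetical shape)."""
    source: str                   # e.g. "crm" or "accounts"
    op: str                       # "upsert" or "delete"
    customer_id: str
    data: Optional[dict] = None   # column values, only used for upserts


def apply_change(view: Dict[str, dict], event: ChangeEvent) -> None:
    """Apply a single change event to the combined customer view."""
    row = view.setdefault(event.customer_id, {})
    if event.op == "upsert":
        # Prefix columns with the source so systems cannot overwrite each other.
        for column, value in (event.data or {}).items():
            row[f"{event.source}.{column}"] = value
    elif event.op == "delete":
        # Remove only the columns that came from the deleting source.
        for column in [c for c in row if c.startswith(event.source + ".")]:
            del row[column]
    else:
        raise ValueError(f"unknown operation: {event.op}")
    if not row:
        # No source knows this customer any more, so drop the row.
        del view[event.customer_id]


# Replay a small stream of changes into an empty view.
single_customer_view: Dict[str, dict] = {}
events = [
    ChangeEvent("crm", "upsert", "c1", {"name": "Ada"}),
    ChangeEvent("accounts", "upsert", "c1", {"balance": 120}),
    ChangeEvent("crm", "upsert", "c1", {"name": "Ada L."}),
    ChangeEvent("accounts", "delete", "c1"),
]
for event in events:
    apply_change(single_customer_view, event)
```

The view is only ever touched by incoming changes, so it stays current without re-running the full query against every source. As I understand it, that is the part a managed service would take off your hands.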

I can't wait to get hands-on with the preview. You can sign up today at https://aws.amazon.com/glue/features/elastic-views/

QuickSight Q

I was already a fan of QuickSight because of its pay-per-session pricing. It works out really well when you consider the minimum user licensing that some other data visualisation tools require.

I also like the features for embedding QuickSight dashboards into your applications.

The newly announced ability to ask questions of your data in natural language makes it even easier for end users of your applications to benefit from analytics in a more consistent and integrated way.

The Q feature is in preview and you can sign up at https://aws.amazon.com/quicksight/q/?nc=sn&loc=4

Check out the blog on QuickSight Q here https://aws.amazon.com/blogs/aws/amazon-quicksight-q-to-answer-ad-hoc-business-questions/

New gp3 EBS Volumes

You can now scale your storage volume performance independently of storage capacity. Oh, and it's up to 20% cheaper than gp2.
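Here is a rough sketch of why decoupling performance from capacity matters. The numbers are my reading of the published volume specs (gp2 baseline of 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000, gp3 baseline of 3,000 IOPS at any size). Treat them as assumptions and check the current EBS documentation before sizing anything real.

```python
# Assumed figures, see the note above.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP3_BASELINE_IOPS = 3_000


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2: baseline IOPS grow with the size of the volume."""
    return min(max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib), GP2_MAX_IOPS)


def gp3_baseline_iops(size_gib: int) -> int:
    """gp3: baseline IOPS do not depend on the size of the volume."""
    return GP3_BASELINE_IOPS


def gp2_size_needed_for(target_iops: int) -> int:
    """Smallest gp2 volume (GiB) whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target is above the assumed gp2 ceiling")
    if target_iops <= GP2_MIN_IOPS:
        return 1
    return -(-target_iops // GP2_IOPS_PER_GIB)  # ceiling division


small_volume_gib = 100
gp2_iops = gp2_baseline_iops(small_volume_gib)
gp3_iops = gp3_baseline_iops(small_volume_gib)
gp2_gib_to_match = gp2_size_needed_for(gp3_iops)
```

Under those assumptions a 100 GiB gp2 volume gets 300 baseline IOPS, and you would have to provision 1,000 GiB of gp2 just to match what the same 100 GiB gets on gp3.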

https://aws.amazon.com/about-aws/whats-new/2020/12/introducing-new-amazon-ebs-general-purpose-volumes-gp3/

Aurora Serverless v2

v2 claims to scale almost instantly, in a fraction of a second. Capacity is adjusted in fine-grained increments to provide just the right amount of database resources for what the application needs.
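A quick sketch of what fine-grained increments buy you compared with coarser steps. Both step models here are assumptions on my part rather than anything from the keynote: I model the coarse case as capacity doubling (1, 2, 4, 8, ...) and the fine-grained case as steps of 0.5 capacity units.

```python
import math


def capacity_doubling(required: float, minimum: float = 1.0) -> float:
    """Smallest capacity >= required when capacity can only double."""
    capacity = minimum
    while capacity < required:
        capacity *= 2
    return capacity


def capacity_fine_grained(required: float, step: float = 0.5) -> float:
    """Smallest capacity >= required when capacity moves in small steps."""
    return max(step, math.ceil(required / step) * step)


required_units = 5.2  # hypothetical load, in capacity units
coarse = capacity_doubling(required_units)
fine = capacity_fine_grained(required_units)
overprovisioned_by = coarse - fine
```

In this made-up case the coarse model lands on 8 units while the fine-grained one lands on 5.5, so the gap is capacity you pay for without using.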

The preview is currently limited to MySQL, and will have Aurora features like Global Database, Multi-AZ deployment and read replicas.

Babelfish for PostgreSQL

I've seen quite a number of database workload migrations to the cloud. Often these also involve moving from a commercial database engine to an open source engine like PostgreSQL. Tools like AWS DMS and Qlik Replicate do a good job of handling the data migration and the conversion of data types. What is often more time consuming is migrating database code, such as PL/SQL, to the open source equivalent.

Babelfish looks to address the database code migration problem for Microsoft SQL Server to PostgreSQL migrations.

Babelfish adds an endpoint to PostgreSQL that understands the SQL Server wire protocol, Tabular Data Stream (TDS), as well as commonly used T-SQL commands.

With Babelfish enabled, you don’t have to swap out database drivers or take on the significant effort of rewriting and verifying all of your applications’ database requests.
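As a sketch of what "no driver swap" could look like from the application side: the code keeps its SQL Server driver and its T-SQL, and only the host it points at changes. I use `pymssql` as the example driver. The host, credentials and table are placeholders, port 1433 is my assumption for the TDS listener, and whether a given statement works depends on Babelfish's T-SQL coverage.

```python
def babelfish_connection_kwargs(host: str, user: str, password: str,
                                database: str) -> dict:
    """Connection arguments for a SQL Server driver pointed at Babelfish."""
    return {
        "server": host,      # the PostgreSQL host with Babelfish enabled
        "port": "1433",      # assumption: TDS on the usual SQL Server port
        "user": user,
        "password": password,
        "database": database,
    }


# T-SQL syntax (TOP rather than LIMIT), left exactly as the app wrote it.
TOP_CUSTOMERS_TSQL = (
    "SELECT TOP 5 customer_id, name FROM dbo.customers ORDER BY name"
)


def fetch_top_customers(conn_kwargs: dict) -> list:
    """Run the unchanged T-SQL query; needs pymssql and a live endpoint."""
    import pymssql  # third-party driver, imported lazily on purpose

    conn = pymssql.connect(**conn_kwargs)
    try:
        cursor = conn.cursor()
        cursor.execute(TOP_CUSTOMERS_TSQL)
        return cursor.fetchall()
    finally:
        conn.close()


# Placeholder values; nothing is contacted until fetch_top_customers runs.
kwargs = babelfish_connection_kwargs(
    "my-babelfish-host.example.com", "app_user", "change-me", "sales"
)
```

The point is what is absent: no PostgreSQL driver, no rewritten query, just a different hostname in the config.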

Check out the AWS Open Source blog on Babelfish here https://aws.amazon.com/blogs/opensource/want-more-postgresql-you-just-might-like-babelfish/

AWS are going to open source Babelfish in Q1 2021. Until then, you can sign up for the Amazon Aurora preview. You can also check out the Babelfish community here https://babelfish-for-postgresql.github.io/babelfish-for-postgresql/

SageMaker Data Wrangler

In some industries up to 92% of analytics project time is spent doing data wrangling (sourcing, ETL, cleaning, etc.) to get data ready for the actual machine learning and analytics workloads.

Amazon SageMaker Data Wrangler claims to reduce the time it takes to aggregate and prepare data for machine learning, and to simplify data preparation and feature engineering. You can complete each step of the data preparation workflow, including data selection, cleansing, exploration and visualization, from a single visual interface.

https://aws.amazon.com/sagemaker/data-wrangler/

SageMaker Feature Store

Like data wrangling, feature engineering can be a time-consuming process. Once it's done, it makes sense to share the results with other people who might be developing machine learning workloads based on the same datasets.

Just as a data catalog enables an organisation to discover data assets, the new Feature Store in SageMaker provides a repository where you can store and access features, so it’s much easier to name, organise and reuse them across teams.
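To show the shape of the idea, here is a toy in-memory feature repository. It is not the SageMaker Feature Store API: the class `InMemoryFeatureStore` and its `put`, `get` and `list_groups` methods are hypothetical names for this sketch.

```python
from typing import Dict, List


class InMemoryFeatureStore:
    """Toy stand-in for a shared feature repository (hypothetical API)."""

    def __init__(self) -> None:
        # feature group name -> record id -> feature values
        self._groups: Dict[str, Dict[str, dict]] = {}

    def put(self, group: str, record_id: str, features: dict) -> None:
        """Store (or overwrite) the features for one record in a group."""
        self._groups.setdefault(group, {})[record_id] = dict(features)

    def get(self, group: str, record_id: str, names: List[str]) -> dict:
        """Fetch only the named features for one record."""
        record = self._groups[group][record_id]
        return {name: record[name] for name in names}

    def list_groups(self) -> List[str]:
        """Let other teams discover what has already been engineered."""
        return sorted(self._groups)


# One team writes features once...
store = InMemoryFeatureStore()
store.put("customer_features", "c1", {"orders_30d": 4, "avg_basket": 27.5})

# ...and another team reuses them by name instead of rebuilding them.
training_row = store.get("customer_features", "c1", ["orders_30d"])
available = store.list_groups()
```

The value is in the shared naming and lookup. The feature engineering happens once and everyone else reads the result.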

Check out the details here https://aws.amazon.com/sagemaker/feature-store/

SageMaker Pipelines

Bringing CI/CD to machine learning workloads, SageMaker Pipelines has been launched to help you automate the different steps of the ML workflow, including data loading, data transformation, training and tuning, and deployment.
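The steps listed above can be pictured as an ordered chain where each step hands its output to the next. This is a plain Python sketch of that idea, not the SageMaker Pipelines SDK. The step functions and `run_pipeline` are hypothetical, and the "model" is just an average so the example stays runnable.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[dict], dict]]


def load_data(state: dict) -> dict:
    # Stand-in for reading from a real data source.
    return {**state, "rows": [1.0, 2.0, 3.0, 4.0]}


def transform(state: dict) -> dict:
    # Scale values into the range 0..1.
    top = max(state["rows"])
    return {**state, "rows": [r / top for r in state["rows"]]}


def train(state: dict) -> dict:
    # Toy "model": the mean of the transformed values.
    rows = state["rows"]
    return {**state, "model": sum(rows) / len(rows)}


def deploy(state: dict) -> dict:
    # Stand-in for publishing the model somewhere it can serve requests.
    return {**state, "deployed": True}


def run_pipeline(steps: List[Step]) -> dict:
    """Run the steps in order, recording which ones completed."""
    state: dict = {"completed": []}
    for name, step in steps:
        state = step(state)
        state["completed"] = state["completed"] + [name]
    return state


result = run_pipeline([
    ("load", load_data),
    ("transform", transform),
    ("train", train),
    ("deploy", deploy),
])
```

Once the workflow is written down as data like this, it can be versioned, re-run on every change and audited step by step, which is the CI/CD part.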

Check out the details here https://aws.amazon.com/sagemaker/pipelines/