AWS Redshift (Part 1)

#aws #redshift #database

As an AWS solutions architect, you must set up a solution that helps the data analysts in your company process large volumes of historical data about released products. The data scientists and the developers suggest collecting the results of all the queries for additional analytics with Amazon EMR, Athena and SageMaker. What AWS solution can you use in this context?

To answer this question, you first need to know what type of database you are dealing with.

Generally, we can classify databases into two groups according to the approach they take, which determines the kind of questions we can answer efficiently with the data:

1. Online Transaction Processing (OLTP) databases:

OLTP databases, such as those hosted on RDS, handle a high volume of short, simple transactions. They rely on four main operations: Create, Read, Update and Delete (CRUD).

For example, with RDS you can CREATE a table containing products and their corresponding prices, READ the content of the table, UPDATE the names or the prices of the products, and DELETE a product that you will no longer sell to customers.
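The four operations described above can be sketched in a few lines of Python. This is only an illustration: an in-memory SQLite database stands in for an RDS instance, and the `products` table and its values are made up for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for an RDS database; the short,
# single-row statements are what matters, not the engine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define the table and add two hypothetical products
cur.execute("CREATE TABLE products (name TEXT PRIMARY KEY, price REAL)")
cur.executemany(
    "INSERT INTO products (name, price) VALUES (?, ?)",
    [("keyboard", 49.0), ("mouse", 19.0)],
)

# READ: list the content of the table
rows_before = cur.execute(
    "SELECT name, price FROM products ORDER BY name"
).fetchall()

# UPDATE: change the price of one product
cur.execute("UPDATE products SET price = ? WHERE name = ?", (24.0, "mouse"))

# DELETE: remove a product that is no longer sold
cur.execute("DELETE FROM products WHERE name = ?", ("keyboard",))
conn.commit()

rows_after = cur.execute(
    "SELECT name, price FROM products ORDER BY name"
).fetchall()

print(rows_before)
print(rows_after)
```

Each statement touches one or a few rows and finishes quickly, which is the typical shape of an OLTP workload.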

2. Online Analytical Processing (OLAP) databases:

OLAP databases handle a relatively low volume of long, complex queries that typically involve aggregations. They are used mainly for analytics.
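To make the contrast concrete, here is a small sketch of the kind of aggregation an analytical query performs. SQLite is again used only as a stand-in so the example runs anywhere, and the `sales` table and its rows are invented for illustration. On a real data warehouse the same query would scan millions of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, year INTEGER, amount REAL)")

# Hypothetical historical sales records
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("keyboard", 2021, 100.0),
        ("keyboard", 2022, 150.0),
        ("mouse", 2021, 40.0),
        ("mouse", 2022, 60.0),
        ("mouse", 2022, 20.0),
    ],
)

# One query reads every row and summarizes it per product
revenue_by_product = conn.execute(
    "SELECT product, COUNT(*) AS orders, SUM(amount) AS revenue "
    "FROM sales GROUP BY product ORDER BY revenue DESC"
).fetchall()

print(revenue_by_product)
```

Unlike a CRUD statement, this query reads the whole table and returns a summary rather than individual records.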

From these definitions, it is clear that our scenario calls for an OLAP database. An example of an OLAP database on AWS is Redshift.

Redshift is fully managed by AWS. It is a petabyte-scale data warehouse service.

Unlike RDS and many other OLTP databases, which store data by row, Redshift stores data by column. It also uses advanced compression and massively parallel processing (MPP). For analytical queries, this can make it up to ten times faster than traditional row-oriented databases.
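The following toy sketch shows why a columnar layout helps analytical queries. It is not how Redshift is implemented internally; the records, the `run_length_encode` helper and the "fields touched" counters are assumptions made purely for illustration.

```python
from itertools import groupby

# Row-oriented layout: all fields of a record are stored together
rows = [
    {"product": "keyboard", "region": "EU", "price": 49.0},
    {"product": "monitor", "region": "EU", "price": 199.0},
    {"product": "mouse", "region": "US", "price": 19.0},
]

# Column-oriented layout: all values of a column are stored together
columns = {
    "product": [r["product"] for r in rows],
    "region": [r["region"] for r in rows],
    "price": [r["price"] for r in rows],
}

# Summing prices from the row layout means reading every field of every record
fields_touched_row = sum(len(r) for r in rows)
total_from_rows = sum(r["price"] for r in rows)

# Summing prices from the column layout reads only the price column
fields_touched_column = len(columns["price"])
total_from_columns = sum(columns["price"])


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Similar values sit next to each other in a column, so they compress well
encoded_region = run_length_encode(columns["region"])

print(fields_touched_row, fields_touched_column)
print(total_from_rows, total_from_columns)
print(encoded_region)
```

Both layouts give the same total, but the columnar one reads a third of the fields here, and the gap grows with the number of columns in the table.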

Redshift helps you report on, visualize and analyze collected data. You can save the results of your queries to an S3 data lake, so you can run additional analytics with other AWS services like Athena and SageMaker.
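One way to save query results to S3 is Redshift's `UNLOAD` command. The sketch below only builds the SQL text; it does not connect to a cluster. The `build_unload_sql` helper, the bucket name, the IAM role ARN and the `sales` table are all hypothetical placeholders, so replace them with your own values before running the statement in a SQL client.

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build an UNLOAD statement that writes query results to S3 as Parquet.

    Hypothetical helper for illustration; it does not validate its inputs.
    """
    # UNLOAD takes the query as a quoted string, so single quotes inside
    # the query are escaped by doubling them.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Placeholder values: bucket, role and table are assumptions, not real resources
sql = build_unload_sql(
    "SELECT product, SUM(amount) AS revenue FROM sales "
    "WHERE year = 2022 GROUP BY product",
    "s3://example-analytics-bucket/redshift/revenue_",
    "arn:aws:iam::123456789012:role/ExampleRedshiftUnloadRole",
)
print(sql)
```

Parquet is chosen here because it is a columnar format that Athena can query directly from S3. CSV or JSON would also work if downstream tools expect them.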

Although Redshift is fully managed by AWS, by default a cluster is set up in only ONE Availability Zone, and it is not designed to ingest large volumes of data in real time.