Building a cloud data warehouse and business intelligence modernization
In this final tutorial for AWS Redshift I will explain how we can bake our cake and eat it too by loading data into the Redshift cluster through to creating data visualizations with AWS QuickSight.
If you are reading this post for the first time, you may wish to read the last three tutorials:
Table of Contents
- Quick Introduction
- High level architecture
I am an AWS Community Builder in data, a data scientist and also a Business Performance Analyst at Service NSW. I have consulted and instructed in data analytics and data science for government, financial services and startups for 6 years. It was at an AI startup and in Public Sector that I used Amazon services including Amazon S3, Amazon Sagemaker, AWS Glue, Amazon Redshift and Amazon QuickSight.
- Service NSW is part of Department of Customer Service
- A One stop shop NSW government agency
- Our vision to become the world’s most customer-centric government agency
We built a well-architected modern data analytics pipeline with a fully-managed, petabyte-scale cloud data warehouse using Amazon Redshift under the Department of Customer Service using AWS Organization.
With lake house architecture we integrated a data lake, data warehouse and business intelligence with security, performance, elasticity, data velocity, cost effectiveness and governance to design a unified analytics platform for our organization-wide data architecture roadmap.
Amazon Redshift has massively parallel processing (MPP) for fast and complex queries and can analyze all our data.
The AWS services used to build our analytics roadmap were:
High level architecture
In a busy contact centre we serve customers and generate petabytpes of data in structured and un-structured data formats. One of our initial questions on the journey with AWS included how do we store all of this data? How do we create a data lake to also bring in external data from another data lake?
Working in a sprint fashion, this greenfield government project included test, build and deploy. By understanding the current state we wanted to create a vision of the end state and how to improve customer experience for NSW citizens.
Using a Miro board to sketch the high level architecture (it's ok to draw what you think might be the final solution and then consult and iterate) collaborating with platform teams and also consulting with AWS to develop a solution using AWS Cloud Formation template.
With this greenfield project using my consulting hat, I helped our stakeholders develop the first use case, iterate and develop the next one and so forth. What is our minimum viable product? What are our blockers? IAM permissions and stopping and starting. Reading the database developer guide, testing with a sandbox environment and also working in a production environment.
1. Amazon S3 - Data Ingestion Layer
In building a solution for the first business use case to ingest data extracted from an external cloud and upload the csv files into folders within our Amazon S3 bucket (our data lake). Our team were given access to AWS S3 via IAM user roles under our AWS account.
2. Amazon Glue Studio - Data Processing Layer
Amazon Glue Studio was used to pre-process data from the 'source' Amazon S3 bucket stored in a csv file to extract, transform and load data back into a 'target' Amazon S3 bucket.
Amazon Glue Studio is a great data manipulation tool or ETL to visually inspect data pre-processing steps and them if there are errors, and Python code is scripted by updating nodes and successfully completing information at each check point. There are AWS Glue Studio tutorials to get started and you may also watch on-demand sessions from AWS re:invent 2021.
Dates can be updated from a string to a date data format quickly using AWS Glue Studio. At the time of writing this blog we are testing automated scripts to clean data from AWS S3 as our 'source'and AWS Redshift cluster as our 'target' and we are testing a sample table before a production solution is created.
3. Amazon Redshift - Data Consumption Layer
A data engineer granted permissions to the AWS Redshift cluster role. Using the Query Editor, a connection to the database was created and a table was created using SQL.
At the data consumption layer, data was loaded from an AWS S3 bucket into the AWS Redshift cluster using the COPY command.
4. Amazon QuickSight - Business Intelligence and Dashboards
Connect to the data stored in the Amazon Redshift cluster by connecting to the database using the credentials and import the data into Amazon QuickSight using SPICE(super-fast, parallel, in-memory calculation engine).
Once data is loaded into SPICE, you may inspect the data using a preview feature. Once data is loaded you may create custom visualizations such as a bar chart and Amazon QuickSight will provide business intelligence informing you of the Top 10 and Bottom 10 insights of the selected data fields with a short narrative which is very useful to help you understand your data.
Amazon QuickSight allows you to share your dashboard and schedule reports via email to select audiences and download the dashboard a a PDF.
Why use a Lake house approach?
We have the task to improve customer experience to understand NSW citizens by building a data architecture roadmap to include business intelligence, reporting with structured, semi-structured and unstructured data and building machine learning capability for sentiment analysis and topic modelling
The benefits of using a lake house approach included:
- Analyze all our data, unified data access
- Scalable workloads
- Run fast and complex queries
- Access to analytical services for ETL
A checklist of business requirements:
- Tableau Server and Amazon Redshift connectors
- Fast SQL query performance
- Unstructured data can be stored in Amazon S3 data lake
- Fully managed enterprise data warehouse on AWS
- Based on PostgreSQL
Amazon Redshift Architecture
Amazon Redshift is an enterprise data warehouse fully managed on AWS. Customer thinks about servers. How much RAM? How much CPU is needed?
- Deployment Considerations: Good data model, Optimize Amazon Redshift Cluster. Columnar storage, fewer joins, denormalizing table schemas and keeping columns sizes in tables as narrow as possible. All of these will improve query performance.
- Client applications: Amazon Redshift integrates with ETL and BI
- Connections: Amazon Redshift communicates with client applications using JDBC and ODBC drivers for PostgreSQL e.g. Tableau and Pentaho.
- Leader Node: Communicates with compute nodes to perform database operations.
- Compute Nodes: Leader node compiles code the execution plan and assigns the code to individual compute nodes.
- Node Slices: A compute node is partitioned into slices. Each node processes a portion of the workload assigned to the node.
Columnar storage for tables optimizes analytic query performance to drastically reduces the overall disk I/O requirements and the amount of data you need to load from disk.
With massively parallel processing (MPP), Amazon Redshift distributes the rows of a table to the compute nodes so data can be processed in parallel.
Amazon Redshift Cluster
The Amazon Redshift cluster was launched and configured using Cloud Formation template.
The component details of the cluster included:
- 1 cluster, 1 leader node and 3 compute nodes
- Node type: dc2.large with storage 160 GB/node. $0.25/node/hour
- DC2 high performance
- Cluster permissions: IAM role created for the Amazon Redshift role for our IAM User Group
- Cluster Details: Database details created with database port number 5439
- Network and security: Two subnets created in two different Availability zones to improve fault tolerance
Demo - create an Amazon Redshift cluster
Step 1: Login in to AWS Management Console
In the Amazon Management Console you may type Amazon Redshift in the search bar.
Step 2: Navigate to Amazon Redshift home page
Click on Create cluster.
Step 3: In Cluster configuration provide a name for your cluster
If your organization has never created an Amazon Redshift cluster you may access the Free Tier, otherwise click Production
Step 4: Configure cluster:
- Node type: dc2.large
- Nodes: 3
Step 5: Database configuration:
Step 6: Configure Cluster Permissions
- In order for Amazon Redshift to have access to Amazon S3 to load data an IAM role is created to access to cluster via a Redshift role
Step 7: Security Group
- VPC and security group attached to the VPC will be created by your administrator
- Subnet Group is also created by your administrator
Step 8: Click close to create cluster
It will take a few minutes for the cluster to be provisioned and you can check the status of the blue banner.
Step 9: Cluster created
After your cluster has successfully been created, the banner message will turn green and the created cluster will have a green tick with the wording 'available'.
Step 10: Inspect the properties of your Redshift cluster
- You may even download any JDBC and ODBC drivers
- You may resize the cluster if required.
Step 11: Click Query Editor
Connect to the database
Enter the database credentials that were used to create the Amazon Redshift cluster
Step 12: Create a Table
- Query Editor uses standard SQL commands to create a table.
Step 13: Load data from Amazon S3 into Amazon Redshift
- Use the COPY command.
Step 14: Fast Querying
- SQL queries may be saved and scheduled
- SQL queries can return results in milliseconds
- Export results
Step 15: Connect to Amazon Redshift database
- Upload data by connecting to the Amazon Redshift cluster.
Step 16: Review the data
- Preview and format data in Amazon QuickSight before it is loaded into SPICE.
Step 17: SPICE and Amazon QuickSight
Once data is stored in SPICE, data visualizations can be created for reporting dashboards to provide business insights.
Amazon QuickSight can provide the Top 10 and Bottom 10 insights and a narrative using machine learning to help you understand your data.
Dashboards can be easily shared via email and reports may be scheduled.
You may also watch on-demand from AWS re:Invent to learn more about the new add-on enterpise feature for AWS QuickSight Q that can help you author dashboards based on topics using natural language processing which is very exciting!
You can read the blog here
Stay informed with the latest AWS product updates from AWS Re:invent 2021:
- AWS Re-invent 2021 announcements link
Until next time, happy learning!
Top comments (2)
Adit here, fellow community builder.
Just a suggestion, you can make use of Series under post option.
Give the series a unique name. (Series visible once it has multiple posts)
Great Article series on redshift btw !!
Thanks for your useful tip and suggestion, I will ping you on Slack to show me how to use Series :) Happy New Year!