Are you concerned about Sydney property prices?
With Covid-19, tech workers have gained the freedom to work remotely on a permanent basis or in a hybrid arrangement as they return to the office.
As a result, many employees have opted for a lifestyle closer to the beach, the countryside, the city, suburbia, or richer cultural amenities.
With low unemployment figures in Sydney, should upwardly mobile professionals stay in Sydney or move to a different city?
Lesson Outcomes
In this lesson you will learn how to:
- Create a data warehouse with Amazon Redshift Serverless (Generally Available)
- Discover your data with the AWS Glue Data Catalog
- Query your data with Amazon Redshift Serverless (Generally Available)
What is Amazon Redshift Serverless?
Amazon Redshift Serverless has been generally available since 12 July 2022 (you may view the announcement here). It allows you to build a data warehouse without having to provision a Redshift cluster, and it easily scales to your usage needs.
As a data scientist or data analyst, you can use Query Editor v2 to query the data and answer analytical questions.
You can use Amazon Redshift Serverless in GA and access $300 of free AWS credits for 90 days in the following AWS regions (as of 12 July 2022):
- US East (Ohio, N. Virginia)
- US West (N. California, Oregon)
- Europe (Frankfurt, Ireland)
- Asia Pacific (Tokyo)
- Europe (London)
- Europe (Stockholm)
- Asia Pacific (Seoul)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
You may refer to the pricing of Amazon Redshift Serverless here. With the pay-as-you-go model, you only pay for the compute resources that you use.
What are the use cases?
You don't need a data engineer to help you provision data warehouse resources. Whether you work for a startup or an enterprise, you can access:
- Self-service analytics
- Insight generation on compute that scales automatically
Solution Architecture
This is my summary of the solution architecture for Amazon Redshift Serverless and Amazon Redshift ML
Dataset
This open-source dataset, 'Sydney House Prices', is provided on Kaggle.com here and covers the years 2000 to 2019.
Data Dictionary
| Column | Data Type |
| --- | --- |
| Date | Date |
| Id | String |
| suburb | String |
| postalCode | String |
| sellPrice | Integer |
| bed | Integer |
| bath | Integer |
| car | Integer |
| propType | String |
Tutorial: Getting started with Amazon Redshift Serverless
This tutorial follows the instructions provided by the documentation here.
Step 1: Sign into the AWS Management Console as an IAM user.
Step 2: Change the AWS region to US-East-2 (Ohio)
Step 3: Upload the Sydney house prices dataset into an Amazon S3 bucket.
Create two folders in the S3 bucket for data transformations, and name them 'raw data' and 'processed'.
Step 4: Create a crawler from the AWS Glue console by selecting a data source.
Step 5: I used an existing IAM role for AWS Glue
Create a database.
Step 6: Review the Crawler and click Run Crawler.
Wait for the crawler's processing job to complete.
Refresh your web browser and navigate to Databases -> Tables.
Review the schema created from the raw data.
Edit the schema to change the data type for the variables Date and postalCode to Date and String respectively.
You may review the updated schema changes.
Step 7: Navigate to Amazon Redshift in the search bar and click Try Amazon Redshift Serverless
Step 8: Select the default settings to create your Redshift Serverless data warehouse.
Because I nominated the Ohio AWS region, I will receive $300 of free AWS credits over the 90-day free trial of the Amazon Redshift Serverless service.
Under the default settings, a namespace and workgroup will be automatically created by Amazon Redshift Serverless.
Step 9: Create a default IAM role which can access any Amazon S3 bucket.
Refresh your web browser and you will see that the IAM policy 'Amazon Redshift Full Access' has been successfully associated with the IAM role. The VPC and security groups have also been created.
Step 10: Click Save Configuration.
The Redshift Serverless setup will take a few minutes to complete.
Note: My 90-day free trial of Amazon Redshift Serverless will expire on 2 April 2023. Please take note of your own free trial expiration date.
Step 11: Click on Query Editor to access Query Editor v2
Step 12: Double-click on Serverless:default and then the dev database.
Step 13: Create a table.
CREATE TABLE sydney_house_prices (
date DATE,
id VARCHAR(40),
suburb VARCHAR(100),
postalcode VARCHAR(40),
sellprice BIGINT,
bed BIGINT,
bath BIGINT,
car BIGINT,
proptype VARCHAR(40))
;
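If you would like to confirm that the table and its column types were created as expected, you can query the SVV_COLUMNS system view. This is a quick sanity check; the table name matches the CREATE TABLE statement above.
SELECT column_name, data_type
FROM svv_columns
WHERE table_name = 'sydney_house_prices';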
Step 14: Load data into a new table using the COPY command for Amazon Redshift.
COPY sydney_house_prices
FROM 's3://sydney-house-price/processed/SydneyHousePrices1.csv'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXX:role/service-role/AmazonRedshift-CommandsAccessRole-XXXXXXXXXXXX'
delimiter ','
dateformat 'YYYY-MM-DD'
IGNOREHEADER 1;
Note: If you have trouble loading data into the table, try these tips:
- Change the date format in your input CSV file from the Australian 'DD/MM/YYYY' style to 'YYYY-MM-DD' so it matches the dateformat option above (or use the alternative COPY command sketched below)
- Include the 'IGNOREHEADER 1' option in the COPY command to skip loading the header row
Check for load errors:
SELECT *
FROM stl_load_errors
ORDER BY starttime DESC;
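Alternatively, instead of converting the dates in the CSV, the COPY command's dateformat option can parse the Australian day-first dates directly. Below is a sketch, assuming the unconverted file has been uploaded to the raw folder of the bucket (the S3 path is a placeholder):
COPY sydney_house_prices
FROM 's3://sydney-house-price/raw-data/SydneyHousePrices.csv' -- placeholder: path to the unconverted file
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXX:role/service-role/AmazonRedshift-CommandsAccessRole-XXXXXXXXXXXX'
delimiter ','
dateformat 'DD/MM/YYYY' -- parse day-first dates without editing the CSV
IGNOREHEADER 1;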
Step 15: Check that data has successfully loaded into the table. Preview the first 10 rows.
SELECT *
FROM sydney_house_prices
LIMIT 10;
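As a further sanity check, you can confirm that the number of rows loaded matches the number of records in the source CSV file (excluding the header row):
SELECT COUNT(*)
FROM sydney_house_prices;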
Step 16: Query your new table with your own analytical questions.
Q1. What was the most expensive suburb in the year 2000?
SELECT MAX(sellprice), suburb, DATEPART(year, date) AS year, postalcode AS postcode, date, id
FROM sydney_house_prices
WHERE DATEPART(year, date) = 2000
GROUP BY 2, 3, 4, 5, 6
ORDER BY 1 DESC
;
The suburb Bondi Beach had the most expensive property sold in the year 2000 at $3.5 million.
Q2. What was the most expensive suburb in the year 2019?
SELECT MAX(sellprice),
suburb,
DATEPART(year, date) AS year,
postalcode AS postcode,
date,
id,
proptype
FROM sydney_house_prices
WHERE DATEPART(year, date) = 2019 AND proptype = 'house'
GROUP BY 2, 3, 4, 5, 6, 7
ORDER BY 1 DESC
;
Rose Bay was the most expensive suburb in the year 2019 with a property sold for $13.4 million.
Q3. What was the average price of a house in Concord in 2019 with two car park spots and two bathrooms?
SELECT AVG(sellprice),
suburb,
DATEPART(year, date) AS year,
proptype,
bed,
car,
bath
FROM sydney_house_prices
WHERE DATEPART(year, date) = 2019
AND proptype = 'house'
AND suburb = 'Concord'
AND bath = 2
AND car = 2
GROUP BY 2, 3, 4, 5, 6, 7
;
The average price of a house in Concord in the year 2019 with two car park spots and two bathrooms is $1,968,000.
Q4. Which suburb sold the most properties in the year 2019 and what was the average property price?
SELECT COUNT(proptype) AS propertycount,
AVG(sellprice) AS avg_pricesold,
suburb,
DATEPART(year, date) AS year,
postalcode AS postcode
FROM sydney_house_prices
WHERE DATEPART(year, date) = 2019
GROUP BY 3, 4, 5
ORDER BY 1 DESC
;
The suburb Warriewood sold the most properties in 2019 with an average selling price of $1,038,136.
Q5. Where can I live with a budget of between $1,200,000 and $1,500,000 in the year 2019 with three bedrooms, two bathrooms and one car park spot?
SELECT sellprice,
suburb,
DATEPART(year, date) AS year,
car,
bath,
bed
FROM sydney_house_prices
WHERE DATEPART(year, date) = 2019
AND sellprice BETWEEN 1200000 AND 1500000
AND bed = 3
AND bath = 2
AND car = 1
;
Within this budget, in 2019 I could afford to live in the suburbs Maroubra, Turramurra, Cremorne, Ryde, Sans Souci and Allambie Heights.
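To make this shortlist easier to rank, here is an aggregated version of the Q5 query (a sketch using the same filters) showing how many qualifying sales each suburb recorded and the average price paid:
SELECT suburb,
COUNT(*) AS qualifying_sales,
AVG(sellprice) AS avg_price
FROM sydney_house_prices
WHERE DATEPART(year, date) = 2019
AND sellprice BETWEEN 1200000 AND 1500000
AND bed = 3
AND bath = 2
AND car = 1
GROUP BY suburb
ORDER BY qualifying_sales DESC;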
Amazon Redshift Serverless (Generally Available) - Clean Up Resources
If you have finished using Amazon Redshift Serverless, you should follow best practice and clean up your resources by following the steps in the AWS documentation at this link.
I encourage you to make the most of the 90-day Amazon Redshift Serverless (GA) free trial and keep exploring with your $300 of AWS credits.
AWS Innovate - Coming Soon!
If you would like to learn more about AWS data or AI/ML, please join one of the FREE online conferences for your region.
Register at this link:
- Asia Pacific and Japan on 22 February 2023
- EMEA on 9 March 2023
- Americas on 14 March 2023 (this includes Pi Day celebrations)
References
Until the next lesson, Happy Learning! 😁