Team RudderStack for RudderStack

Posted on Sep 20, 2021 • Originally published at rudderstack.com

How To Access And Query Your Amazon Redshift Data Using Python And R

#python #redshift #r #rudderstack

Overview

In this post, we will see how to access and query your Amazon Redshift data using Python. We follow two steps in this process:

Connecting to the Redshift warehouse instance and loading the data using Python
Querying the data and storing the results for analysis

Since Redshift is compatible with other databases such as PostgreSQL, we use the Python psycopg library to access and query the data from Redshift. We will then store the query results as a dataframe in pandas using the SQLAlchemy library.

The purpose of this exercise is to leverage the statistical techniques available in Python to get useful insights from your Redshift data. Some of the insights that you can get include a better understanding of your customers' product behavior, predicting churn rate, etc.

We also have a dedicated blog for manipulating and querying your Google BigQuery data using Python and R, in case you are interested.

For this post, we assume that you have the data already loaded in your Redshift instance. In case you haven't, the best way to load your data to Redshift is to leverage Customer Data Infrastructure tools such as RudderStack. They allow you to collect your data across all the customer touch-points, and load them securely into Redshift -- or any other warehouse of your choice, with minimal effort.

Connecting To Your Redshift Data Using Python

To access your Redshift data using Python, we will first need to connect to our instance. As mentioned above, Redshift is compatible with other database solutions such as PostgreSQL. Hence, you can safely use the tools you'd use to access and query your PostgreSQL data for Redshift.

We will use the psycopg Python driver to connect to our Redshift instance. That said, please feel free to experiment with any other library of your choice to do so.

import psycopg2
con=psycopg2.connect(dbname= 'dbname', host='host',
port= 'port', user= 'user', password= 'pwd')

We highly recommend that you use the above code as part of your pipeline, and wrap it in a function that handles any errors. The parameters that you need are typical for connecting to any database:

Name of the database
Host Name
Port
User Name
Password

Executing Queries On Your Redshift Data Using Psycopg

Once the database connection is set up, we can start querying the Redshift data. We use SQL queries to narrow down the amount of data we want for our analysis.

To do so with psycopg, we perform these steps:

We get a cursor from our database connection, like so:

cur = con.cursor()

We execute the query from the table that we want to pull our data from:

cur.execute("SELECT * FROM `table`;")

Once the query is successfully executed, we instruct psycopg to fetch the data from the database. For further data analysis, it makes sense to get the complete dataset. Hence, we run the following command:

cur.fetchall()

Finally, we close the cursor and the connection, like so:

cur.close()
conn.close()

The most significant bit here is the SQL query that we execute to fetch the records from the dataset. You can also use SQL to do large chunks of your data preprocessing and set up a proper dataset to make your data analysis a lot easier.

For example, you can join multiple tables in the database, or use the Redshift-supported aggregation functions to create new fields as per your requirement.

Using The Queried Data For Data Analysis

Now that we have successfully queried our Redshift data and fetched it for our analysis, it is time to work on it using the data analysis tools we have at our disposal.

When it comes to Python, the most popular libraries for data analysis are NumPy and pandas:

NumPy is a popular Python library used mainly for numerical computing.
pandas is a widely-used data analysis library in Python. It provides a high-performance data structure called DataFrame for working with table-like structures.

No matter what kind of analysis you wish to do, you will need to use one of these two libraries to represent your initial data.

Loading The Data To NumPy

It is quite straightforward to turn your data into a NumPy array. We initialize a new NumPy array and pass the cursor containing the query results as a parameter.

Run the following code in your Python console:

import numpy as np
data = np.array(cur.fetchall())

Loading The Data To Pandas

You can also use pandas instead of NumPy for your data analysis. For this, however, the steps involved are a bit different.

Refer to the following code snippet:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql://scott:tiger@hredshift_host:<port_no>/mydatabase')
data_frame = pd.read_sql('SELECT * FROM `table`;', engine

As seen in the code above, we will use SQLAlchemy to connect to our Redshift instance using the connection credentials. Then, we use the read_sql method to make a SQL query on the database. Finally, we can load the results directly into a DataFrame and use it for our analysis.

How To Access And Query Redshift Data Using R?

Loading and querying your Redshift data is equally easy in R. We can use the RPostgreSQL package to connect to our Redshift instance and then run queries on the data.

Here is how to install the package in RStudio:

install.packages("RPostgreSQL")
require("RPostgreSQL")

The next step is to connect to the Redshift instance, as shown below:

drv <- dbDriver("PostgreSQL")
con <-dbConnect(drv,dbname="dbname",host="host",port=1234,
                user="user",password="password")
dbDisconnect(con)

Note: It is important to close the connection once you are done with pulling the data out of the database.

Querying the data and loading it into an R data frame is quite easy too:

df_postgres <- dbGetQuery(con, "SELECT * from `table`")

Summary

The first step in any data analysis process is to correctly procure the data and then store it in an easily readable format. In this post, we loaded the Redshift data using Python / R and then queried it to get the data useful for further analysis.

While we focused on Amazon Redshift in this blog, the process is also applicable to other databases like PostgreSQL. In this post, we used psycopg to connect to our Redshift instance -- the same Python connector can be used to connect to a PostgreSQL instance as well.

One of the advantages of residing your data within a database as opposed to other formats such as CSV files is the ability to query it using SQL. You can run complex SQL queries to preprocess your data effectively and save a lot of your time and effort in building statistical models for an in-depth data analysis.

Sign Up For Free And Start Sending Data

Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app. Get Started

DEV Community