dev_neil_a

Posted on Aug 1, 2022 • Edited on Mar 9

Introduction to Data Analysis with Python Part 1: Using Pandas and Exporting to Excel

#python #programming #beginners #datascience

Introduction
What is Pandas?
Pre-requisites
Using Pandas to Import and Clean-up Data
References

Introduction

In this multi-part series, I'll be going over some of the basics that I've learned over the past few weeks, starting with Pandas.

By the end of the series, the topics that it will have covered include:

Importing data from a CSV or Excel file into Pandas
Cleaning up the data and removing unusable data.
Converting between datatypes.
Exporting and importing data to and from Excel files.
Performing operations against the data to create additional data, such as a total for example.
Using other libraries such as NumPy to perform numerical-based operations on data instead of Pandas.
Creating visualisations of the data with MatPlotLib, another library for Python.

Part one of this series will cover importing data from a CSV file, cleaning it up, converting some of it to different datatypes and then exporting it to Excel.

Let's begin by first looking at what is Pandas.

What is Pandas?

Pandas is an easy to use library for Python that you can use to perform data analysis and manipulation of data that can come from a number of sources, such as:

CSV files
Excel files
JSON files / API responses
Python lists, tuples or dictionaries

When you import data with Pandas, the data is stored in what is called a DataFrame. A DataFrame can be thought of as an Excel spreadsheet in terms of how it is presented.

Now, let's make a start using Pandas.

Pre-requisites

Prior to starting, you will need to have Miniconda installed, along with a conda environment configured.

For a list of all the required modules and their dependencies, go to the GitHub repository link at the bottom and use the environment.yml file in there to
install them with conda.

If you need to install and setup Miniconda, please see the Miniconda documentation here for the downloads and setup instructions. There is a file in the GitHub repository linked in the references section called environment.yml. This has a list of all the modules, libraries and dependencies that are needed for this series. You can use the file to install them with conda.

For simplicity, I would recommend using Jupyter Notebooks for any data analysis that you do with Python.

A plugin for Visual Studio Code is available to work with Jupyter Notebooks, if you use it for your text editor or IDE.

There is a Jupyter Notebook available with all the steps below in the GitHub repository linked in the references section.

Using Pandas to Import and Clean-up Data

Step 1. Import Pandas

import pandas as pd

Step 2. Create a Pandas DataFrame From the Data CSV File

In this section, a new Pandas DataFrame will be created from a CSV file.

Once that has been done, we will take a look at some of the data and the datatypes of each column in the DataFrame.

Step 2.1. Create the Pandas DataFrame From the CSV File

order_data = pd.read_csv("data/order-data.csv")

Step 2.2. Show the First Five Rows of the Pandas DataFrame

order_data.head(n = 5)

As you can see, the DataFrame looks very similar to a spreadsheet.

The number in the first column that doesn't have a column name is the index. This is automatically created by Pandas when it creates a DataFrame. It can be used for referencing data directly, which will be covered in another part of this series.

Step 2.3. Show the DataTypes of Each Column in the Pandas DataFrame

order_data.info()

An object datatype typically implies a string of text.

Step 2.4. Show the Total Row Count in the Pandas DataFrame

print(f"Total Rows (Before NaN Removal): {len(order_data)}")

Step 3. Check for NaN (null) Values and Clean-Up

Before performing any kind of data manipulation, the data needs to be cleaned up.

Firstly, the data will be checked for NaN (Not a Number - basically an empty (null value) cell) in any of the rows and columns (axes) and depending upon the criteria, will be either removed or changed.

Here are the rules for the clean-up of the data:

order_id missing: Delete row.
order_date missing: Delete row.
customer_name missing: Set name to "Jane Bloggs".
item_vendor_name missing: Set name to "Unknown".
item: Remove any "," from the description and replace with " -".
item missing: Delete row.
item_qty missing: Delete row.
price_per_unit missing: Delete row.
price_currency missing: Set to "GBP".

Prior to deleting any rows from the Pandas DataFrame, those rows will be added to a new Pandas DataFrame called order_data_removed so that it can then be added to a separate Excel worksheet for reference.

Step 3.1. Check for NaN In All Rows and Columns

order_data.isna().sum()

Step 3.2. Replace NaN Values (Where Required)

Replace NaN in customer_name with "Jane Bloggs":

order_data["customer_name"].fillna(value = "Jane Bloggs",
                                   inplace = True)

Note: By default, when you perform the above operation, it won't update the DataFrame permanently. It will only do it for that operation.

To get around this, you can either perform a reassignment (basically putting order_data["customer_name"] = at the beginning of the above) or, in some cases, you can use inplace = True to achieve the same thing. Not all functions have that argument available so you may need to use reassignment for that function (Spoilers: There are some in this article later on (3.3 is the first one)).

Check for any NaN values in customer_name:

print(f'customer_name NaN: {order_data["customer_name"].isna().sum()}')

Replace NaN in item_vendor_name with "Unknown":

order_data["item_vendor_name"].fillna(value = "Unknown",
                                      inplace = True)

Check for any NaN values in item_vendor_name:

print(f'item_vendor_name NaN: {order_data["item_vendor_name"].isna().sum()}')

Replace NaN in price_currency with "GBP":

order_data["price_currency"].fillna(value = "GBP",
                                   inplace = True)

Check for any NaN values in price_currency:

print(f'price_currency NaN: {order_data["price_currency"].isna().sum()}')

Step 3.3. Replace "," with " -" in item Column

Just to play safe, remove any "," in the item column with " -" so that there are no issues with exporting to CSV (if you wanted to) or any other format where "," could cause issues.

There could also be issues with running other operations that you may wish to use.

order_data["item"] = order_data["item"].str.replace(",", " -")

Step 3.4. Convert order_date Column To Date from Object

Check the order_date datatype before converting it and what it currently looks like (yyyy/mm/dd):

print(f'order_date Data Type (Before Conversion): {order_data["order_date"].dtype}')
print(f'order_date (Before Conversion):\n{order_data["order_date"].head(n = 5)}')

Now, perform the conversion to a datetime datatype that is more usable for Pandas:

order_data["order_date"] = pd.to_datetime(order_data["order_date"], 
                                          format = "%Y-%m-%d", 
                                          utc = False)

Check the order_date datatype after converting it and what it now looks like (yyyy-mm-dd):

print(f'\norder_date Data Type (After Conversion): {order_data["order_date"].dtype}')
print(f'order_date (After Conversion):\n{order_data["order_date"].head(n = 5)}')

Step 3.5. Convert item_qty Column To Integer from Float

Check the item_qty datatype before converting it and what it currently looks like (1.0 for example):

print(f'item_qty Data Type (Before Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (Before Conversion):\n{order_data["item_qty"].head(n = 5)}')

Now, perform the conversion to an integer datatype:

order_data["item_qty"] = order_data["item_qty"].convert_dtypes(convert_integer=True)

Check the item_qty datatype after converting it and what it now looks like (1 for example):

print(f'\nitem_qty Data Type (After Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (After Conversion):\n{order_data["item_qty"].head(n = 5)}')

Step 3.6. Round price_per_unit to Two Decimal Places

Check the price_per_unit before rounding it (up or down) and what it currently looks like (1.234 for example):

print(f'price_per_unit Data Type (Before Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (Before Rounding):\n{order_data["price_per_unit"].head(n = 5)}')

Next, perform the rounding to two decimal places:

order_data["price_per_unit"] = order_data["price_per_unit"].round(decimals=2)

Lastly, check the price_per_unit after rounding it (up or down) and what it now looks like (1.23 for example):

print(f'\nprice_per_unit Data Type (After Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (After rounding):\n{order_data["price_per_unit"].head(n = 5)}')

Step 3.7. Remove Lines With NaN Values (Where Required)

Before removing any rows with NaN values, place those rows into a separate Pandas DataFrame so that it can be exported later on, mostly for reference or perhaps to be used for other purposes, should they arise:

order_data_removed = order_data[order_data.isna().any(axis = 1)]

Next, show the first five rows of the order_data_removed Pandas DataFrame:

order_data_removed.head(n = 5)

Next, check for total NaN entries in the order_data_removed Pandas DataFrame:

order_data.isna().sum()

Lastly, as the clean-up criteria for NaN values has been completed, any remaining NaN values in the order_data Pandas DataFrame can now be removed:

order_data.dropna(inplace = True)

Check for any remaining NaN values. There should be none:

order_data.isna().sum()

Check the total number of rows in the order_data Pandas DataFrame:

print(f"Total Rows (After NaN Removal): {len(order_data)}")

Step 3.8. Reset the order_data and order_data_removed Indexes

As there have been some rows removed from the order_data Pandas DataFrame, the index will need to be reset, otherwise it will still have the old index in there that started at 0 and ended at 999 (1000 rows).

Note: By default, reset_index will create a new index and place the old index into a new column. To stop the old index being created in a new column, use drop = True.

First, reset the index of the order_data Pandas DataFrame:

order_data.reset_index(inplace = True, 
                       drop = True)

Show the last five lines of the order_data Pandas DataFrame:

order_data.tail(n = 5)

Next, reset the index of the order_data_removed Pandas DataFrame:

order_data_removed.reset_index(inplace = True, 
                               drop = True)

Show the last five lines of the order_data_removed Pandas DataFrame:

order_data_removed.tail(n = 5)

Step 3.9. Review The Current State of the order_data Pandas DataFrame

Show first five rows to see what the data looks like:

order_data.head(n = 5)

Next, check that all of the datatypes are correct:

order_data.info()

Step 4. Export The Two DataFrames To Excel

The final step will be to export the two DataFrames (order_data and order_data_removed) to an Excel workbook. Each DataFrame will be on its own worksheet.

with pd.ExcelWriter(path = "order_data_exported.xlsx", 
                    engine = "xlsxwriter",
                    date_format = "YYYY-MM-DD",
                    datetime_format = "YYYY-MM-DD") as writer:

    order_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data")

    order_data_removed.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data_removed")

You should now see the Excel file in the same folder where you ran the script / Notebook from. You can open it and review the contents.

This concludes part one of this series.

References

GitHub Repository Link

Miniconda Installation Documentation

DEV Community