A long time ago in a galaxy far, far away ...
A group of people thought they could use whatever data they had, right away, to generate a killer analysis so advanced it would be light-years ahead of anything done before. And they failed instantly!
Unless your data is coming from a system that you built, that sits inside a bomb shelter, and that no one has access to except you, maybe then you can use whatever this system outputs out of the box. Otherwise, you will almost always have to do some sort of pre-processing on your data so it can be usable for your use case. More often than not, raw data (especially if it's collected manually) will contain empty values, wrong entry types ("four" instead of 4 in a "quantity" column), duplicated rows, strange column names and, my personal favorite, typos in a text column.
Transforming your data from its raw (sometimes called "dirty") state to a "clean" state, where it can be used for whatever you want, is called "data wrangling" (yes, the cowboy thing).
This includes (but is not limited to):
- Removing/replacing empty values.
- Removing duplicates.
- Renaming columns.
- Creating new columns.
- Dropping irrelevant columns.
- Changing columns data types.
- Enriching the data from external sources.
- ...
The good news is that if you know a little bit of python and have a basic understanding of the data wrangling operations, you can do most of the wrangling relatively easily using python, more specifically the pandas package.
So let's begin our "Data Wrangling with Python" crash course!
The following section assumes that you have python installed on your system. Either a normal python installation or a conda installation is fine.
(optional) Create a virtual environment:
With python, you can create several environments isolated from your base installation to develop against a specific python/package version, test things out, try new packages or new package features, or simply to stay organized. In most cases, it's a good thing. The following will create a folder named wranglingWithPandas, cd into it, create a virtual environment called venv_pandas and activate it.
In your terminal, type
mkdir wranglingWithPandas
cd wranglingWithPandas
python -m venv venv_pandas
On Windows:
.\venv_pandas\Scripts\activate
On Linux / macOS:
source ./venv_pandas/bin/activate
Installing pandas:
pip install pandas
(optional) Installing IPython or JupyterLab
It's advisable when working with data to work in an interactive environment. One way to do that is by installing the ipython or jupyterlab packages, whichever you are more familiar with.
(only one is required)
pip install ipython
pip install jupyterlab
To run them, type ipython or jupyter lab (note the space) in your terminal.
Importing data:
Pandas can import data from a wide variety of data sources, usually through a read_x function where "x" is csv, excel, json, ..., etc. In this tutorial, we will use a csv (comma separated values) file representing a transactions (purchases) log in a grocery store. Each row represents one item in a single transaction, and one transaction can have multiple items. Sometimes, manual entries happen. You can find the sample dataset and all the code in this tutorial in this GitHub repo. The goal of this tutorial is to wrangle this file to find the total revenue per item category and the top 3 purchased items.
To read the csv file into a "DataFrame" (pandas' way of saying "table"), we use the read_csv function:
import pandas as pd # A pandas convention.
import re # Regex library. We will use that later.
import pathlib # To manage paths in an OOP way.
import datetime # To manage dates.
file_path = pathlib.Path("store_data_20230116.csv") # Assuming the file is in the same directory.
store_data = pd.read_csv(file_path)
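The same pattern works for other formats too. As a quick illustration only (the file names below are hypothetical, and read_excel additionally needs an engine such as openpyxl installed):
other_data = pd.read_excel("store_data_20230116.xlsx") # Hypothetical Excel version of the same data.
other_data = pd.read_json("store_data_20230116.json") # Hypothetical JSON version.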
Data Exploration:
The purpose of data exploration is to understand the data at hand and to identify the issues it contains. One way to quickly explore it is to use the head and info methods as follows.
store_data.head()
store_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10038 entries, 0 to 10037
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Tran ID     10038 non-null  int64
 1   Cat         9956 non-null   object
 2   Item        9966 non-null   object
 3   Qty         10023 non-null  float64
 4   Unit Price  10011 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 392.2+ KB
And to explore the distinct values, you can use the unique method.
store_data.Item.unique()
array(['black-eyed peas', 'chickpeas', 'apples', 'grapes', 'lettuce',
'lentils', 'potatos', 'bananas', 'carrots', 'broccoli', 'LETTUCE',
nan, 'oranges', 'navy beans', 'Oranges', 'Apples', 'Carrots',
'Potatos', 'LENTILS', 'Broccoli', 'Chickpeas', 'CARROTS',
'NAVY BEANS', 'Navy beans', 'BANANAS', 'BROCCOLI', 'Bananas',
'Lentils', 'Lettuce', 'ORANGES', 'Black-eyed peas', 'Grapes',
'GRAPES', 'CHICKPEAS', 'APPLES', 'BLACK-EYED PEAS'], dtype=object)
store_data.Cat.unique()
array(['legumes', 'fruits', 'vegetables', nan, 'VEGETABLES', 'Fruits',
'Legumes', 'Vegetables', 'FRUITS', 'LEGUMES'], dtype=object)
As you can see, the Cat and Item columns each contain the same set of values, but the text case differs. This is likely due to the manual input. We will deal with that later.
Another problem that we have is the dreaded null values!
The isnull and isna methods can help with identifying the null values in your data.
store_data.isnull().sum()
Tran ID 0
Cat 82
Item 72
Qty 15
Unit Price 27
dtype: int64
Apparently all the columns except for Tran ID contain null values!
Duplicated rows are another issue that you may find in your data! The duplicated method can help to identify those.
store_data.duplicated().sum() # Duplicated records count.
175
store_data[store_data.duplicated(keep=False)].head(10)
And now for the fun part, Data Wrangling:
As I said, Data Wrangling consists of a lot of operations. You may perform all or a subset of them depending on what kind of data issues you discovered in your dataset.
In this tutorial, we will do some basic operations to deal with what we have found.
We will start by renaming the columns. We can do this operation either in bulk or individually. When working with dataframes, I personally prefer lower case column names without any spaces. This makes referencing the columns juuuust a little bit faster and avoids lots of potential issues with column names that contain spaces. This is what we will do here.
# Doing a bulk rename
store_data.columns = [col.replace(" ", "_").lower() for col in store_data.columns]
# Doing individual rename
store_data.rename(
    columns={
        "cat": "category",
        "qty": "quantity"
    },
    inplace=True # Without this, the rename method will return a copy of the modified dataframe.
)
store_data.columns
Index(['tran_id', 'category', 'item', 'quantity', 'unit_price'], dtype='object')
Now we will get rid of the 175 duplicated rows that we saw earlier, as it's unlikely for the same item to be repeated within one transaction; you would just increase the item's quantity instead. (Note that the duplicated method, as used here, treats rows as duplicates only when all the columns match. You can change this behavior with the subset parameter, illustrated below.)
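As a quick sketch of subset (not needed for our cleanup, just to show the idea), this would treat rows as duplicates whenever they share the same transaction and item, ignoring the other columns:
store_data.duplicated(subset=["tran_id", "item"]).sum() # Duplicates judged on these two columns only.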
# One way to drop the duplicates is to not select them.
count_before = len(store_data)
store_data = store_data[~store_data.duplicated()]
count_after = len(store_data)
print(f"Rows count before removing duplicates {count_before}, and after {count_after}. The diff is {count_before - count_after}")
Rows count before removing duplicates 10038, and after 9863. The diff is 175
Next, the tran_id won't be needed for this analysis. Let's drop it!
store_data.drop(
    columns="tran_id",
    inplace=True
)
store_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9863 entries, 0 to 10037
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   category    9781 non-null   object
 1   item        9791 non-null   object
 2   quantity    9848 non-null   float64
 3   unit_price  9836 non-null   float64
dtypes: float64(2), object(2)
memory usage: 385.3+ KB
Remember when we found out that the category and item columns contain the same values but with different text cases? Let's correct that.
store_data.category = store_data.category.str.lower()
store_data.item = store_data.item.str.lower()
With str, you can access a lot of the string functions offered by pandas; lower is one of them, and it is vectorized (you can think of it as working in parallel) across the whole pandas Series (column).
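A couple of other str methods, shown purely as an illustration (this dataset doesn't need them, and nothing is assigned here):
store_data.item.str.strip() # Trim leading/trailing whitespace.
store_data.item.str.contains("beans") # Mask of items whose name mentions "beans".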
Dealing with null values (a.k.a. missing values) is a whole topic on its own. One wrong decision about your missing data and your analysis goes in a whole different direction (maybe the wrong one!). We will try to deal with the null values in this dataset within the given context.
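In broad strokes, you usually either drop the affected rows or fill the gaps with something sensible. A rough sketch of both moves (nothing is assigned here; we will make the actual per-column decisions below):
store_data.dropna(subset="quantity") # Option 1: drop rows where quantity is missing (returns a copy here).
store_data.quantity.fillna(store_data.quantity.median()) # Option 2: fill with a statistic instead.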
If you recall, we found out that all the columns contain null values (except for tran_id, but we have already dropped that column).
The first column we will deal with is quantity.
I've decided to drop all rows with missing quantities. After all, I can't infer (accurately guess) this figure from anywhere, and wrongly inferring the quantity may affect the quality of our analysis. Besides, there are only 15 such rows out of roughly 10K. To do that, the dropna method is your friend.
store_data.dropna(
    subset="quantity",
    inplace=True
)
Another case where we can simply drop rows is when the item is missing: without knowing the item, the row tells us very little, and we will also rely on the item later to fill in the category and the unit_price. (If you could assume a distinct, unique unit_price for each item, you could infer the item from the price, but that assumption is dangerous.)
store_data.dropna(
    subset=["quantity", "item"],
    inplace=True
)
If we check the null counts again, we will find that only category and unit_price still contain missing values.
store_data.isna().sum()
category 48
item 0
quantity 0
unit_price 15
dtype: int64
If we assume that each item belongs to exactly one category, the missing values in the category column can easily be filled in.
categories = store_data[["category", "item"]].groupby(by=["category", "item"], as_index=False).count()
categories_dict = {row["item"]: row["category"] for _, row in categories.iterrows()}
categories_dict
{'apples': 'fruits',
'bananas': 'fruits',
'grapes': 'fruits',
'oranges': 'fruits',
'black-eyed peas': 'legumes',
'chickpeas': 'legumes',
'lentils': 'legumes',
'navy beans': 'legumes',
'broccoli': 'vegetables',
'carrots': 'vegetables',
'lettuce': 'vegetables',
'potatos': 'vegetables'}
Basically, we have created a "lookup dictionary" for the items' categories. Now we can use this dict with the versatile apply method as follows to get the missing categories.
store_data.category = store_data.item.apply(lambda x: categories_dict[x])
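Note that this apply call recomputes every category from the item, not just the missing ones. If you only wanted to fill in the gaps, a map combined with fillna would do it; a small sketch:
store_data.category = store_data.category.fillna(store_data.item.map(categories_dict)) # Only missing categories get replaced.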
The same technique can be used to fill in the missing unit_price values, assuming each item is sold at a single unit price.
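A quick sanity check of that assumption before building the lookup (just a sketch):
store_data.groupby("item")["unit_price"].nunique() # Should be 1 for every item if each item has a single price.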
unit_prices = store_data[["item", "unit_price"]].groupby(by=["item", "unit_price"], as_index=False).count()
unit_prices_dict = {row["item"]: row["unit_price"] for _, row in unit_prices.iterrows()}
store_data.unit_price = store_data.item.apply(lambda x: unit_prices_dict[x])
store_data.isna().sum()
category 0
item 0
quantity 0
unit_price 0
Yaaaay! No missing values!
We can also change the data type of the columns. We will do that for the quantity column, changing its type to int, because in our scenario the quantity can't be a fraction. In addition to that, we can enrich our data: we can add a total_amount column calculated as quantity x unit_price, and a done_on column containing the date this dataset was created, which can be obtained from the csv file name.
store_data.quantity = store_data.quantity.astype(int)
store_data["total_amount"] = store_data.quantity * store_data.unit_price
file_date = re.search(r"\d+", file_path.stem).group() # Regex to grab the run of digits in the file name (stem = name without the extension).
file_date = datetime.datetime.strptime(file_date, "%Y%m%d") # Parse a date object from the previous match.
store_data["done_on"] = file_date
And now, the analysis!
First, the total revenue for each item category.
(
    store_data[["category", "total_amount"]]
    .groupby(by="category")
    .sum()
)
Next, the top 3 purchased items
(
    store_data[["item", "quantity"]]
    .groupby(by="item")
    .sum()
    .sort_values(by="quantity", ascending=False)
    .head(3)
)
Of course, an analysis is usually presented with graphs and plots, but those tables will suffice for now; you get the idea.
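For example, a minimal bar chart of the revenue per category could look like this (a sketch, assuming matplotlib is installed, e.g. pip install matplotlib):
import matplotlib.pyplot as plt
revenue_per_category = store_data.groupby("category")["total_amount"].sum()
revenue_per_category.plot(kind="bar", title="Revenue per category") # pandas uses matplotlib under the hood.
plt.show()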
Conclusion
Data wrangling is very important when dealing with any dataset, and it is very time-consuming too! There is a statistic circulating in the data field that data scientists spend 80% of their time just preparing the data! Python and pandas do a very good job in this domain, but not without limitations (check out Apache Spark and see why it's used for big data processing instead of pandas)!
The scenario I built this tutorial around isn't the most realistic one you might face, but hey, it did the job.
Finally, you might have encountered bits of code here that look more like incantations than code. That's normal, especially if you are just starting out with python, and I can't explain everything here! I highly encourage you to have a look at the accompanying notebook; it contains all the code in this tutorial with extra explanation. Also, look up anything you didn't understand, either in online forums or the documentation, but don't let it pass! That's how you'll learn.