How to compare 2 datasets with pandas-profiling 🐼

Visualization is the cornerstone of Exploratory Data Analysis

When facing a new, unknown dataset, visual inspection allows us to get a feel for the available information, identify patterns in the data, and diagnose several issues that we might need to address.

pandas-profiling has long been an indispensable Swiss Army knife in every data scientist’s tool belt. However, one thing it seemed to be missing was the ability to compare different reports side by side, which would help us continuously assess the transformations performed during EDA!


Side-by-side comparison: the wait is over!

pandas-profiling now supports a "side-by-side" comparison feature that lets us automate the comparison process with a single line of code.

In this blogpost, I'll bring you up to speed with this new functionality and show you how we can use it to produce faster and smarter transformations on our data.

I’ll be using the HCC Dataset, which I have personally collected during my MSc. For this particular use case, I’ve artificially introduced some additional data quality issues to show you how visualisation can help us detect them and guide us towards their efficient mitigation.

The full code and examples can be found in this GitHub repository so you can follow along with the tutorial.


pandas-profiling: EDA at your fingertips

We’ll start by profiling the HCC dataset and investigating the data quality issues suggested in the report:

pip install pandas-profiling==3.5.0
import pandas as pd
from pandas_profiling import ProfileReport

# Read the HCC Dataset
df = pd.read_csv("hcc.csv")

# Produce the data profiling report
original_report = ProfileReport(df, title='Original Data')
original_report.to_file("original_report.html")
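
If you're working in a Jupyter notebook, the report can also be rendered inline instead of (or in addition to) being saved to HTML:

# Render the profiling report inline in a notebook
original_report.to_notebook_iframe()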


Alerts shown in Pandas Profiling Report.

According to the "Alerts" overview, there are four main types of potential issues that need to be addressed:

  • Duplicates: 4 duplicate rows in the data;
  • Constant: constant value “999” in 'O2';
  • High Correlation: several features marked as highly correlated;
  • Missing: missing values in 'Ferritin'.

The validity of each potential problem (as well as the need to find a mitigation strategy for it) depends on the specific use case and domain knowledge. In our case, with the exception of the "high correlation" alerts, which would require further investigation, the remaining alerts seem to reflect true data quality issues and can be tackled using a few practical solutions. Let's see how!
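
Before deciding on mitigation strategies, we can sanity-check these alerts directly with pandas (a quick spot check, assuming the column names match the ones shown in the report):

# Sanity-check the report's alerts with plain pandas
print(df.duplicated().sum())         # number of duplicate rows
print(df['O2'].nunique())            # 1 means 'O2' is constant
print(df['Ferritin'].isna().sum())   # missing 'Ferritin' values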

Removing Duplicate Rows

Depending on the nature of the domain, there might be records that share the same values without it being an error. However, considering that some of the features in this dataset are quite specific and refer to an individual’s biological measurements (e.g., "Hemoglobin", "MCV", "Albumin"), it’s unlikely that several patients would report exactly the same values for all features. Let’s start by dropping these duplicates from the data:

# Drop duplicate rows
df_transformed = df.copy()
df_transformed = df_transformed.drop_duplicates()
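
Before moving on, it can be worth eyeballing the offending records; a quick check also confirms the effect of the drop:

# Inspect the duplicated records and confirm they were dropped
print(df[df.duplicated(keep=False)])
print(f"{len(df)} rows before, {len(df_transformed)} rows after")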

Removing Irrelevant Features

The constant values in 'O2' also reflect a true inconsistency in the data. There are two likely reasons for such an error to arise: either the O2 values were measured and stored automatically in the database and the pulse oximeter failed, or the person taking the measurements kept getting repeated error readings and simply coded them as “999”, an absurd value given that O2 saturation ranges from 0% to 100%. In either case, these values are erroneous and should be removed from the analysis:

# Remove O2
df_transformed = df_transformed.drop(columns='O2')
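
If you want to catch any other uninformative columns programmatically, a small helper (not part of pandas-profiling, and assuming "constant" means a single unique value) does the trick:

# List any remaining columns with a single unique value
constant_cols = [col for col in df_transformed.columns
                 if df_transformed[col].nunique(dropna=False) <= 1]
print(constant_cols)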

Missing Data Imputation

As frequently happens with medical data, the HCC dataset also seems extremely susceptible to missing data. A simple way to address this issue (while avoiding the removal of incomplete records or entire features) is data imputation. We’ll use mean imputation to fill in the absent observations, as it is one of the simplest and most common statistical imputation techniques and often serves as a baseline method:

# Impute Missing Values
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(strategy="mean")
df_transformed['Ferritin'] = mean_imputer.fit_transform(df_transformed['Ferritin'].values.reshape(-1,1))
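
A quick check confirms that the imputation left no blanks behind:

# Verify that 'Ferritin' no longer has missing values
print(df_transformed['Ferritin'].isna().sum())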

Side-by-side comparison: faster and smarter iterations on your data

Now for the fun part! After implementing the first batch of transformations to our dataset, we're ready to assess their impact on the overall quality of our data.

This is where the new pandas-profiling comparison functionality comes in handy: the comparison between the original and the transformed data can now be performed automatically through the .compare method of the ProfileReport:

transformed_report = ProfileReport(df_transformed, title="Transformed Data")
comparison_report = original_report.compare(transformed_report)
comparison_report.to_file("original_vs_transformed.html") 
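
Recent versions of the library also expose a top-level compare function that accepts a list of reports, which is handy if you ever need to compare more than two snapshots of the data (same output, different entry point):

# Equivalent comparison via the top-level compare function
from pandas_profiling import compare

comparison_report = compare([original_report, transformed_report])
comparison_report.to_file("original_vs_transformed.html")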

How did these transformations impact the quality of our data? And what would we find by further investigating each of the transformations performed? Let’s dive deeper into the comparison results!

Dataset Overview

The comparison report shows both datasets ("Original Data" and "Transformed Data") and distinguishes them by colour in titles and graph plots: blue for the original data and red for the transformed data.

As shown in the report, the transformed dataset contains one less categorical feature ("O2" was removed), 165 observations (versus the original 171 containing duplicates) and no missing values (in contrast with the 79 missing observations in the original dataset).


Comparison Report: Dataset Statistics.

Duplicate Records

In contrast to the original data, there are no duplicate patient records in the transformed data: our complete and accurate case base can move onward to the modeling pipeline, reducing the risk of overfitting caused by duplicated records.


Comparison Report: Duplicate Rows.

Irrelevant Features

Features that have not been subjected to any transformation remain the same (as shown below for "Encephalopathy"): the original and transformed data summary statistics do not change. In turn, removed features are only presented for the original data (shown in blue), as is the case with "O2".


Comparison Report: Encephalopathy remains the same.


Comparison Report: O2 is only shown for the original data.

Missing Values

In contrast to the original data, there are no missing observations after the data imputation was performed. Note how both the nullity count and the nullity matrix show the differences between the two versions of the data: in the transformed data, "Ferritin" now has 165 complete values, and no blanks can be found in the nullity matrix.


Comparison Report: Missing Values.

A deeper investigation on data properties

If we were to compare all features before and after the data transformations, we would find an insightful detail concerning missing data imputation.

When analysing the "Ferritin" values in more detail, we’d see how imputing values with the mean has distorted the original data distribution, which is undesirable:


Comparison Report: Ferritin - imputed values seem to distort the original feature distribution.
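
One way to quantify this distortion outside the report is to compare summary statistics before and after the transformations: mean imputation leaves the mean (roughly) unchanged while deflating the spread of the distribution:

# Mean imputation keeps the mean (roughly) but shrinks the spread
print(df['Ferritin'].describe().loc[['mean', 'std']])
print(df_transformed['Ferritin'].describe().loc[['mean', 'std']])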

This artefact is also visible in the interactions and correlations visualisations, where odd interaction patterns and higher correlation values emerge in the relationship between "Ferritin" and the remaining features.


Comparison Report: Interactions between Ferritin and Age: imputed values are shown in a vertical line corresponding to the mean.


Comparison Report: Correlations - Ferritin correlation values seem to increase after data imputation.
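
Again, this is easy to spot-check outside the report. The report computes several correlation measures, so the exact numbers will differ, but a quick Pearson check on a single pair (taking "Age" as an example) illustrates how to track it:

# Compare one pairwise correlation before and after imputation
print(df['Ferritin'].corr(df['Age']))
print(df_transformed['Ferritin'].corr(df_transformed['Age']))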

This goes to show that the comparison report is not only useful for highlighting the differences introduced by data transformations, but also provides several visual cues that lead us towards important insights about those transformations: in this case, a more specialised data imputation strategy should be considered.
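
As a sketch of what such a strategy could look like (one possible alternative, not a prescription), scikit-learn's KNNImputer estimates each missing value from the most similar complete records instead of collapsing everything to the mean:

# Rebuild the pre-imputation data (duplicates dropped, 'O2' removed)
from sklearn.impute import KNNImputer

df_knn = df.drop_duplicates().drop(columns='O2')

# Impute each missing value from the 5 nearest neighbours
numeric_cols = df_knn.select_dtypes(include='number').columns
knn_imputer = KNNImputer(n_neighbors=5)
df_knn[numeric_cols] = knn_imputer.fit_transform(df_knn[numeric_cols])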


Conclusion

Throughout this small use case, I've covered the usefulness of comparing two sets of data within the same profiling report to highlight the data transformations performed during EDA and evaluate their impact on data quality. Nevertheless, the applications of this functionality are endless, as the need to (re)iterate on feature assessment and visual inspection is vital for data-centric solutions!


Made with ❤️ by the Data-Centric AI Community

Thank you for reading! If you enjoyed this tutorial and plan to use pandas-profiling in your data quests, please ⭐️ our repository and join the discussion on our Discord server!
