mwang-cmn

Posted on Jul 24 • Edited on Jul 27

Customer Segmentation - Time Based Cohort Analysis

#dataanalysis #cohortanalysis #python #customersegmentation

Introduction

In the competitive landscape of modern business, understanding customer behavior is crucial for sustaining growth and maintaining a competitive edge. One of the most effective methods for gaining these insights is cohort analysis, which groups customers based on shared characteristics and tracks their behavior over time. This project demonstrates a comprehensive approach to conducting a time-based cohort analysis, focusing on customer retention, purchase behavior, and engagement.

Importance of Cohort Analysis for Businesses

Cohort analysis offers deep insights into customer behavior by tracking groups of customers (cohorts) who share common characteristics or experiences. This approach helps businesses:

Identify patterns and trends in customer behavior over time
Understand the impact of marketing efforts and product changes
Improve customer relationship management strategies

Data Preparation and Data Cleaning

The Data used in this analysis contains transactions ,made by customers from 28 countries.
Load and preprocess the dataset to ensure it is ready for analysis.

import pandas as pd

# Load the data
url = r"C:\Users\user.DESKTOP-1OSCPEL\Documents\Cohort Analysis\Ecommerce.csv"
data = pd.read_csv(url1, parse_dates = ['InvoiceDate'])
data.head()

Data Description
The dataset used in this cohort analysis is a transactional dataset from an unspecified retail business. It includes detailed records of customer purchases over a specific period. The dataset contains the following columns:

InvoiceNo: A unique identifier for each transaction. Each transaction can involve multiple items.
InvoiceDate: The date and time when the transaction was generated. This column is essential for time-based analysis, as it allows us to track the timing of each purchase.
CustomerID: A unique identifier for each customer. This column enables us to group purchases by customer and track their behavior over time.
StockCode: A unique identifier for each product. This column helps in identifying different products purchased.
Description: A brief description of the product. This column provides a human-readable name for each product.
Quantity: The number of units of the product purchased in each transaction. This column helps in understanding the volume of items sold.
UnitPrice: The price per unit of the product. This column, combined with the quantity, allows us to calculate the total sales for each product in each transaction.
Country: The country where the customer resides. This column helps in analyzing geographical patterns in customer behavior. I removed the null values and there were no duplicates in the dataset.

Exploratory Data Analysis

Quantity Sold by Country
These are the top 10 countries by Quantity sold:

These are the bottom 10 countries by Quantity sold:

Users by Country
These are the top 10 countries by number of unique customers:

These are the bottom 10 countries by number of unique customers:

Trend in Monthly Sales
A new column, Sales was calculated. I calculated the Monthly Sales and plotted a line graph using seaborn as follows

#  Calculate the Sales column
data['Sales'] = data['Quantity'] * data['UnitPrice']
# Extract Year-Month
data['Year-Month'] = data['InvoiceDate'].dt.to_period('M')

# Monthly sales
monthly_sales = data.groupby(['Year-Month'])['Sales'].sum()

#monthly_sales
plt.figure(figsize=(12,6))
sns.lineplot(x=monthly_sales.index.strftime('%y-%m'), y=monthly_sales.values, data=monthly_sales, ci=None)
plt.title('Monthly Trend of Sales')
plt.xlabel('Date')
plt.ylabel('Sales in millions')
plt.xticks(rotation=0)
plt.grid(True, which='both',linewidth=0.5)
plt.tight_layout()
plt.show()

The result was as follows:

Observations

December 2010 to August 2011: Noticeable fluctuations in sales quantity.
August 2011 to November 2011: A significant sales increase occurred.
November 2011 to December 2011: Sales experienced a sudden and substantial decline.

Cohort Analysis

Defining Cohorts

Identify cohorts based on the initial purchase dates and calculate the cohort indices.
1. Define Cohort Date
Since this is a time-based cohort analysis, the cohorts will be grouped according to the dates they made their first purchase

2. Define Cohort Index
The cohort Index is a measure of the time interval in months since a particular cohort made their first purchase. Index 4 for instance, indicates that this cohort made their first purchase 4 months ago. The cohort index is the difference between the Cohort Date and Invoice Date i.e. recent purchase time.

The resulting dataset with the Cohort Date and Cohort Index was as follows:

3. Calculating Cohort Metrics

Aggregate the data to calculate metrics such as unique users, average quantity, and retention rates.

# Group data by CohortDate and CohortIndex, and count unique CustomerIDs
cohort_data = data.groupby(['CohortDate', 'CohortIndex'])['CustomerID'].nunique().reset_index()

# Rename the 'CustomerID' column to 'Customers' for better clarity
cohort_data.rename(columns={"CustomerID": "Customers"}, inplace=True)

# Pivot the data to create a matrix where rows represent CohortDate, columns represent CohortIndex, and values represent the number of unique Customers
cohortdata = cohort_data.pivot(index='CohortDate', columns='CohortIndex', values='Customers')

# Display the pivoted cohort data
cohortdata

Visualizing Cohort Analysis

Generate heatmaps to visualize average quantity per cohort, retention rates, and users per cohort.
1. Average Quantity ordered per Cohort

average_quantity = data.groupby(['CohortDate', 'CohortIndex'])['Quantity'].mean().reset_index()
average_quantity['Quantity'] = average_quantity['Quantity'].round(1)
#Rename the Quantity column
average_quantity.rename(columns={'Quantity': 'Avg. Quantity'}, inplace=True)
#pivot the table 
quantity_pivot = average_quantity.pivot(index='CohortDate', columns=['CohortIndex'], values='Avg. Quantity')
# Convert index to the name of the month and year
quantity_pivot.index = quantity_pivot.index.strftime('%B %Y')
# Visualize using a heatmap
plt.figure(figsize = (15,8))
sns.heatmap(quantity_pivot, annot=True, cmap ='Dark2')
plt.title("Heatmap - Average Quantity Bought per Cohort", fontsize=14, fontweight='bold')
plt.show()

Insights

The Average Quantity bought is almost constant after a few months. While there maybe fluctuations in customer retention, the volume of sales does not fluctuate, indicating that few customers purchase a large quantity of products.
Target marketing can be increased in countries where more quantity is sold, rather than where more customers are acquired e.g. Tunisia, South Africa, Rwanda, Somalia etc

2. Retention Rate
Retention rate is a metric that measures the percentage of customers who continue to engage with a business over a specified period. It is calculated by dividing the number of retained customers by the total number of customers at the start of the period.
Importance of Retention Rate:

Indicates customer loyalty and satisfaction
Helps businesses understand long-term engagement
Provides insights into the effectiveness of customer retention strategies

# Calculate the Retention Rates 
cohort_retention_table = cohortdata.divide(cohortdata.iloc[:,0], axis=0)
cohort_retention_table

Heatmap for Retention Rates

# Visualize the retention rates for each cohort in a heatmap
plt.figure(figsize=(15,8))
sns.heatmap(cohort_retention_table, annot=True, cmap='Dark2', fmt='.0%')
plt.title("Heatmap - Retention Rates per cohort", fontsize=14, fontweight='bold')
plt.show()

3. Users per Cohort

# Convert Index to a readable format
cohortdata.index = cohortdata.index.strftime("%B-%Y")
# Plot heatmap
plt.figure(figsize=(15,8))
sns.heatmap(cohortdata, annot=True,cmap= 'Dark2_r', fmt=".2f")
plt.title("Heatmap - Users per Cohort", fontsize=14, fontweight='bold')
plt.show()

Analysis and Insights

The following are observations from the analysis:

Retention Rates: The retention rate ranges between 20% and 40%, meaning that 20% to 40% of customers continue to make purchases from the e-commerce platform.
December 2010 Cohort Outperforms Others: The December 2010 cohort has a retention rate above 30%, indicating strong customer engagement likely due to effective marketing or customer retention strategies.
Decline in December 2011: A noticeable decline in retention rates for all cohorts in December 2011 suggests potential issues during that period, warranting further investigation.
Variability in Retention Rates: Retention rates vary significantly, from a minimum of 8% to a maximum of 50%, indicating differing customer behaviors across cohorts.
Average Quantity Bought: The average quantity purchased remains relatively constant after a few months, suggesting stable sales volume despite fluctuations in customer retention. This indicates that fewer customers purchase larger quantities of products.

Recommendations

Identify Factors Driving High Retention (December 2010): Analyze the factors contributing to the high retention rate for the December 2010 cohort and replicate successful strategies for other cohorts.
Investigate December 2011 Drop: Investigate the causes behind the low retention rates in December 2011 by analyzing customer feedback, product quality, customer service, or any changes in business operations.
Set Realistic Targets: Set specific retention targets based on historical data and industry benchmarks, aiming to improve retention rates gradually over time.
Implement Retention Strategies: Develop and implement tailored retention strategies such as personalized marketing, loyalty programs, and targeted communication to improve retention rates.
Continuously Monitor and Adapt: Regularly monitor retention rates and cohort data to adapt strategies as needed, ensuring informed decisions to enhance customer retention.
Target Marketing in High-Quantity Regions: Increase target marketing efforts in countries where higher quantities are sold, such as Tunisia, South Africa, Rwanda, and Somalia, rather than focusing solely on customer acquisition.

Conclusion

Understanding customer behavior through cohort analysis provides invaluable insights that can drive strategic decision-making. This analysis highlights the importance of customer segmentation, as it allows businesses to identify and understand different customer groups' behavior over time. By segmenting customers based on their initial purchase dates, we can track retention rates, average purchase quantities, and overall engagement. This granular understanding helps in tailoring marketing strategies, improving customer retention efforts, and ultimately driving business growth. Proper segmentation ensures that resources are allocated effectively, catering to the specific needs and preferences of different customer cohorts.
Find the Notebook - Here

DEV Community

Customer Segmentation - Time Based Cohort Analysis

Introduction

Importance of Cohort Analysis for Businesses

Data Preparation and Data Cleaning

Exploratory Data Analysis

Cohort Analysis

Defining Cohorts

Visualizing Cohort Analysis

Analysis and Insights

Recommendations

Conclusion

Top comments (0)

Read next

Building a desktop launcher

Quick and Dirty Guide to Running a Local LLM and Making API Requests

👋🏻Goodbye Power BI! 📊 In 2025 Build AI/ML Dashboards Entirely Within Python 🤖

10 Mind-Blowing AI Predictions That Will Change Our World by 2030