DEV Community

Milan Mitrovic


Plotly Histogram

Recently I started playing with an interesting dataset: 150k rows from a large European bank. I wanted to find out whether there is any correlation between a borrower's monthly salary and their probability of default.

The first thing I charted was a histogram of monthly salary. Next, I wanted to show the same histogram with defaulting and non-defaulting borrowers in different colours.

Since the dataset is strictly confidential, let's create an artificial dataset for testing purposes.

import pandas as pd
import numpy as np
import plotly.express as px

import plotly.io as pio
pio.renderers.default = 'browser'

x = np.random.exponential(size=100000, scale=20) + 50000

df = pd.DataFrame({
            'monthly_salary': x,
            'default': np.random.choice([1, 0], size=len(x))
            })

# The purpose of this column is to help us count the number of clients
# that belong to each bin.
df['help_column'] = 1

This is what the generated table looks like.

Image description

The idea is to create a histogram of monthly salaries. There are two ways: a quicker one, and another that gives us more control over what is going on under the hood.

Let's start with the easier approach.

fig = px.histogram(
    data_frame=df,
    x='monthly_salary',
    nbins=200
)
fig.show()

This is what the chart looks like.

Image description

In this case, the bins are created automatically inside the Plotly Express function. We do not control the bin size directly; we only supply nbins, which Plotly treats as the maximum number of bins when it picks a bin size.

The other approach, which is a bit more involved, is to use pandas functions to create bins of arbitrary size and then assign each client to the corresponding bin.

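# One-unit-wide bins from 50000 to 50100; salaries above 50100 fall
# outside every bin and get NaN.
bins_ = pd.interval_range(start=50000, end=50100, freq=1)
df['monthly_salary_BINS'] = pd.cut(x=df['monthly_salary'], bins=bins_)

# Keep only the left (lower) edge of each interval instead of the
# whole interval; a single number is easier to plot.
df['monthly_salary_BINS_left'] = df['monthly_salary_BINS'].apply(func=lambda interval: interval.left)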

xx = df[['help_column', 'monthly_salary_BINS_left']].groupby(by='monthly_salary_BINS_left').sum().reset_index()

fig = px.bar(
    x=xx['monthly_salary_BINS_left'],
    y=xx['help_column']
)

fig.show()

Let's take a look at the resulting chart.

Image description
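The per-bin counts can also be computed without a helper column of ones. The sketch below is one possible variant, not the only way to do it: it counts rows per bin with value_counts. The names salary, binned, counts and counts_by_left, as well as the sample size and seed, are hypothetical and chosen just for this example.

```python
import numpy as np
import pandas as pd

# Synthetic salaries (sample size and seed are assumptions for this sketch).
rng = np.random.default_rng(seed=0)
salary = pd.Series(
    rng.exponential(scale=20, size=10_000) + 50000,
    name='monthly_salary',
)

# Same bins as in the post: width 1, from 50000 to 50100.
bins_ = pd.interval_range(start=50000, end=50100, freq=1)
binned = pd.cut(salary, bins=bins_)

# sort=False keeps the bins in their natural order. Rows outside
# all bins are NaN and are not counted.
counts = binned.value_counts(sort=False)

# Use the left edge of each interval as the x position for plotting.
counts_by_left = pd.Series(
    counts.to_numpy(),
    index=[interval.left for interval in counts.index],
    name='clients',
)
```

counts_by_left can then be passed to px.bar as x=counts_by_left.index and y=counts_by_left.values.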

...

What if we want one histogram for defaulting borrowers and one for non-defaulting borrowers?
Again, there are two approaches.

Let's start with the easier one again.

fig = px.histogram(
    data_frame=df,
    x='monthly_salary',
    color='default',
    nbins=200,
    barmode='group'
)
fig.show()

Here is the new chart.

Image description

The second way of plotting the histogram requires pivoting the data. The idea is to create a cross tabulation first and then plot it.

Personally, I prefer this way. It gives me more control: I can clearly see the table underlying the chart, so I can check its quality early on.

This way is also more efficient. The raw data is not sent to the browser; only the aggregated data reaches the front end, which is a significantly smaller amount.

df_pivoted = pd.pivot_table(data=df,
               values='help_column',
               index='monthly_salary_BINS_left',
               columns='default',
               aggfunc='sum')
fig = px.bar(
    df_pivoted,
    barmode='group'
)
fig.show()


Here is the pivot table underlying the chart.

Image description
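A table like this can also be built with pd.crosstab, which counts rows directly and so needs no helper column. Once the counts are in a table, the share of defaulting borrowers per bin is one division away. This is a sketch under my own assumptions: the names edges, bin_left, table and default_rate are hypothetical, the bins are built from numeric edges rather than pd.interval_range, and the data is random, so no real pattern should be expected in the result.

```python
import numpy as np
import pandas as pd

# Synthetic data (sample size and seed are assumptions for this sketch).
rng = np.random.default_rng(seed=0)
n = 10_000
df = pd.DataFrame({
    'monthly_salary': rng.exponential(scale=20, size=n) + 50000,
    'default': rng.choice([1, 0], size=n),
})

# 101 edges give 100 bins of width 1. Each bin is labelled with its
# left edge; salaries outside the bins become NaN.
edges = np.arange(50000, 50101)
bin_left = pd.cut(
    df['monthly_salary'], bins=edges, labels=edges[:-1]
).astype(float)

# Rows: bin left edge. Columns: default flag (0 or 1). Cells: client counts.
table = pd.crosstab(index=bin_left, columns=df['default'])

# Share of defaulting clients in each observed bin.
default_rate = table[1] / table.sum(axis=1)
```

Plotting table with px.bar(table, barmode='group') should give the same kind of grouped chart as the pivot table does.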

Here is amazing chart.

Image description

I hope you have enjoyed this tutorial. Happy cooking, and see you in the next plotting endeavour :)
