Edmilson Silva

Posted on Aug 12, 2023 • Edited on Mar 3, 2024

Tutorial: How to Create a Pareto Chart Using Plotly 📐

#python #jupyternotebook #pareto #plotly

Pareto Graph

What is a pareto chart?

Simply put, it is a chart that combines bars and a line: the bars show the individual values, sorted from largest to smallest, and the line shows their cumulative total.

Going back in time and being a bit theoretical: the chart is named after the Pareto principle, which goes back to the economist Vilfredo Pareto (born Wilfried Fritz Pareto).

He worked on subjects such as Pareto efficiency, microeconomics, and the Pareto distribution. I recommend reading more about him at the reference link.

The purpose of this chart is to highlight the most important factors in a set. It is commonly used in quality control, for example to analyze defects and identify the main causes of a problem.

I'd like to ask you not to forget to leave a like. It helps me see whether you are enjoying the content and helps the post reach more people.

Requirements:

A basic understanding of the Python language
A basic understanding of the pandas, NumPy and Plotly libraries

First we need to install Plotly, a library that helps a lot when creating dynamic, interactive charts.

!pip3 install plotly

Let's add some imports to work with:

import pandas as pd
import numpy as np
from plotly.graph_objects import Figure, Scatter, Bar

I used a very simple dataset from Kaggle: video game sales.

I chose the game genre column. For the chart we need the count of each genre and its percentage of the total, sorted in descending order. I wrote a function that does all of this processing; to use another column, just pass its name.

def build_dataframe(name_dataframe, col):
    dataframe = pd.read_csv(f'/kaggle/input/videogamesales/{name_dataframe}.csv')
    # value_counts() already returns the categories sorted from most to least frequent
    count = dataframe[col].value_counts().rename(f'{col}_count')
    # normalize=True gives each category's share of the total (between 0 and 1)
    percentage = dataframe[col].value_counts(normalize=True).rename(f'{col}_percentage')
    df = pd.concat([count, percentage], axis=1)

    return df

output:

df = build_dataframe('vgsales', 'Genre')
df
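If you do not have the Kaggle file at hand, the same count and percentage table can be reproduced on a tiny in-memory DataFrame. This is only a sketch: the `toy` data and the `build_table` helper below are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle CSV
toy = pd.DataFrame({'Genre': ['Action', 'Sports', 'Action', 'Puzzle', 'Action', 'Sports']})

def build_table(dataframe, col):
    # value_counts() sorts from the most to the least frequent category
    count = dataframe[col].value_counts().rename(f'{col}_count')
    # normalize=True returns each category's share of the total (0 to 1)
    percentage = dataframe[col].value_counts(normalize=True).rename(f'{col}_percentage')
    return pd.concat([count, percentage], axis=1)

table = build_table(toy, 'Genre')
print(table)
```

The index holds the category names, and the two columns follow the same `{col}_count` and `{col}_percentage` naming used by build_dataframe.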

Now we need a new column with the cumulative values. pandas already provides DataFrame.cumsum(axis=None, skipna=True, *args, **kwargs) for this, but from time to time I like to write such functions by hand instead of using the ready-made one. It's a habit of mine.

def cumulative(dataframe, col):
    df = dataframe.copy()
    # Start with a float column so fractional values can be stored
    df['cumulative'] = 0.0
    running_total = 0.0
    for name in df.index:
        # Add this category's share to the running total
        running_total += df.loc[name, f'{col}_percentage']
        df.loc[name, 'cumulative'] = running_total

    # Convert from a 0-1 fraction to a 0-100 percentage
    df['cumulative'] = df['cumulative'] * 100
    return df

output:

df = cumulative(df, 'Genre')
df
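For comparison, here is how the same column could be produced with the built-in `cumsum`. A minimal sketch, assuming the table already has a `Genre_percentage` column sorted in descending order; the `table` values and the `cumulative_cumsum` name are made up for illustration.

```python
import pandas as pd

# Hypothetical table shaped like the output of build_dataframe
table = pd.DataFrame(
    {'Genre_count': [3, 2, 1], 'Genre_percentage': [0.5, 2 / 6, 1 / 6]},
    index=['Action', 'Sports', 'Puzzle'],
)

def cumulative_cumsum(dataframe, col):
    df = dataframe.copy()
    # Running total of the shares, scaled to a 0-100 percentage
    df['cumulative'] = df[f'{col}_percentage'].cumsum() * 100
    return df

result = cumulative_cumsum(table, 'Genre')
print(result['cumulative'].round(2).tolist())
```

The last value of the column should land at 100 (up to floating-point rounding), which is a quick sanity check for either version.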

Now that the data is ready, let's move on to building the chart.

Graph Pareto

At the beginning we imported the Scatter and Bar classes from plotly.graph_objects. Scatter draws the line of cumulative values and Bar shows the counts.
A Pareto chart is built from these two traces, so I create a list containing both, which is passed to the figure later. Each trace receives the parameters it needs, including the matching column of the table we prepared. The layout also has to be configured; this can vary a lot depending on what you want, be it color, size, or text.

With all of that defined, I create a figure using the Figure class, passing in the traces and the layout. Finally, fig.show() displays the chart.

def graph_pareto(dataframe, col):
    df = dataframe.copy()

    data = [
        Bar(
          name="Count",
          x=df.index,
          y=df[f'{col}_count'],
          # Darker color for the five most frequent categories, lighter for the rest
          marker={"color": ['rgb(71, 71, 135)' if i < 5 else 'rgb(112, 111, 211)' for i in range(len(df.index))]}
        ),
        Scatter(
          line= {
            "color": "rgb(192, 57, 43)", 
            "width": 3
          }, 
          name= "Percentage", 
          x=  df.index,
          y= df['cumulative'], 
          yaxis= "y2",
          mode='lines+markers'
        ),
    ]

    layout = {
      # Title Graph
      "title": {
        'text': f"{col} Pareto",
        'font': dict(size=30)
      }, 
      # Font 
      "font": {
        "size": 14, 
        "color": "rgb(44, 44, 84)", 
        "family": "Times New Roman, monospace"
      },

      # Margins
      "margin": {
        "b": 20, 
        "l": 50, 
        "r": 50, 
        "t": 10,
      }, 
      "height": 400, 

      # Plot background
      "plot_bgcolor": "rgb(255, 255, 255)",

      # Settings Legend
      "legend": {
        "x": 0.79, 
        "y": 1.2, 
        "font": {
          "size": 12, 
          "color": "rgb(44, 44, 84)", 
          "family": "Courier New, monospace"
        },
        'orientation': 'h',
      },

      # Y-axis 1, on the left
      "yaxis": {
        # The title font is nested under "title"; the older "titlefont" key is deprecated
        "title": {
          "text": f"Count {col}",
          "font": {
            "size": 16,
            "color": "rgb(71, 71, 135)",
            "family": "Courier New, monospace"
          },
        },
      },

      # Y-axis 2, on the right
      "yaxis2": {
        "side": "right",
        "range": [0, 100],
        "title": {
          "text": f"Percentage {col}",
          "font": {
            "size": 16,
            "color": "rgb(71, 71, 135)",
            "family": "Courier New, monospace"
          },
        },
        "overlaying": "y",
        "ticksuffix": " %",
      },

     #---------------

    }

    # Build the figure from the traces and the layout
    fig = Figure(data=data, layout=layout)
    # Display the chart
    fig.show()

graph_pareto(df, 'Genre')

To finish, I'm going to create a function that combines everything we built, so we can test it with other columns of our dataset.

def show_pareto(name_dataframe, col):
    # Use the dataset name passed in instead of a hard-coded one
    df = build_dataframe(name_dataframe, col)
    df = cumulative(df, col)
    graph_pareto(df, col)

output:

show_pareto('vgsales', 'Genre')

show_pareto('vgsales', 'Platform')

show_pareto('vgsales', 'Publisher')
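A Pareto chart is usually read by looking for where the cumulative line crosses a threshold such as 80%. The helper below is a hedged sketch of that reading step; `vital_few`, the default of 80, and the toy table are my own assumptions, not part of the original notebook.

```python
import pandas as pd

def vital_few(df, threshold=80):
    # Cumulative share reached *before* each category is added
    before = df['cumulative'].shift(fill_value=0)
    # Keep every category needed to reach the threshold, including the one that crosses it
    return list(df.index[before < threshold])

# Hypothetical table with a 'cumulative' column like the one built by cumulative()
toy = pd.DataFrame(
    {'cumulative': [50.0, 75.0, 90.0, 100.0]},
    index=['Action', 'Sports', 'Puzzle', 'Racing'],
)
print(vital_few(toy))
```

This can be handy for columns with many categories, such as Publisher, where the chart itself gets crowded.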

Comments

Thanks for reading this far. I hope this helped you understand the topic. If you find any errors in the code or text, please do not hesitate to let me know. Don't forget to leave a like so the post can reach more people.

Resources

Notebook Pareto

Wilfried Fritz Pareto

About the author:

Edmilson Silva

Machine learning, deep learning, and raw code. Presented clearly and with examples.

A little more about me...

I have a Bachelor's degree in Information Systems, and in college I came into contact with different technologies. Along the way I took an Artificial Intelligence course, where I had my first contact with machine learning and Python. From then on, learning about this area became my passion. Today I work with machine learning and deep learning, developing communication software. I also created a blog where I write posts about the subjects I am studying and share them to help other users.

I'm currently learning TensorFlow and Computer Vision

Curiosity: I love coffee