Glenn Viroux
Mastering MultiIndexes in Pandas: A Powerful Tool for Complex Data Analysis

Pandas is a widely used data manipulation library in Python that offers extensive capabilities for handling various types of data. One of its notable features is the ability to work with MultiIndexes, also known as hierarchical indexes. In this blog post, we will delve into the concept of MultiIndexes and explore how they can be leveraged to tackle complex, multidimensional datasets.

Understanding MultiIndexes: Analyzing Sports Performance Data

A MultiIndex is a pandas data structure that allows indexing and accessing data across multiple dimensions or levels. It enables the creation of hierarchical structures for rows and columns, providing a flexible way to organize and analyze data. To illustrate this, let's consider a scenario where you are a personal trainer or coach monitoring the health parameters of your athletes during their sports activities. You want to track various parameters such as heart rate, running pace, and cadence over a specific time interval.
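Before building one from realistic data, here is a minimal sketch of what a MultiIndex looks like (all numbers made up for illustration):

```python
import pandas as pd

# Toy values only: two athletes, two parameters per athlete
index = pd.MultiIndex.from_product(
    [["Bob", "Alice"], ["heart_rate", "cadence"]],
    names=["athlete", "parameter"],
)
s = pd.Series([150.0, 172.0, 135.0, 164.0], index=index)

# .xs selects everything under one label of a given level
bob = s.xs("Bob", level="athlete")
```

The two levels, `athlete` and `parameter`, act like nested dimensions: selecting on one level returns the data indexed by the remaining levels.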

Synthetic Health Performance Data

To work with this type of data, let's begin by writing Python code that simulates health performance data, specifically heart rates and running cadences:

from __future__ import annotations
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

start = datetime(2023, 6, 8, 14)
end = start + timedelta(hours=1, minutes=40)
timestamps = pd.date_range(start, end, freq=timedelta(minutes=1), inclusive='left')

def get_heart_rate(begin_hr: int, end_hr: int, break_point: int) -> pd.Series[float]:
    # Linear ramp from begin_hr to end_hr over the first break_point minutes,
    # then a constant plateau, with Gaussian noise on top
    noise = np.random.normal(loc=0.0, scale=3, size=100)
    heart_rate = np.concatenate((np.linspace(begin_hr, end_hr, num=break_point), [end_hr] * (100 - break_point))) + noise
    return pd.Series(data=heart_rate, index=timestamps)

def get_cadence(mean_cadence: int) -> pd.Series[float]:
    # Constant mean cadence with noise; NaN-ing two of every three samples and
    # forward-filling makes the noise update only every third minute
    noise = np.random.normal(loc=0.0, scale=1, size=100)
    cadence = pd.Series(data=[mean_cadence] * 100 + noise, index=timestamps)
    cadence[::3] = np.nan
    cadence[1::3] = np.nan
    return cadence.ffill().fillna(mean_cadence)

The code snippet provided showcases the generation of synthetic data for heart rate and cadence during a sports activity. It begins by importing the necessary modules such as datetime, numpy, and pandas.

The duration of the sports activity is defined as 100 minutes, and the pd.date_range function is utilized to generate a series of timestamps at one-minute intervals to cover this period.

The get_heart_rate function generates synthetic heart rate data, assuming a linear increase in heart rate up to a certain level, followed by a constant level for the remainder of the activity. Gaussian noise is introduced to add variability to the heart rate data, making it more realistic.

Similarly, the get_cadence function generates synthetic cadence data, assuming a relatively constant cadence throughout the activity. Gaussian noise is added to create variability in the cadence values, with the noise values being updated every three minutes instead of every minute, reflecting the stability of cadence compared to heart rates.

With the data generation functions in place, it is now possible to create synthetic data for two athletes, Bob and Alice:

bob_hr = get_heart_rate(begin_hr=110, end_hr=160, break_point=20)
alice_hr = get_heart_rate(begin_hr=90, end_hr=140, break_point=50)
bob_cadence = get_cadence(mean_cadence=175)
alice_cadence = get_cadence(mean_cadence=165)

At this point, we have the heart rates and cadences of Bob and Alice. Let's plot them using matplotlib to get some more insight into the data:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

date_formatter = mdates.DateFormatter('%H:%M:%S')  # Customize the date format as needed

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(date_formatter)
ax.plot(bob_hr, color="red", label="Heart Rate Bob", marker=".")
ax.plot(alice_hr, color="red", label="Heart Rate Alice", marker="v")
ax.grid()
ax.legend()
ax.set_ylabel("Heart Rate [BPM]")
ax.set_xlabel("Time")

ax_cadence = ax.twinx()
ax_cadence.plot(bob_cadence, color="purple", label="Cadence Bob", marker=".", alpha=0.5)
ax_cadence.plot(alice_cadence, color="purple", label="Cadence Alice", marker="v", alpha=0.5)
ax_cadence.legend()
ax_cadence.set_ylabel("Cadence [SPM]")
ax_cadence.set_ylim(158, 180)

Health performance plot

Great! The initial analysis of the data provides interesting observations. We can easily distinguish the differences between Bob and Alice in terms of their maximum heart rate and the rate at which it increases. Additionally, Bob's cadence appears to be notably higher than Alice's.

Using Dataframes for Scalability

However, as you might have already noticed, the current approach of using separate variables (bob_hr, alice_hr, bob_cadence, and alice_cadence) for each health parameter and athlete is not scalable. In real-world scenarios with a larger number of athletes and health parameters, this approach quickly becomes impractical and cumbersome.

To address this issue, we can leverage the power of pandas by utilizing a pandas DataFrame to represent the data for multiple athletes and health parameters. By organizing the data in a tabular format, we can easily manage and analyze multiple variables simultaneously.

Each row of the DataFrame can correspond to a specific timestamp, and each column can represent a health parameter for a particular athlete. This structure allows for efficient storage and manipulation of multidimensional data.

By using a DataFrame, we can eliminate the need for separate variables and store all the data in a single object. This enhances code clarity, simplifies data handling, and provides a more intuitive representation of the overall dataset.

bob_df = pd.concat([bob_hr.rename("heart_rate"), bob_cadence.rename("cadence")], axis="columns")

This is what the DataFrame for Bob’s health data looks like:

heart_rate cadence
2023-06-08 14:00:00 112.359 175
2023-06-08 14:01:00 107.204 175
2023-06-08 14:02:00 116.617 175.513
2023-06-08 14:03:00 121.151 175.513
2023-06-08 14:04:00 123.27 175.513
2023-06-08 14:05:00 120.901 174.995
2023-06-08 14:06:00 130.24 174.995
2023-06-08 14:07:00 131.15 174.995
2023-06-08 14:08:00 131.402 174.669
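As an aside, the same pd.concat pattern scales to several athletes by creating a MultiIndex on the columns rather than the rows. A sketch with toy values standing in for the generated Series:

```python
import pandas as pd

timestamps = pd.date_range("2023-06-08 14:00", periods=3, freq="min")
bob = pd.DataFrame(
    {"heart_rate": [112.0, 107.0, 116.0], "cadence": [175.0, 175.0, 175.5]},
    index=timestamps,
)
alice = pd.DataFrame(
    {"heart_rate": [90.0, 92.0, 95.0], "cadence": [165.0, 165.0, 166.0]},
    index=timestamps,
)

# Passing a dict to pd.concat turns the keys into an extra column level,
# giving (athlete, parameter) tuples as column labels
wide = pd.concat({"Bob": bob, "Alice": alice}, axis="columns")
bob_hr = wide[("Bob", "heart_rate")]
```

This column-oriented layout is an alternative to the row-oriented MultiIndex built in the next section; which one is more convenient depends on whether you mostly slice by time or by athlete/parameter.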

Introducing Hierarchical Dataframes

The last dataframe looks better already! But now we still have to create a new dataframe for each athlete. This is where pandas MultiIndex can help. Let's take a look at how we can elegantly merge the data of multiple athletes and health parameters into one dataframe:

from itertools import product

# First, the approach for a single athlete and parameter:
bob_df = bob_hr.to_frame("values")
bob_df["athlete"] = "Bob"
bob_df["parameter"] = "heart_rate"

values = {
    "Bob": {
        "heart_rate": bob_hr,
        "cadence": bob_cadence,
    },
    "Alice": {
        "heart_rate": alice_hr,
        "cadence": alice_cadence
    }
}

sub_dataframes: list[pd.DataFrame] = []
for athlete, parameter in product(["Bob", "Alice"], ["heart_rate", "cadence"]):
    sub_df = values[athlete][parameter].to_frame("values")
    sub_df["athlete"] = athlete
    sub_df["parameter"] = parameter
    sub_dataframes.append(sub_df)

df = pd.concat(sub_dataframes).set_index(["athlete", "parameter"], append=True)
df.index = df.index.set_names(["timestamps", "athlete", "parameter"])

This code processes heart rate and cadence data for athletes Bob and Alice. It performs the following steps:

  1. Create a DataFrame for Bob's heart rate data and add metadata columns for athlete and parameter.
  2. Define a dictionary that stores heart rate and cadence data for Bob and Alice.
  3. Generate combinations of athletes and parameters (Bob/Alice and heart_rate/cadence).
  4. For each combination, create a sub-dataframe with the corresponding data and metadata columns.
  5. Concatenate all sub-dataframes into a single dataframe.
  6. Set the index to include levels for timestamps, athlete, and parameter. This is where the actual MultiIndex is created.

This is what the hierarchical dataframe df looks like:

values
(Timestamp('2023-06-08 14:00:00'), 'Bob', 'heart_rate') 112.359
(Timestamp('2023-06-08 14:01:00'), 'Bob', 'heart_rate') 107.204
(Timestamp('2023-06-08 14:02:00'), 'Bob', 'heart_rate') 116.617
(Timestamp('2023-06-08 14:03:00'), 'Bob', 'heart_rate') 121.151
(Timestamp('2023-06-08 14:04:00'), 'Bob', 'heart_rate') 123.27
(Timestamp('2023-06-08 14:05:00'), 'Bob', 'heart_rate') 120.901
(Timestamp('2023-06-08 14:06:00'), 'Bob', 'heart_rate') 130.24
(Timestamp('2023-06-08 14:07:00'), 'Bob', 'heart_rate') 131.15
(Timestamp('2023-06-08 14:08:00'), 'Bob', 'heart_rate') 131.402
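As a side note, the loop-and-concat pattern above can be condensed: pd.concat accepts a dict and builds the key levels of the MultiIndex directly from the dict keys. A sketch with toy values standing in for the generated Series:

```python
import pandas as pd

timestamps = pd.date_range("2023-06-08 14:00", periods=3, freq="min")

# Toy stand-ins for the generated heart rate / cadence Series
values = {
    ("Bob", "heart_rate"): pd.Series([110.0, 112.0, 115.0], index=timestamps),
    ("Bob", "cadence"): pd.Series([175.0, 175.0, 176.0], index=timestamps),
    ("Alice", "heart_rate"): pd.Series([90.0, 92.0, 95.0], index=timestamps),
    ("Alice", "cadence"): pd.Series([165.0, 165.0, 166.0], index=timestamps),
}

# pd.concat builds the outer index levels from the dict keys;
# each Series' own index becomes the innermost level
df = pd.concat(values, names=["athlete", "parameter"]).to_frame("values")
df.index = df.index.set_names("timestamps", level=-1)
df = df.reorder_levels(["timestamps", "athlete", "parameter"]).sort_index()
```

reorder_levels then puts the timestamps first, matching the layout built above.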

At this point, we have a single dataframe that holds all the information for an arbitrary number of athletes and health parameters. We can now use the .xs method to query the hierarchical dataframe:

df.xs("Bob", level="athlete")  # get all health data for Bob
values
(Timestamp('2023-06-08 14:00:00'), 'heart_rate') 112.359
(Timestamp('2023-06-08 14:01:00'), 'heart_rate') 107.204
(Timestamp('2023-06-08 14:02:00'), 'heart_rate') 116.617
(Timestamp('2023-06-08 14:03:00'), 'heart_rate') 121.151
(Timestamp('2023-06-08 14:04:00'), 'heart_rate') 123.27
df.xs("heart_rate", level="parameter")  # get all heart rates
values
(Timestamp('2023-06-08 14:00:00'), 'Bob') 112.359
(Timestamp('2023-06-08 14:01:00'), 'Bob') 107.204
(Timestamp('2023-06-08 14:02:00'), 'Bob') 116.617
(Timestamp('2023-06-08 14:03:00'), 'Bob') 121.151
(Timestamp('2023-06-08 14:04:00'), 'Bob') 123.27
df.xs("Bob", level="athlete").xs("heart_rate", level="parameter")  # get heart_rate data for Bob
timestamps values
2023-06-08 14:00:00 112.359
2023-06-08 14:01:00 107.204
2023-06-08 14:02:00 116.617
2023-06-08 14:03:00 121.151
2023-06-08 14:04:00 123.27
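Chaining .xs calls works, but each call drops a level from the result. When you want to select on several levels at once while keeping the full index, .loc with pd.IndexSlice is an alternative. A self-contained sketch with toy values:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.date_range("2023-06-08 14:00", periods=2, freq="min"),
     ["Alice", "Bob"],
     ["cadence", "heart_rate"]],
    names=["timestamps", "athlete", "parameter"],
)
df = pd.DataFrame({"values": [float(i) for i in range(8)]}, index=idx).sort_index()

# .loc with pd.IndexSlice selects on several levels at once and,
# unlike .xs, keeps all three index levels in the result
ixs = pd.IndexSlice
bob_hr = df.loc[ixs[:, "Bob", "heart_rate"], :]
```

Note the sort_index call: label-based slicing on a MultiIndex is most reliable when the index is lexsorted.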

Use Case: Earth Temperature Changes

To demonstrate the power of hierarchical dataframes, let's explore a more complex, real-world use case: analyzing changes in Earth's surface temperatures over the last decades. For this task, we'll use a dataset available on Kaggle, which summarizes the Global Surface Temperature Change data distributed by the National Aeronautics and Space Administration Goddard Institute for Space Studies (NASA-GISS).

Inspect and Transform Original Data

Let's begin by reading and inspecting the data. This step is crucial to gain a better understanding of the dataset's structure and contents before delving into the analysis. Here's how we can accomplish that using pandas:

from pathlib import Path

file_path = Path() / "data" / "Environment_Temperature_change_E_All_Data_NOFLAG.csv"
df = pd.read_csv(file_path, encoding='cp1252')
df.describe()
       Area Code  Months Code  Element Code     Y1961     Y1962  ...     Y2018     Y2019
count       9656         9656          9656      8287      8322  ...      8349      8365
mean     821.806      7009.88        6674.5  0.402433  0.315527  ...   1.01083    1.0946
std      1781.07      6.03825       596.531  0.701567  0.713777  ...  0.872199  0.853953
min            1         7001          6078    -4.018    -5.391  ...    -2.216    -2.644
25%           78         7005          6078     0.057    -0.033  ...     0.434     0.455
50%        153.5         7009        6674.5     0.366     0.333  ...      0.81     0.939
75%       226.25         7016          7271    0.6765     0.627  ...     1.341     1.508
max         5873         7020          7271     5.771     4.373  ...     9.228     7.215

From this initial inspection, it becomes evident that the data is organized in a single dataframe, with separate rows for different months and countries. However, the values for different years are spread across several columns in the dataframe, labeled with the prefix 'Y'. This format makes it challenging to read and visualize the data effectively. To address this issue, we will transform the data into a more structured and hierarchical dataframe format, enabling us to query and visualize the data more conveniently.
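Before building a transformer class for this, it's worth noting that a vectorized reshape with melt and pivot_table can achieve much the same result. A sketch on a toy frame that mimics the CSV layout (only the Y-prefixed year columns and element names are taken from the real file; the values are made up):

```python
import pandas as pd

# Toy frame mimicking the CSV layout: one row per area/month/element,
# one column per year
wide = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Months": ["January", "January"],
    "Element": ["Temperature change", "Standard Deviation"],
    "Y1961": [0.777, 1.95],
    "Y1962": [0.062, 1.95],
})

# melt turns the year columns into rows: one observation per row
long = wide.melt(
    id_vars=["Area", "Months", "Element"],
    var_name="year",
    value_name="value",
)
long["year"] = long["year"].str.lstrip("Y").astype(int)

# pivot_table then yields a hierarchical (year, Area) index with
# one column per element
tidy = long.pivot_table(index=["year", "Area"], columns="Element", values="value")
```

This approach is faster than row-by-row processing, but the class-based transformer below gives finer control over validation and filtering, which is why the post takes that route.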

from dataclasses import dataclass, field
from datetime import date
from pydantic import BaseModel

MONTHS = {
    "January": 1,
    "February": 2,
    "March": 3,
    "April": 4,
    "May": 5,
    "June": 6,
    "July": 7,
    "August": 8,
    "September": 9,
    "October": 10,
    "November": 11,
    "December": 12
}

class GistempDataElement(BaseModel):
    area: str
    timestamp: date
    value: float

@dataclass
class GistempTransformer:
    temperature_changes: list[GistempDataElement] = field(default_factory=list)
    standard_deviations: list[GistempDataElement] = field(default_factory=list)

    def _process_row(self, row) -> None:
        relevant_elements = ["Temperature change", "Standard Deviation"]
        if (element := row["Element"]) not in relevant_elements or (month := MONTHS.get(row["Months"])) is None:
            return None

        for year, value in row.filter(regex="Y.*").items():
            new_element = GistempDataElement(
                timestamp=date(year=int(year.replace("Y", "")), month=month, day=1),
                area=row["Area"],
                value=value
            )
            if element == "Temperature change":
                self.temperature_changes.append(new_element)
            else:
                self.standard_deviations.append(new_element)

    @property
    def df(self) -> pd.DataFrame:
        # .dict() is the pydantic v1 API; with pydantic v2, use .model_dump() instead
        temp_changes_df = pd.DataFrame.from_records([elem.dict() for elem in self.temperature_changes])
        temp_changes = temp_changes_df.set_index(["timestamp", "area"]).rename(columns={"value": "temp_change"})

        std_deviations_df = pd.DataFrame.from_records([elem.dict() for elem in self.standard_deviations])
        std_deviations = std_deviations_df.set_index(["timestamp", "area"]).rename(columns={"value": "std_deviation"})

        return pd.concat([temp_changes, std_deviations], axis="columns")

    def process(self):
        environment_data = Path() / "data" / "Environment_Temperature_change_E_All_Data_NOFLAG.csv"
        df = pd.read_csv(environment_data, encoding='cp1252')
        df.apply(self._process_row, axis="columns")

This code introduces the GistempTransformer class, which demonstrates the processing of temperature data from a CSV file and the creation of a hierarchical DataFrame containing temperature changes and standard deviations.

The GistempTransformer class, defined as a dataclass, includes two lists, temperature_changes and standard_deviations, to store the processed data elements. The _process_row method is responsible for handling each row of the input DataFrame. It checks for relevant elements, such as "Temperature change" and "Standard Deviation," extracts the month from the "Months" column, and creates instances of the GistempDataElement class. These instances are then appended to the appropriate lists based on the element type.

The df property returns a DataFrame by combining the temperature_changes and standard_deviations lists. This hierarchical DataFrame has a MultiIndex with levels representing the timestamp and area, providing a structured organization of the data.

transformer = GistempTransformer()
transformer.process()
df = transformer.df
temp_change std_deviation
(datetime.date(1961, 1, 1), 'Afghanistan') 0.777 1.95
(datetime.date(1962, 1, 1), 'Afghanistan') 0.062 1.95
(datetime.date(1963, 1, 1), 'Afghanistan') 2.744 1.95
(datetime.date(1964, 1, 1), 'Afghanistan') -5.232 1.95
(datetime.date(1965, 1, 1), 'Afghanistan') 1.868 1.95
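With a date level in the index, another handy query (besides the get_level_values filtering used below) is partial slicing on the first level of a sorted MultiIndex. A toy sketch with made-up values:

```python
from datetime import date
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[date(1961, 1, 1), date(1962, 1, 1), date(1963, 1, 1)],
     ["Afghanistan", "Albania"]],
    names=["timestamp", "area"],
)
df = pd.DataFrame(
    {"temp_change": [0.777, 0.1, 0.062, 0.2, 2.744, 0.3]}, index=idx
)

# On a sorted MultiIndex, .loc supports partial slicing on the first level:
# all rows from 1962 onward, for every area
recent = df.sort_index().loc[date(1962, 1, 1):]
```

The slice only needs to mention the outermost level; the remaining levels are kept in full.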

Analyzing Climate Data

Now that we have consolidated all the relevant data into a single dataframe, we can proceed with inspecting and visualizing the data. Our focus is on examining the linear regression lines for each area, as they provide insights into the overall trend of temperature changes over the past decades. To facilitate this visualization, we will create a function that plots the temperature changes along with their corresponding regression lines.

def plot_temperature_changes(areas: list[str]) -> None:
    fig = plt.figure(figsize=(12, 6))
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)

    for area in areas:
        df_country = df[df.index.get_level_values("area") == area].reset_index()
        dates = df_country["timestamp"].map(datetime.toordinal)
        gradient, offset = np.polyfit(dates, df_country.temp_change, deg=1)
        ax1.scatter(df_country.timestamp, df_country.temp_change, label=area, s=5)
        ax2.plot(df_country.timestamp, gradient * dates + offset, label=area)

    ax1.grid()
    ax2.grid()
    ax2.legend()
    ax2.set_ylabel("Regression Lines [°C]")
    ax1.set_ylabel("Temperature change [°C]")

In this function, we use the get_level_values method of the pandas MultiIndex to efficiently query the hierarchical dataframe at different levels. Let's use this function to visualize temperature changes across the continents:

plot_temperature_changes(["Africa", "Antarctica", "Americas", "Asia", "Europe", "Oceania"])

plot of temperature changes in different continents

From this plot, we can draw several key conclusions:

  • The regression lines for all continents have a positive gradient, indicating a global trend of increasing Earth surface temperatures.
  • The regression line for Europe is notably steeper compared to other continents, implying that the temperature increase in Europe has been more pronounced. This finding aligns with observations of accelerated warming in Europe compared to other regions.
  • The specific factors contributing to the higher temperature increase in Europe compared to Antarctica are complex and require detailed scientific research. However, one contributing factor may be the influence of ocean currents. Europe is influenced by warm ocean currents, such as the Gulf Stream, which transport heat from the tropics towards the region. These currents play a role in moderating temperatures and can contribute to the relatively higher warming observed in Europe. In contrast, Antarctica is surrounded by cold ocean currents, and its climate is heavily influenced by the Southern Ocean and the Antarctic Circumpolar Current, which act as barriers to the incursion of warmer waters, thereby limiting the warming effect.

Now, let's focus our analysis on Europe itself by examining temperature changes in different regions within Europe. We can achieve this by creating individual plots for each European region:

plot_temperature_changes(["Southern Europe", "Eastern Europe", "Northern Europe", "Western Europe"])

plot of temperature changes in Europe

From the plotted temperature changes in different regions of Europe, we observe that the overall temperature rises across the European continent are quite similar. While there may be slight variations in the steepness of the regression lines between regions, such as Eastern Europe having a slightly steeper line compared to Southern Europe, no significant differences can be observed among the regions.

Ten Countries Most and Least Affected by Climate Change

Now, let's shift our focus to identifying the top 10 countries that have experienced the highest average temperature increase since the year 2000. Here's an example of how we can retrieve the list of countries:

(
    df[df.index.get_level_values(level="timestamp") > date(2000, 1, 1)]
    .groupby("area")
    .mean()
    .sort_values(by="temp_change", ascending=False)
    .head(10)
)
area temp_change std_deviation
Svalbard and Jan Mayen Islands 2.61541 2.48572
Estonia 1.69048 nan
Kuwait 1.6825 1.12843
Belarus 1.66113 nan
Finland 1.65906 2.15634
Slovenia 1.6555 nan
Russian Federation 1.64507 nan
Bahrain 1.64209 0.937431
Eastern Europe 1.62868 0.970377
Austria 1.62721 1.56392

To extract the top 10 countries with the highest average temperature increase since the year 2000, we perform the following steps:

  1. Filter the dataframe to include only rows with timestamps after January 1, 2000, using df.index.get_level_values(level="timestamp") > date(2000, 1, 1).
  2. Group the data by country using .groupby("area").
  3. Calculate the mean temperature change for each country using .mean().
  4. Select the top 10 countries with the largest mean temperature change using .sort_values(by="temp_change", ascending=False).head(10).

This result aligns with our previous observations, confirming that Europe experienced the highest rise in temperature compared to other continents.

Continuing with our analysis, let's now explore the ten countries that are least affected by the rise in temperature. We can utilize the same method as before to extract this information. Here's an example of how we can retrieve the list of countries:

(
    df[df.index.get_level_values(level="timestamp") > date(2000, 1, 1)]
    .groupby("area")
    .mean()
    .sort_values(by="temp_change", ascending=True)
    .head(10)
)
area temp_change std_deviation
Pitcairn Islands 0.157284 0.713095
Marshall Islands 0.178335 nan
South Georgia and the South Sandwich Islands 0.252101 1.11
Micronesia (Federated States of) 0.291996 nan
Chile 0.297607 0.534071
Wake Island 0.306269 nan
Norfolk Island 0.410659 0.594073
Argentina 0.488159 0.91559
Zimbabwe 0.493519 0.764067
Antarctica 0.527987 1.55841

We observe that the majority of countries in this list are small, remote islands located in the southern hemisphere. This finding further supports our previous conclusions that southern continents, particularly Antarctica, are less affected by climate change compared to other regions.

Temperature Changes during Summer and Winter

Now, let's delve into more complex queries using the hierarchical dataframe. In this specific use case, our focus is on analyzing temperature changes during winters and summers. For the purpose of this analysis, we define winters as the months of December, January, and February, while summers encompass June, July, and August. By leveraging the power of pandas and the hierarchical dataframe, we can easily visualize the temperature changes during these seasons in Europe. Here's an example code snippet to accomplish that:

all_winters = df[df.index.get_level_values(level="timestamp").map(lambda x: x.month in [12, 1, 2])]
all_summers = df[df.index.get_level_values(level="timestamp").map(lambda x: x.month in [6, 7, 8])]
winters_europe = all_winters.xs("Europe", level="area").sort_index()
summers_europe = all_summers.xs("Europe", level="area").sort_index()

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
ax.plot(winters_europe.index, winters_europe.temp_change, label="Winters", marker="o", markersize=4)
ax.plot(summers_europe.index, summers_europe.temp_change, label="Summers", marker='o', markersize=4)
ax.grid()
ax.legend()
ax.set_ylabel("Temperature Change [°C]")

Temperature changes during winters and summers

From this figure, we can observe that temperature changes during the winters exhibit greater volatility compared to temperature changes during the summers. To quantify this difference, let's calculate the standard deviation of the temperature changes for both seasons:

pd.concat([winters_europe.std().rename("winters"), summers_europe.std().rename("summers")], axis="columns")
winters summers
temp_change 1.82008 0.696666
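The seasonal split above can also be expressed as a single groupby by deriving a season label from the timestamp level, avoiding the two intermediate filtered frames. A sketch on toy values:

```python
from datetime import date
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[date(2000, m, 1) for m in (1, 2, 6, 7, 12)], ["Europe"]],
    names=["timestamp", "area"],
)
df = pd.DataFrame({"temp_change": [1.0, -1.0, 0.4, 0.6, 2.0]}, index=idx)

# Derive a season label from the month of the timestamp level and
# group on it directly
season = df.index.get_level_values("timestamp").map(
    lambda d: "winter" if d.month in (12, 1, 2) else "summer"
)
stds = df.groupby(season)["temp_change"].std()
```

The resulting Series has one standard deviation per season, mirroring the two-column comparison above.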

Conclusion

In conclusion, mastering MultiIndexes in Pandas provides a powerful tool for handling complex data analysis tasks. By leveraging MultiIndexes, users can efficiently organize and analyze multidimensional datasets in a flexible and intuitive manner. The ability to work with hierarchical structures for rows and columns enhances code clarity, simplifies data handling, and enables simultaneous analysis of multiple variables. Whether it's tracking health parameters of athletes or analyzing Earth's temperature changes over time, understanding and utilizing MultiIndexes in Pandas unlocks the full potential of the library for handling complex data scenarios.

You can find all code included in this post here: https://github.com/GlennViroux/pandas-multi-index-blog
