Glenn Viroux
Mastering MultiIndexes in Pandas: A Powerful Tool for Complex Data Analysis

Pandas is a widely used data manipulation library in Python that offers extensive capabilities for handling various types of data. One of its notable features is the ability to work with MultiIndexes, also known as hierarchical indexes. In this blog post, we will delve into the concept of MultiIndexes and explore how they can be leveraged to tackle complex, multidimensional datasets.

Understanding MultiIndexes: Analyzing Sports Performance Data

A MultiIndex is a pandas data structure that allows indexing and accessing data across multiple dimensions or levels. It enables the creation of hierarchical structures for rows and columns, providing a flexible way to organize and analyze data. To illustrate this, let's consider a scenario where you are a personal trainer or coach monitoring the health parameters of your athletes during their sports activities. You want to track various parameters such as heart rate, running pace, and cadence over a specific time interval.
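Before building one from realistic data, here is a minimal sketch of what a MultiIndex looks like (all numbers made up for illustration):

```python
import pandas as pd

# Toy values only: two athletes, two parameters per athlete
index = pd.MultiIndex.from_product(
    [["Bob", "Alice"], ["heart_rate", "cadence"]],
    names=["athlete", "parameter"],
)
s = pd.Series([150.0, 172.0, 135.0, 164.0], index=index)

# .xs selects everything under one label of a given level
bob = s.xs("Bob", level="athlete")
```

The two levels, `athlete` and `parameter`, act like nested dimensions: selecting on one level returns the data indexed by the remaining levels.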

Synthetic Health Performance Data

To work with this type of data, let's begin by writing Python code that simulates health performance data, specifically heart rates and running cadences:

from __future__ import annotations
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

start = datetime(2023, 6, 8, 14)
end = start + timedelta(hours=1, minutes=40)
timestamps = pd.date_range(start, end, freq=timedelta(minutes=1), inclusive='left')

def get_heart_rate(begin_hr: int, end_hr: int, break_point: int) -> pd.Series[float]:
    # Linear ramp from begin_hr to end_hr over the first break_point minutes,
    # then a constant plateau, with Gaussian noise on top
    noise = np.random.normal(loc=0.0, scale=3, size=100)
    heart_rate = np.concatenate((np.linspace(begin_hr, end_hr, num=break_point), [end_hr] * (100 - break_point))) + noise
    return pd.Series(data=heart_rate, index=timestamps)

def get_cadence(mean_cadence: int) -> pd.Series[float]:
    # Constant mean cadence with noise; NaN-ing two of every three samples and
    # forward-filling makes the noise update only every third minute
    noise = np.random.normal(loc=0.0, scale=1, size=100)
    cadence = pd.Series(data=[mean_cadence] * 100 + noise, index=timestamps)
    cadence[::3] = np.nan
    cadence[1::3] = np.nan
    return cadence.ffill().fillna(mean_cadence)

The code snippet provided showcases the generation of synthetic data for heart rate and cadence during a sports activity. It begins by importing the necessary modules such as datetime, numpy, and pandas.

The duration of the sports activity is defined as 100 minutes, and the pd.date_range function is utilized to generate a series of timestamps at one-minute intervals to cover this period.

The get_heart_rate function generates synthetic heart rate data, assuming a linear increase in heart rate up to a certain level, followed by a constant level for the remainder of the activity. Gaussian noise is introduced to add variability to the heart rate data, making it more realistic.

Similarly, the get_cadence function generates synthetic cadence data, assuming a relatively constant cadence throughout the activity. Gaussian noise is added to create variability in the cadence values, with the noise values being updated every three minutes instead of every minute, reflecting the stability of cadence compared to heart rates.

With the data generation functions in place, it is now possible to create synthetic data for two athletes, Bob and Alice:

bob_hr = get_heart_rate(begin_hr=110, end_hr=160, break_point=20)
alice_hr = get_heart_rate(begin_hr=90, end_hr=140, break_point=50)
bob_cadence = get_cadence(mean_cadence=175)
alice_cadence = get_cadence(mean_cadence=165)

At this point, we have the heart rates and cadences of Bob and Alice. Let's plot them using matplotlib to get some more insight into the data:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

date_formatter = mdates.DateFormatter('%H:%M:%S')  # Customize the date format as needed

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(date_formatter)
ax.plot(bob_hr, color="red", label="Heart Rate Bob", marker=".")
ax.plot(alice_hr, color="red", label="Heart Rate Alice", marker="v")
ax.grid()
ax.legend()
ax.set_ylabel("Heart Rate [BPM]")
ax.set_xlabel("Time")

ax_cadence = ax.twinx()
ax_cadence.plot(bob_cadence, color="purple", label="Cadence Bob", marker=".", alpha=0.5)
ax_cadence.plot(alice_cadence, color="purple", label="Cadence Alice", marker="v", alpha=0.5)
ax_cadence.legend()
ax_cadence.set_ylabel("Cadence [SPM]")
ax_cadence.set_ylim(158, 180)

Health performance plot

Great! The initial analysis of the data provides interesting observations. We can easily distinguish the differences between Bob and Alice in terms of their maximum heart rate and the rate at which it increases. Additionally, Bob's cadence appears to be notably higher than Alice's.

Using Dataframes for Scalability

However, as you might have already noticed, the current approach of using separate variables (bob_hr, alice_hr, bob_cadence, and alice_cadence) for each health parameter and athlete is not scalable. In real-world scenarios with a larger number of athletes and health parameters, this approach quickly becomes impractical and cumbersome.

To address this issue, we can leverage the power of pandas by utilizing a pandas DataFrame to represent the data for multiple athletes and health parameters. By organizing the data in a tabular format, we can easily manage and analyze multiple variables simultaneously.

Each row of the DataFrame can correspond to a specific timestamp, and each column can represent a health parameter for a particular athlete. This structure allows for efficient storage and manipulation of multidimensional data.

By using a DataFrame, we can eliminate the need for separate variables and store all the data in a single object. This enhances code clarity, simplifies data handling, and provides a more intuitive representation of the overall dataset.

bob_df = pd.concat([bob_hr.rename("heart_rate"), bob_cadence.rename("cadence")], axis="columns")

This is what the DataFrame for Bob’s health data looks like:

heart_rate cadence
2023-06-08 14:00:00 112.359 175
2023-06-08 14:01:00 107.204 175
2023-06-08 14:02:00 116.617 175.513
2023-06-08 14:03:00 121.151 175.513
2023-06-08 14:04:00 123.27 175.513
2023-06-08 14:05:00 120.901 174.995
2023-06-08 14:06:00 130.24 174.995
2023-06-08 14:07:00 131.15 174.995
2023-06-08 14:08:00 131.402 174.669
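As an aside, the same pd.concat pattern scales to several athletes by creating a MultiIndex on the columns rather than the rows. A sketch with toy values standing in for the generated Series:

```python
import pandas as pd

timestamps = pd.date_range("2023-06-08 14:00", periods=3, freq="min")
bob = pd.DataFrame(
    {"heart_rate": [112.0, 107.0, 116.0], "cadence": [175.0, 175.0, 175.5]},
    index=timestamps,
)
alice = pd.DataFrame(
    {"heart_rate": [90.0, 92.0, 95.0], "cadence": [165.0, 165.0, 166.0]},
    index=timestamps,
)

# Passing a dict to pd.concat turns the keys into an extra column level,
# giving (athlete, parameter) tuples as column labels
wide = pd.concat({"Bob": bob, "Alice": alice}, axis="columns")
bob_hr = wide[("Bob", "heart_rate")]
```

This column-oriented layout is an alternative to the row-oriented MultiIndex built in the next section; which one is more convenient depends on whether you mostly slice by time or by athlete/parameter.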

Introducing Hierarchical Dataframes

The last dataframe looks better already! But now we still have to create a new dataframe for each athlete. This is where pandas MultiIndex can help. Let's take a look at how we can elegantly merge the data of multiple athletes and health parameters into one dataframe:

from itertools import product

# First, the approach for a single athlete and parameter:
bob_df = bob_hr.to_frame("values")
bob_df["athlete"] = "Bob"
bob_df["parameter"] = "heart_rate"

values = {
    "Bob": {
        "heart_rate": bob_hr,
        "cadence": bob_cadence,
    },
    "Alice": {
        "heart_rate": alice_hr,
        "cadence": alice_cadence
    }
}

sub_dataframes: list[pd.DataFrame] = []
for athlete, parameter in product(["Bob", "Alice"], ["heart_rate", "cadence"]):
    sub_df = values[athlete][parameter].to_frame("values")
    sub_df["athlete"] = athlete
    sub_df["parameter"] = parameter
    sub_dataframes.append(sub_df)

df = pd.concat(sub_dataframes).set_index(["athlete", "parameter"], append=True)
df.index = df.index.set_names(["timestamps", "athlete", "parameter"])

This code processes heart rate and cadence data for athletes Bob and Alice. It performs the following steps:

  1. Create a DataFrame for Bob's heart rate data and add metadata columns for athlete and parameter.
  2. Define a dictionary that stores heart rate and cadence data for Bob and Alice.
  3. Generate combinations of athletes and parameters (Bob/Alice and heart_rate/cadence).
  4. For each combination, create a sub-dataframe with the corresponding data and metadata columns.
  5. Concatenate all sub-dataframes into a single dataframe.
  6. Set the index to include levels for timestamps, athlete, and parameter. This is where the actual MultiIndex is created.

This is what the hierarchical dataframe df looks like:

values
(Timestamp('2023-06-08 14:00:00'), 'Bob', 'heart_rate') 112.359
(Timestamp('2023-06-08 14:01:00'), 'Bob', 'heart_rate') 107.204
(Timestamp('2023-06-08 14:02:00'), 'Bob', 'heart_rate') 116.617
(Timestamp('2023-06-08 14:03:00'), 'Bob', 'heart_rate') 121.151
(Timestamp('2023-06-08 14:04:00'), 'Bob', 'heart_rate') 123.27
(Timestamp('2023-06-08 14:05:00'), 'Bob', 'heart_rate') 120.901
(Timestamp('2023-06-08 14:06:00'), 'Bob', 'heart_rate') 130.24
(Timestamp('2023-06-08 14:07:00'), 'Bob', 'heart_rate') 131.15
(Timestamp('2023-06-08 14:08:00'), 'Bob', 'heart_rate') 131.402
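As a side note, the loop-and-concat pattern above can be condensed: pd.concat accepts a dict and builds the key levels of the MultiIndex directly from the dict keys. A sketch with toy values standing in for the generated Series:

```python
import pandas as pd

timestamps = pd.date_range("2023-06-08 14:00", periods=3, freq="min")

# Toy stand-ins for the generated heart rate / cadence Series
values = {
    ("Bob", "heart_rate"): pd.Series([110.0, 112.0, 115.0], index=timestamps),
    ("Bob", "cadence"): pd.Series([175.0, 175.0, 176.0], index=timestamps),
    ("Alice", "heart_rate"): pd.Series([90.0, 92.0, 95.0], index=timestamps),
    ("Alice", "cadence"): pd.Series([165.0, 165.0, 166.0], index=timestamps),
}

# pd.concat builds the outer index levels from the dict keys;
# each Series' own index becomes the innermost level
df = pd.concat(values, names=["athlete", "parameter"]).to_frame("values")
df.index = df.index.set_names("timestamps", level=-1)
df = df.reorder_levels(["timestamps", "athlete", "parameter"]).sort_index()
```

reorder_levels then puts the timestamps first, matching the layout built above.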

At this point, we have a single dataframe that holds all the information for an arbitrary number of athletes and health parameters. We can now use the .xs method to query the hierarchical dataframe:

df.xs("Bob", level="athlete")  # get all health data for Bob
values
(Timestamp('2023-06-08 14:00:00'), 'heart_rate') 112.359
(Timestamp('2023-06-08 14:01:00'), 'heart_rate') 107.204
(Timestamp('2023-06-08 14:02:00'), 'heart_rate') 116.617
(Timestamp('2023-06-08 14:03:00'), 'heart_rate') 121.151
(Timestamp('2023-06-08 14:04:00'), 'heart_rate') 123.27
df.xs("heart_rate", level="parameter")  # get all heart rates
values
(Timestamp('2023-06-08 14:00:00'), 'Bob') 112.359
(Timestamp('2023-06-08 14:01:00'), 'Bob') 107.204
(Timestamp('2023-06-08 14:02:00'), 'Bob') 116.617
(Timestamp('2023-06-08 14:03:00'), 'Bob') 121.151
(Timestamp('2023-06-08 14:04:00'), 'Bob') 123.27
df.xs("Bob", level="athlete").xs("heart_rate", level="parameter")  # get heart_rate data for Bob
timestamps values
2023-06-08 14:00:00 112.359
2023-06-08 14:01:00 107.204
2023-06-08 14:02:00 116.617
2023-06-08 14:03:00 121.151
2023-06-08 14:04:00 123.27
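Chaining .xs calls works, but each call drops a level from the result. When you want to select on several levels at once while keeping the full index, .loc with pd.IndexSlice is an alternative. A self-contained sketch with toy values:

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.date_range("2023-06-08 14:00", periods=2, freq="min"),
     ["Alice", "Bob"],
     ["cadence", "heart_rate"]],
    names=["timestamps", "athlete", "parameter"],
)
df = pd.DataFrame({"values": [float(i) for i in range(8)]}, index=idx).sort_index()

# .loc with pd.IndexSlice selects on several levels at once and,
# unlike .xs, keeps all three index levels in the result
ixs = pd.IndexSlice
bob_hr = df.loc[ixs[:, "Bob", "heart_rate"], :]
```

Note the sort_index call: label-based slicing on a MultiIndex is most reliable when the index is lexsorted.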

Use Case: Earth Temperature Changes

To demonstrate the power of hierarchical dataframes, let's explore a more complex, real-world use case: analyzing changes in Earth's surface temperatures over the last decades. For this task, we'll use a dataset available on Kaggle, which summarizes the Global Surface Temperature Change data distributed by the National Aeronautics and Space Administration Goddard Institute for Space Studies (NASA-GISS).

Inspect and Transform Original Data

Let's begin by reading and inspecting the data. This step is crucial to gain a better understanding of the dataset's structure and contents before delving into the analysis. Here's how we can accomplish that using pandas:

from pathlib import Path

file_path = Path() / "data" / "Environment_Temperature_change_E_All_Data_NOFLAG.csv"
df = pd.read_csv(file_path, encoding='cp1252')
df.describe()
       Area Code  Months Code  Element Code     Y1961     Y1962  ...     Y2018     Y2019
count       9656         9656          9656      8287      8322  ...      8349      8365
mean     821.806      7009.88        6674.5  0.402433  0.315527  ...   1.01083    1.0946
std      1781.07      6.03825       596.531  0.701567  0.713777  ...  0.872199  0.853953
min            1         7001          6078    -4.018    -5.391  ...    -2.216    -2.644
25%           78         7005          6078     0.057    -0.033  ...     0.434     0.455
50%        153.5         7009        6674.5     0.366     0.333  ...      0.81     0.939
75%       226.25         7016          7271    0.6765     0.627  ...     1.341     1.508
max         5873         7020          7271     5.771     4.373  ...     9.228     7.215

From this initial inspection, it becomes evident that the data is organized in a single dataframe, with separate rows for different months and countries. However, the values for different years are spread across several columns in the dataframe, labeled with the prefix 'Y'. This format makes it challenging to read and visualize the data effectively. To address this issue, we will transform the data into a more structured and hierarchical dataframe format, enabling us to query and visualize the data more conveniently.
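Before building a transformer class for this, it's worth noting that a vectorized reshape with melt and pivot_table can achieve much the same result. A sketch on a toy frame that mimics the CSV layout (only the Y-prefixed year columns and element names are taken from the real file; the values are made up):

```python
import pandas as pd

# Toy frame mimicking the CSV layout: one row per area/month/element,
# one column per year
wide = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Months": ["January", "January"],
    "Element": ["Temperature change", "Standard Deviation"],
    "Y1961": [0.777, 1.95],
    "Y1962": [0.062, 1.95],
})

# melt turns the year columns into rows: one observation per row
long = wide.melt(
    id_vars=["Area", "Months", "Element"],
    var_name="year",
    value_name="value",
)
long["year"] = long["year"].str.lstrip("Y").astype(int)

# pivot_table then yields a hierarchical (year, Area) index with
# one column per element
tidy = long.pivot_table(index=["year", "Area"], columns="Element", values="value")
```

This approach is faster than row-by-row processing, but the class-based transformer below gives finer control over validation and filtering, which is why the post takes that route.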

from dataclasses import dataclass, field
from datetime import date
from pydantic import BaseModel

MONTHS = {
    "January": 1,
    "February": 2,
    "March": 3,
    "April": 4,
    "May": 5,
    "June": 6,
    "July": 7,
    "August": 8,
    "September": 9,
    "October": 10,
    "November": 11,
    "December": 12
}

class GistempDataElement(BaseModel):
    area: str
    timestamp: date
    value: float

@dataclass
class GistempTransformer:
    temperature_changes: list[GistempDataElement] = field(default_factory=list)
    standard_deviations: list[GistempDataElement] = field(default_factory=list)

    def _process_row(self, row) -> None:
        relevant_elements = ["Temperature change", "Standard Deviation"]
        if (element := row["Element"]) not in relevant_elements or (month := MONTHS.get(row["Months"])) is None:
            return None

        for year, value in row.filter(regex="Y.*").items():
            new_element = GistempDataElement(
                timestamp=date(year=int(year.replace("Y", "")), month=month, day=1),
                area=row["Area"],
                value=value
            )
            if element == "Temperature change":
                self.temperature_changes.append(new_element)
            else:
                self.standard_deviations.append(new_element)

    @property
    def df(self) -> pd.DataFrame:
        # .dict() is the pydantic v1 API; with pydantic v2, use .model_dump() instead
        temp_changes_df = pd.DataFrame.from_records([elem.dict() for elem in self.temperature_changes])
        temp_changes = temp_changes_df.set_index(["timestamp", "area"]).rename(columns={"value": "temp_change"})

        std_deviations_df = pd.DataFrame.from_records([elem.dict() for elem in self.standard_deviations])
        std_deviations = std_deviations_df.set_index(["timestamp", "area"]).rename(columns={"value": "std_deviation"})

        return pd.concat([temp_changes, std_deviations], axis="columns")

    def process(self):
        environment_data = Path() / "data" / "Environment_Temperature_change_E_All_Data_NOFLAG.csv"
        df = pd.read_csv(environment_data, encoding='cp1252')
        df.apply(self._process_row, axis="columns")

This code introduces the GistempTransformer class, which demonstrates the processing of temperature data from a CSV file and the creation of a hierarchical DataFrame containing temperature changes and standard deviations.

The GistempTransformer class, defined as a dataclass, includes two lists, temperature_changes and standard_deviations, to store the processed data elements. The _process_row method is responsible for handling each row of the input DataFrame. It checks for relevant elements, such as "Temperature change" and "Standard Deviation," extracts the month from the "Months" column, and creates instances of the GistempDataElement class. These instances are then appended to the appropriate lists based on the element type.

The df property returns a DataFrame by combining the temperature_changes and standard_deviations lists. This hierarchical DataFrame has a MultiIndex with levels representing the timestamp and area, providing a structured organization of the data.

transformer = GistempTransformer()
transformer.process()
df = transformer.df
temp_change std_deviation
(datetime.date(1961, 1, 1), 'Afghanistan') 0.777 1.95
(datetime.date(1962, 1, 1), 'Afghanistan') 0.062 1.95
(datetime.date(1963, 1, 1), 'Afghanistan') 2.744 1.95
(datetime.date(1964, 1, 1), 'Afghanistan') -5.232 1.95
(datetime.date(1965, 1, 1), 'Afghanistan') 1.868 1.95
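With a date level in the index, another handy query (besides the get_level_values filtering used below) is partial slicing on the first level of a sorted MultiIndex. A toy sketch with made-up values:

```python
from datetime import date
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[date(1961, 1, 1), date(1962, 1, 1), date(1963, 1, 1)],
     ["Afghanistan", "Albania"]],
    names=["timestamp", "area"],
)
df = pd.DataFrame(
    {"temp_change": [0.777, 0.1, 0.062, 0.2, 2.744, 0.3]}, index=idx
)

# On a sorted MultiIndex, .loc supports partial slicing on the first level:
# all rows from 1962 onward, for every area
recent = df.sort_index().loc[date(1962, 1, 1):]
```

The slice only needs to mention the outermost level; the remaining levels are kept in full.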

Analyzing Climate Data

Now that we have consolidated all the relevant data into a single dataframe, we can proceed with inspecting and visualizing the data. Our focus is on examining the linear regression lines for each area, as they provide insights into the overall trend of temperature changes over the past decades. To facilitate this visualization, we will create a function that plots the temperature changes along with their corresponding regression lines.

def plot_temperature_changes(areas: list[str]) -> None:
    fig = plt.figure(figsize=(12, 6))
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)

    for area in areas:
        df_country = df[df.index.get_level_values("area") == area].reset_index()
        dates = df_country["timestamp"].map(datetime.toordinal)
        gradient, offset = np.polyfit(dates, df_country.temp_change, deg=1)
        ax1.scatter(df_country.timestamp, df_country.temp_change, label=area, s=5)
        ax2.plot(df_country.timestamp, gradient * dates + offset, label=area)

    ax1.grid()
    ax2.grid()
    ax2.legend()
    ax2.set_ylabel("Regression Lines [°C]")
    ax1.set_ylabel("Temperature change [°C]")

In this function, we use the get_level_values method of the pandas MultiIndex to efficiently query the hierarchical dataframe at different levels. Let's use this function to visualize temperature changes across the continents:

plot_temperature_changes(["Africa", "Antarctica", "Americas", "Asia", "Europe", "Oceania"])

plot of temperature changes in different continents

From this plot, we can draw several key conclusions:

  • The regression lines for all continents have a positive gradient, indicating a global trend of increasing Earth surface temperatures.
  • The regression line for Europe is notably steeper compared to other continents, implying that the temperature increase in Europe has been more pronounced. This finding aligns with observations of accelerated warming in Europe compared to other regions.
  • The specific factors contributing to the higher temperature increase in Europe compared to Antarctica are complex and require detailed scientific research. However, one contributing factor may be the influence of ocean currents. Europe is influenced by warm ocean currents, such as the Gulf Stream, which transport heat from the tropics towards the region. These currents play a role in moderating temperatures and can contribute to the relatively higher warming observed in Europe. In contrast, Antarctica is surrounded by cold ocean currents, and its climate is heavily influenced by the Southern Ocean and the Antarctic Circumpolar Current, which act as barriers to the incursion of warmer waters, thereby limiting the warming effect.

Now, let's focus our analysis on Europe itself by examining temperature changes in different regions within Europe. We can achieve this by creating individual plots for each European region:

plot_temperature_changes(["Southern Europe", "Eastern Europe", "Northern Europe", "Western Europe"])

plot of temperature changes in Europe

From the plotted temperature changes in different regions of Europe, we observe that the overall temperature rises across the European continent are quite similar. While there may be slight variations in the steepness of the regression lines between regions, such as Eastern Europe having a slightly steeper line compared to Southern Europe, no significant differences can be observed among the regions.

Ten Countries Most and Least Affected by Climate Change

Now, let's shift our focus to identifying the top 10 countries that have experienced the highest average temperature increase since the year 2000. Here's an example of how we can retrieve the list of countries:

(
    df[df.index.get_level_values(level="timestamp") > date(2000, 1, 1)]
    .groupby("area")
    .mean()
    .sort_values(by="temp_change", ascending=False)
    .head(10)
)
area temp_change std_deviation
Svalbard and Jan Mayen Islands 2.61541 2.48572
Estonia 1.69048 nan
Kuwait 1.6825 1.12843
Belarus 1.66113 nan
Finland 1.65906 2.15634
Slovenia 1.6555 nan
Russian Federation 1.64507 nan
Bahrain 1.64209 0.937431
Eastern Europe 1.62868 0.970377
Austria 1.62721 1.56392

To extract the top 10 countries with the highest average temperature increase since the year 2000, we perform the following steps:

  1. Filter the dataframe to include only rows with timestamps after January 1, 2000, using df.index.get_level_values(level="timestamp") > date(2000, 1, 1).
  2. Group the data by country using .groupby("area").
  3. Calculate the mean temperature change for each country using .mean().
  4. Select the top 10 countries with the largest mean temperature change using .sort_values(by="temp_change", ascending=False).head(10).

This result aligns with our previous observations, confirming that Europe experienced the highest rise in temperature compared to other continents.

Continuing with our analysis, let's now explore the ten countries that are least affected by the rise in temperature. We can utilize the same method as before to extract this information. Here's an example of how we can retrieve the list of countries:

(
    df[df.index.get_level_values(level="timestamp") > date(2000, 1, 1)]
    .groupby("area")
    .mean()
    .sort_values(by="temp_change", ascending=True)
    .head(10)
)
area temp_change std_deviation
Pitcairn Islands 0.157284 0.713095
Marshall Islands 0.178335 nan
South Georgia and the South Sandwich Islands 0.252101 1.11
Micronesia (Federated States of) 0.291996 nan
Chile 0.297607 0.534071
Wake Island 0.306269 nan
Norfolk Island 0.410659 0.594073
Argentina 0.488159 0.91559
Zimbabwe 0.493519 0.764067
Antarctica 0.527987 1.55841

We observe that the majority of countries in this list are small, remote islands located in the southern hemisphere. This finding further supports our previous conclusions that southern continents, particularly Antarctica, are less affected by climate change compared to other regions.

Temperature Changes during Summer and Winter

Now, let's delve into more complex queries using the hierarchical dataframe. In this specific use case, our focus is on analyzing temperature changes during winters and summers. For the purpose of this analysis, we define winters as the months of December, January, and February, while summers encompass June, July, and August. By leveraging the power of pandas and the hierarchical dataframe, we can easily visualize the temperature changes during these seasons in Europe. Here's an example code snippet to accomplish that:

all_winters = df[df.index.get_level_values(level="timestamp").map(lambda x: x.month in [12, 1, 2])]
all_summers = df[df.index.get_level_values(level="timestamp").map(lambda x: x.month in [6, 7, 8])]
winters_europe = all_winters.xs("Europe", level="area").sort_index()
summers_europe = all_summers.xs("Europe", level="area").sort_index()

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
ax.plot(winters_europe.index, winters_europe.temp_change, label="Winters", marker="o", markersize=4)
ax.plot(summers_europe.index, summers_europe.temp_change, label="Summers", marker='o', markersize=4)
ax.grid()
ax.legend()
ax.set_ylabel("Temperature Change [°C]")

Temperature changes during winters and summers

From this figure, we can observe that temperature changes during the winters exhibit greater volatility compared to temperature changes during the summers. To quantify this difference, let's calculate the standard deviation of the temperature changes for both seasons:

pd.concat([winters_europe.std().rename("winters"), summers_europe.std().rename("summers")], axis="columns")
winters summers
temp_change 1.82008 0.696666
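The seasonal split above can also be expressed as a single groupby by deriving a season label from the timestamp level, avoiding the two intermediate filtered frames. A sketch on toy values:

```python
from datetime import date
import pandas as pd

idx = pd.MultiIndex.from_product(
    [[date(2000, m, 1) for m in (1, 2, 6, 7, 12)], ["Europe"]],
    names=["timestamp", "area"],
)
df = pd.DataFrame({"temp_change": [1.0, -1.0, 0.4, 0.6, 2.0]}, index=idx)

# Derive a season label from the month of the timestamp level and
# group on it directly
season = df.index.get_level_values("timestamp").map(
    lambda d: "winter" if d.month in (12, 1, 2) else "summer"
)
stds = df.groupby(season)["temp_change"].std()
```

The resulting Series has one standard deviation per season, mirroring the two-column comparison above.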

Conclusion

In conclusion, mastering MultiIndexes in Pandas provides a powerful tool for handling complex data analysis tasks. By leveraging MultiIndexes, users can efficiently organize and analyze multidimensional datasets in a flexible and intuitive manner. The ability to work with hierarchical structures for rows and columns enhances code clarity, simplifies data handling, and enables simultaneous analysis of multiple variables. Whether it's tracking health parameters of athletes or analyzing Earth's temperature changes over time, understanding and utilizing MultiIndexes in Pandas unlocks the full potential of the library for handling complex data scenarios.

You can find all code included in this post here: https://github.com/GlennViroux/pandas-multi-index-blog
