Mage AI

Posted on Mar 2, 2022

Feature Engineering - Min/Max Aggregate

#machinelearning #rideshare #mage #python

TLDR

In this lesson, we’ll learn about the aggregate functions min() and max(), and see how they’re helpful in analyzing and understanding the data.

Outline

Data Aggregation
Why is it necessary
Definition
Example
How to code

Data Aggregation

Data aggregation is the summarization of data: many values are reduced to a single summary value. Some of the most common aggregate functions are min(), max(), mean(), count(), and sum().

Why is it necessary

Data aggregation is part of data analysis, which is the first and most critical step of model building. Summarizing the data lets us delve deeper into it and understand it better.

Definition

In this lesson, we’ll explore min() and max() functions in detail.

min(): This function helps us find the minimum or least value in a feature or column.
max(): This function helps us find the maximum or highest value in a feature or column.

We can apply aggregate functions in 2 different ways:
Case-1: Apply aggregate functions on a single feature or column i.e., analyzing each column individually.
Case-2: Apply aggregate functions on groups i.e., we’ll group rows and analyze each group individually.

Example

Consider a dataset with 2 columns, “Product” and “Price”. Let’s start with Case-1 and apply the aggregate functions min() and max() to find the minimum and maximum values in the “Price” column.

Find minimum price

Find maximum price
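
As a minimal sketch of Case-1, the snippet below builds a small data frame and calls min() and max() on the “Price” column. The product names come from this lesson, but the prices are made-up values for illustration, not the figures from the original example.

```python
import pandas as pd

# Hypothetical prices; only the column names and product names come from the lesson.
df = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair", "Laptop", "Desk", "Chair"],
    "Price": [900, 250, 80, 1200, 300, 120],
})

min_price = df["Price"].min()  # smallest value in the "Price" column
max_price = df["Price"].max()  # largest value in the "Price" column

print("Minimum price:", min_price)
print("Maximum price:", max_price)
```

Each call collapses the whole column into a single number, which is what makes it an aggregate.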

Now for Case-2, where we apply the aggregate functions on groups. Grouping is a 3-step process as shown below:
Step-1: Split the rows into groups based on the “Product” column.

There are 3 unique products (Laptop, Desk, Chair) in the “Product” column, so the rows are split into 3 groups.

Step-2: Find the minimum price of each unique product

Step-3: Display the output. For this, we’ll combine each group’s output to form a data frame and display the data frame.

Steps to find minimum value of each unique product

Steps to find maximum value of each unique product
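
The three steps above can be sketched in pandas as follows. This is an illustration under assumptions: the prices are hypothetical, and the variable and column names (groups, summary, min_price, max_price) are chosen here for readability.

```python
import pandas as pd

# Hypothetical prices; only the column names and product names come from the lesson.
df = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair", "Laptop", "Desk", "Chair"],
    "Price": [900, 250, 80, 1200, 300, 120],
})

# Step-1: split the rows into one group per unique product.
groups = df.groupby("Product")

# Step-2: aggregate the "Price" column within each group.
min_by_product = groups["Price"].min()
max_by_product = groups["Price"].max()

# Step-3: combine each group's output into one data frame and display it.
summary = pd.DataFrame({
    "min_price": min_by_product,
    "max_price": max_by_product,
}).reset_index()

print(summary)
```

The result has one row per product, so 6 input rows become 3 output rows.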

How to code

In recent years, the popularity of ridesharing has skyrocketed. The key benefits of ridesharing are that it’s inexpensive, convenient, and allows anyone to easily travel from one location to another.

Image by mohamed Hassan from Pixabay

Service providers frequently change prices based on time, traffic, the number of cabs available, and other factors. As costs fluctuate, it's beneficial to offer users a range of prices for a specific route. So, with the help of rides data, let’s find the minimum and maximum prices for each unique route.

Find the minimum and maximum price of each unique route.

Step-1:
First, let’s group rides by source. To do this, we’ll iterate through the rows of the rides data and save each “source” as a key of a dictionary. The value for each key is a list of (destination, price) pairs for the rides that start there. The final result should be as shown below.

Output format: {‘sourceA’: [(destination1, price1), (destination1, price2),...], ‘sourceB’:[(destination1, price1), (destination1, price2),...],....}
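
A minimal sketch of Step-1 in plain Python is shown below. The rows are hypothetical sample data: the station names and the 3.0, 32.5, and 27.5 prices appear in this lesson, while the remaining row and the tuple layout (source, destination, price) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows in the form (source, destination, price).
rides = [
    ("Haymarket Square", "North Station", 3.0),
    ("Haymarket Square", "North Station", 32.5),
    ("Haymarket Square", "West End", 27.5),
    ("Haymarket Square", "West End", 3.0),
    ("North Station", "Haymarket Square", 7.0),  # made-up row
]

# Each source maps to a list of (destination, price) pairs.
rides_by_source = defaultdict(list)
for source, destination, price in rides:
    rides_by_source[source].append((destination, price))

print(dict(rides_by_source))
```

Using defaultdict(list) avoids checking whether a source has been seen before appending to its list.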

Step-2:
Find minimum price

By comparing the prices of routes with the same starting location and destination, we'll find the minimum price of each route.

Lowest price of each unique route

Find maximum price

By comparing the prices of routes with the same starting point and destination, we'll find the highest price for each route.

Highest price of each unique route
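
One way to sketch Step-2 is to walk through the grouped dictionary and keep a running minimum and maximum per route. The input below is hypothetical sample data in the output format from Step-1, and the names min_price and max_price are chosen here for illustration.

```python
# Hypothetical grouped rides in the Step-1 output format.
rides_by_source = {
    "Haymarket Square": [
        ("North Station", 3.0),
        ("North Station", 32.5),
        ("West End", 27.5),
        ("West End", 3.0),
    ],
    "North Station": [("Haymarket Square", 7.0)],  # made-up row
}

min_price = {}
max_price = {}
for source, trips in rides_by_source.items():
    for destination, price in trips:
        route = (source, destination)
        # Compare the current price with the best value seen so far for this route.
        min_price[route] = min(price, min_price.get(route, price))
        max_price[route] = max(price, max_price.get(route, price))

for route in min_price:
    print(route, min_price[route], max_price[route])
```

Keying the result by the (source, destination) pair keeps rides in opposite directions as separate routes.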

From the output, we see that the price from “Haymarket Square” to “North Station” ranges between 3.0 and 32.5, “Haymarket Square” to “West End” ranges between 3.0 and 27.5, etc.

Group rows of the same route, and find the minimum and maximum price of each individual route.

Pandas data frames have a built-in groupby() method that groups the rows of a dataset by the values in one or more columns. Chaining min() or max() onto the grouped result gives the minimum or maximum value of each unique group.

Find minimum price

Find maximum price
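
A minimal pandas sketch of this approach follows. It assumes the rides data has columns named “source”, “destination”, and “price”, and it uses a few hypothetical rows rather than the real dataset.

```python
import pandas as pd

# Hypothetical sample rows; the column names are an assumption about the rides data.
rides = pd.DataFrame({
    "source": ["Haymarket Square", "Haymarket Square",
               "Haymarket Square", "Haymarket Square"],
    "destination": ["North Station", "North Station", "West End", "West End"],
    "price": [3.0, 32.5, 27.5, 3.0],
})

# Group rows that share the same source and destination, i.e. the same route.
routes = rides.groupby(["source", "destination"])

min_price = routes["price"].min()  # lowest price of each unique route
max_price = routes["price"].max()  # highest price of each unique route

print(min_price)
print(max_price)
```

If both values are needed side by side, routes["price"].agg(["min", "max"]) returns them as two columns of one data frame.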

Magical no code solution

For quick analysis and results, try our product, Mage. Our service features an "Edit data" area with multiple aggregation options. Besides analyzing the data, you can store the aggregation results in a new column for use in further analysis.