A few days ago, while working on a risk-management project, I ran into a simple problem I had never seen before.
## The Problem
I needed to group the rows of a transaction dataset by customer ID. The dataset contains 8,250 transactions for 1,125 customers of a bank.
Here are the first five rows:
id | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | prod_code | prod_limit | update_date | new_balance | highest_balance | report_date |
---|---|---|---|---|---|---|---|---|---|---|---|
54982353 | 0 | 0 | 0 | 0 | 19 | 10 | NaN | 16/07/2011 | 31677.6 | 204037.0 | 14/12/2015 |
54982353 | 0 | 0 | 0 | 0 | 11 | 2 | NaN | 21/02/2001 | 0.0 | 248874.0 | 22/06/2007 |
54982353 | 0 | 0 | 0 | 0 | 1 | 10 | NaN | 08/01/2006 | 0.0 | 1622.0 | NaN |
54982353 | 0 | 0 | 0 | 0 | 1 | 10 | NaN | 21/11/2006 | 0.0 | NaN | NaN |
54982353 | 0 | 0 | 0 | 0 | 0 | 10 | NaN | 30/09/2005 | 0.0 | NaN | NaN |
As you can see, these first rows all share the same ID; I needed to group the transactions under the customer ID they belong to.
With no solution in mind, I searched through my data science books, especially the ones on data analysis and feature engineering, but found nothing. So I turned to Stack Overflow, where I eventually found something close to my situation, although I still had to adapt it to the characteristics of the project.
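Throughout the post, `payment_data` is the transactions dataframe. Since the original loading code isn't shown, here is a minimal sketch of how it might be read in; the file name `payment_data.csv` and the date handling are my assumptions:

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the dataset actually lives
payment_data = pd.read_csv(
    'payment_data.csv',
    parse_dates=['update_date', 'report_date'],  # dates appear as dd/mm/yyyy
    dayfirst=True,
)

print(payment_data.shape)  # expected: (8250, 12)
```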
## The Solution
To solve this problem you need pandas' well-known `groupby`. First, we are going to group the `prod_code` values by `id` (if you want to know more, the full project is on GitHub via the link above), collecting each customer's various `prod_code` values into a list with the following code:
```python
# Collect every prod_code of each customer into a single list
df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)
```
This is the output:
id | prod_code |
---|---|
54982353 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1... |
54982356 | [27, 27, 10, 6, 6, 12, 10] |
54982387 | [1, 6, 6, 6, 6, 1, 10, 10, 1, 10, 10] |
54982463 | [5, 13] |
54982530 | [10, 10, 10, 10] |
54982549 | [17, 6, 6, 1, 2, 5, 17, 17, 17, 17] |
54982579 | [2, 10, 10, 10, 10, 10, 2, 10, 10, 1, 10, 10, ... |
54982665 | [10, 10, 5, 10] |
54982697 | [10, 10] |
54982721 | [10, 6, 1, 5, 13, 10, 5, 10, 13, 12] |
54982725 | [10] |
As we can see, we now have all the `prod_code` values (one for every transaction made) grouped into a list per `id`.
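As a side note, `.agg(list)` is an equivalent, slightly more idiomatic spelling, and if you only care about the distinct codes per customer you can deduplicate inside the aggregation. A small sketch of both variants, not part of the original pipeline:

```python
# Same result as .apply(list), written as an aggregation
df_codes = payment_data.groupby('id')['prod_code'].agg(list).reset_index()

# Distinct product codes per customer instead of the full history
df_unique = (
    payment_data.groupby('id')['prod_code']
    .agg(lambda s: sorted(s.unique()))
    .reset_index()
)
```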
The next feature I needed to group was `new_balance`, where I decided to take the mean of all the new balances for every customer:
```python
import numpy as np

# Mean new_balance per customer
df_1 = payment_data.groupby(['id'])['new_balance'].apply(np.mean).reset_index()
df_1.head(10)
```
id | new_balance |
---|---|
54982353 | 42033.133333 |
54982356 | 2277.771429 |
54982387 | 117455.127273 |
54982463 | 39890.400000 |
54982530 | 1261.200000 |
54982549 | 410060.400000 |
54982579 | 86454.654545 |
54982665 | 40112.400000 |
54982697 | 8752.800000 |
54982721 | 49843.440000 |
54982725 | 15817.200000 |
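For what it's worth, the built-in groupby aggregation produces the same table without the detour through `apply`; a minimal equivalent sketch:

```python
# Idiomatic equivalent of .apply(np.mean)
df_1_alt = payment_data.groupby('id')['new_balance'].mean().reset_index()
```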
Now, I had to group all the different types of overdue payment: `OVD_t1`, `OVD_t2`, and `OVD_t3`. First, I summed each type by its corresponding ID, and then merged all three types together into one dataframe. For this I had to use the function `reduce`, which applies `pd.merge` pairwise, so that `reduce(merge, [a, b, c])` amounts to `merge(merge(a, b), c)`:
```python
from functools import reduce

# Sum each overdue-payment type per customer
df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].apply(np.sum).reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].apply(np.sum).reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].apply(np.sum).reset_index()

# Merge the three dataframes pairwise on id
OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
```
id | OVD_t1 | OVD_t2 | OVD_t3 |
---|---|---|---|
54982353 | 3 | 2 | 38 |
54982356 | 0 | 0 | 0 |
54982387 | 3 | 0 | 0 |
54982463 | 0 | 0 | 0 |
54982530 | 0 | 0 | 0 |
54982549 | 6 | 2 | 0 |
54982579 | 1 | 0 | 0 |
54982665 | 0 | 2 | 26 |
54982697 | 0 | 0 | 0 |
54982721 | 2 | 1 | 43 |
54982725 | 0 | 0 | 0 |
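Incidentally, since all three columns receive the same aggregation, the same table can also be produced in a single `groupby` call, with no `reduce` needed; a sketch:

```python
# One groupby over all three OVD columns at once
df_2_alt = (
    payment_data.groupby('id')[['OVD_t1', 'OVD_t2', 'OVD_t3']]
    .sum()
    .reset_index()
)
```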
The final feature I needed to group was `pay_normal`. Here, the only operation I applied was `np.sum`, because I needed the total number of times the payment was normal for every client.
```python
# Total number of normal payments per customer
df_3 = payment_data.groupby(['id'])['pay_normal'].apply(np.sum).reset_index()
df_3.head(10)
```
id | pay_normal |
---|---|
54982353 | 229 |
54982356 | 117 |
54982387 | 246 |
54982463 | 34 |
54982530 | 56 |
54982549 | 211 |
54982579 | 323 |
54982665 | 34 |
54982697 | 45 |
54982721 | 109 |
54982725 | 36 |
## Merging All the DataFrames into One
The final step for this task was to merge all the dataframes I had created into a single one. For this, I used the `reduce` function again, in the same way as when grouping the OVD types above.
```python
# Chain pd.merge over all four grouped dataframes, joining on id
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
```
id | prod_code | new_balance | OVD_t1 | OVD_t2 | OVD_t3 | pay_normal |
---|---|---|---|---|---|---|
54982353 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1... | 42033.133333 | 3 | 2 | 38 | 229 |
54982356 | [27, 27, 10, 6, 6, 12, 10] | 2277.771429 | 0 | 0 | 0 | 117 |
54982387 | [1, 6, 6, 6, 6, 1, 10, 10, 1, 10, 10] | 117455.127273 | 3 | 0 | 0 | 246 |
54982463 | [5, 13] | 39890.400000 | 0 | 0 | 0 | 34 |
54982530 | [10, 10, 10, 10] | 1261.200000 | 0 | 0 | 0 | 56 |
54982549 | [17, 6, 6, 1, 2, 5, 17, 17, 17, 17] | 410060.400000 | 6 | 2 | 0 | 211 |
54982579 | [2, 10, 10, 10, 10, 10, 2, 10, 10, 1, 10, 10, ... | 86454.654545 | 1 | 0 | 0 | 323 |
54982665 | [10, 10, 5, 10] | 40112.400000 | 0 | 2 | 26 | 34 |
54982697 | [10, 10] | 8752.800000 | 0 | 0 | 0 | 45 |
54982721 | [10, 6, 1, 5, 13, 10, 5, 10, 13, 12] | 49843.440000 | 2 | 1 | 43 | 109 |
54982725 | [10] | 15817.200000 | 0 | 0 | 0 | 36 |
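As a closing aside, pandas' named aggregation can build this same final table in one pass, skipping the intermediate dataframes and the merges entirely. A minimal sketch, not the approach taken above:

```python
# The whole pipeline as a single groupby with named aggregation
df_one_shot = (
    payment_data.groupby('id')
    .agg(
        prod_code=('prod_code', list),        # full list of product codes
        new_balance=('new_balance', 'mean'),  # average balance
        OVD_t1=('OVD_t1', 'sum'),
        OVD_t2=('OVD_t2', 'sum'),
        OVD_t3=('OVD_t3', 'sum'),
        pay_normal=('pay_normal', 'sum'),     # times payment was normal
    )
    .reset_index()
)
```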
## Conclusion
And that's it. I hope this tutorial helps you, and thanks for reading all the way through.