A few days ago, while working on a risk-management project, I ran into a simple problem I had never seen before.
## The Problem
I needed to group the rows of a transaction dataset by customer ID. The dataset contains 8,250 transactions for 1,125 customers of a bank.
Here are the first five rows:
id | OVD_t1 | OVD_t2 | OVD_t3 | OVD_sum | pay_normal | prod_code | prod_limit | update_date | new_balance | highest_balance | report_date |
---|---|---|---|---|---|---|---|---|---|---|---|
54982353 | 0 | 0 | 0 | 0 | 19 | 10 | NaN | 16/07/2011 | 31677.6 | 204037.0 | 14/12/2015 |
54982353 | 0 | 0 | 0 | 0 | 11 | 2 | NaN | 21/02/2001 | 0.0 | 248874.0 | 22/06/2007 |
54982353 | 0 | 0 | 0 | 0 | 1 | 10 | NaN | 08/01/2006 | 0.0 | 1622.0 | NaN |
54982353 | 0 | 0 | 0 | 0 | 1 | 10 | NaN | 21/11/2006 | 0.0 | NaN | NaN |
54982353 | 0 | 0 | 0 | 0 | 0 | 10 | NaN | 30/09/2005 | 0.0 | NaN | NaN |
As you can see, these first rows all share the same ID; I needed to group the transactions under the customer ID they belong to.
With no solution in mind, I searched through my data science books, especially the ones on data analysis and feature engineering, but found nothing. So I turned to Stack Overflow, where I eventually found something close to my situation, although I still had to adapt it to the characteristics of the project.
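Throughout the post, `payment_data` is the transactions dataframe. Since the original loading code isn't shown, here is a minimal sketch of how it might be read in; the file name `payment_data.csv` and the date handling are my assumptions:

```python
import pandas as pd

# Hypothetical file name; adjust to wherever the dataset actually lives
payment_data = pd.read_csv(
    'payment_data.csv',
    parse_dates=['update_date', 'report_date'],  # dates appear as dd/mm/yyyy
    dayfirst=True,
)

print(payment_data.shape)  # expected: (8250, 12)
```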
## The Solution
To solve this problem you need pandas' well-known `groupby`. First, we are going to group the `prod_code` values by `id` (if you want to know more, the full project is on GitHub via the link above), collecting each customer's various `prod_code` values into a list with the following code:
```python
# Collect every prod_code of each customer into a single list
df_0 = payment_data.groupby(['id'])['prod_code'].apply(list).reset_index()
df_0.head(10)
```
This is the output:
id | prod_code |
---|---|
54982353 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1... |
54982356 | [27, 27, 10, 6, 6, 12, 10] |
54982387 | [1, 6, 6, 6, 6, 1, 10, 10, 1, 10, 10] |
54982463 | [5, 13] |
54982530 | [10, 10, 10, 10] |
54982549 | [17, 6, 6, 1, 2, 5, 17, 17, 17, 17] |
54982579 | [2, 10, 10, 10, 10, 10, 2, 10, 10, 1, 10, 10, ... |
54982665 | [10, 10, 5, 10] |
54982697 | [10, 10] |
54982721 | [10, 6, 1, 5, 13, 10, 5, 10, 13, 12] |
54982725 | [10] |
As we can see, we now have all the `prod_code` values (one for every transaction made) grouped into a list per `id`.
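As a side note, `.agg(list)` is an equivalent, slightly more idiomatic spelling, and if you only care about the distinct codes per customer you can deduplicate inside the aggregation. A small sketch of both variants, not part of the original pipeline:

```python
# Same result as .apply(list), written as an aggregation
df_codes = payment_data.groupby('id')['prod_code'].agg(list).reset_index()

# Distinct product codes per customer instead of the full history
df_unique = (
    payment_data.groupby('id')['prod_code']
    .agg(lambda s: sorted(s.unique()))
    .reset_index()
)
```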
The next feature I needed to group was `new_balance`, where I decided to take the mean of all the new balances for every customer:
```python
import numpy as np

# Mean new_balance per customer
df_1 = payment_data.groupby(['id'])['new_balance'].apply(np.mean).reset_index()
df_1.head(10)
```
id | new_balance |
---|---|
54982353 | 42033.133333 |
54982356 | 2277.771429 |
54982387 | 117455.127273 |
54982463 | 39890.400000 |
54982530 | 1261.200000 |
54982549 | 410060.400000 |
54982579 | 86454.654545 |
54982665 | 40112.400000 |
54982697 | 8752.800000 |
54982721 | 49843.440000 |
54982725 | 15817.200000 |
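For what it's worth, the built-in groupby aggregation produces the same table without the detour through `apply`; a minimal equivalent sketch:

```python
# Idiomatic equivalent of .apply(np.mean)
df_1_alt = payment_data.groupby('id')['new_balance'].mean().reset_index()
```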
Now, I had to group all the different types of overdue payment: `OVD_t1`, `OVD_t2`, and `OVD_t3`. First, I summed each type by its corresponding ID, and then merged all three types together into one dataframe. For this I had to use the function `reduce`, which applies `pd.merge` pairwise, so that `reduce(merge, [a, b, c])` amounts to `merge(merge(a, b), c)`:
```python
from functools import reduce

# Sum each overdue-payment type per customer
df_OVD_t1 = payment_data.groupby(['id'])['OVD_t1'].apply(np.sum).reset_index()
df_OVD_t2 = payment_data.groupby(['id'])['OVD_t2'].apply(np.sum).reset_index()
df_OVD_t3 = payment_data.groupby(['id'])['OVD_t3'].apply(np.sum).reset_index()

# Merge the three dataframes pairwise on id
OVD_dfs = [df_OVD_t1, df_OVD_t2, df_OVD_t3]
df_2 = reduce(lambda left, right: pd.merge(left, right, on='id'), OVD_dfs)
df_2.head(10)
```
id | OVD_t1 | OVD_t2 | OVD_t3 |
---|---|---|---|
54982353 | 3 | 2 | 38 |
54982356 | 0 | 0 | 0 |
54982387 | 3 | 0 | 0 |
54982463 | 0 | 0 | 0 |
54982530 | 0 | 0 | 0 |
54982549 | 6 | 2 | 0 |
54982579 | 1 | 0 | 0 |
54982665 | 0 | 2 | 26 |
54982697 | 0 | 0 | 0 |
54982721 | 2 | 1 | 43 |
54982725 | 0 | 0 | 0 |
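Incidentally, since all three columns receive the same aggregation, the same table can also be produced in a single `groupby` call, with no `reduce` needed; a sketch:

```python
# One groupby over all three OVD columns at once
df_2_alt = (
    payment_data.groupby('id')[['OVD_t1', 'OVD_t2', 'OVD_t3']]
    .sum()
    .reset_index()
)
```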
The final feature I needed to group was `pay_normal`. Here, the only operation I applied was `np.sum`, because I needed the total number of times the payment was normal for every client.
```python
# Total number of normal payments per customer
df_3 = payment_data.groupby(['id'])['pay_normal'].apply(np.sum).reset_index()
df_3.head(10)
```
id | pay_normal |
---|---|
54982353 | 229 |
54982356 | 117 |
54982387 | 246 |
54982463 | 34 |
54982530 | 56 |
54982549 | 211 |
54982579 | 323 |
54982665 | 34 |
54982697 | 45 |
54982721 | 109 |
54982725 | 36 |
## Merging All the DataFrames into One
The final step for this task was to merge all the dataframes I had created into a single one. For this, I used the `reduce` function again, in the same way as when grouping the OVD types above.
```python
# Chain pd.merge over all four grouped dataframes, joining on id
dfs = [df_0, df_1, df_2, df_3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
df_final.head(10)
```
id | prod_code | new_balance | OVD_t1 | OVD_t2 | OVD_t3 | pay_normal |
---|---|---|---|---|---|---|
54982353 | [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1... | 42033.133333 | 3 | 2 | 38 | 229 |
54982356 | [27, 27, 10, 6, 6, 12, 10] | 2277.771429 | 0 | 0 | 0 | 117 |
54982387 | [1, 6, 6, 6, 6, 1, 10, 10, 1, 10, 10] | 117455.127273 | 3 | 0 | 0 | 246 |
54982463 | [5, 13] | 39890.400000 | 0 | 0 | 0 | 34 |
54982530 | [10, 10, 10, 10] | 1261.200000 | 0 | 0 | 0 | 56 |
54982549 | [17, 6, 6, 1, 2, 5, 17, 17, 17, 17] | 410060.400000 | 6 | 2 | 0 | 211 |
54982579 | [2, 10, 10, 10, 10, 10, 2, 10, 10, 1, 10, 10, ... | 86454.654545 | 1 | 0 | 0 | 323 |
54982665 | [10, 10, 5, 10] | 40112.400000 | 0 | 2 | 26 | 34 |
54982697 | [10, 10] | 8752.800000 | 0 | 0 | 0 | 45 |
54982721 | [10, 6, 1, 5, 13, 10, 5, 10, 13, 12] | 49843.440000 | 2 | 1 | 43 | 109 |
54982725 | [10] | 15817.200000 | 0 | 0 | 0 | 36 |
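As a closing aside, pandas' named aggregation can build this same final table in one pass, skipping the intermediate dataframes and the merges entirely. A minimal sketch, not the approach taken above:

```python
# The whole pipeline as a single groupby with named aggregation
df_one_shot = (
    payment_data.groupby('id')
    .agg(
        prod_code=('prod_code', list),        # full list of product codes
        new_balance=('new_balance', 'mean'),  # average balance
        OVD_t1=('OVD_t1', 'sum'),
        OVD_t2=('OVD_t2', 'sum'),
        OVD_t3=('OVD_t3', 'sum'),
        pay_normal=('pay_normal', 'sum'),     # times payment was normal
    )
    .reset_index()
)
```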
## Conclusion
And that's it. I hope this tutorial helps you, and thanks for reading all the way through.