Alisha Rana

Dealing with Huge Data

It's quite common, especially in large companies, to have datasets that no longer fit in your computer's memory, or calculations that take so long you lose patience waiting for them. This means we need ways to make the data smaller in memory, or to sample it so we work with a subset. Often a representative sample of the full dataset is perfectly valid, and we can run our calculations and do our data science on that.
We'll import Pandas and load our data into a DataFrame.

import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head(5)

To examine the memory footprint of our loaded data:

df.memory_usage(deep=True)


(Screenshot: per-column output of df.memory_usage(deep=True), in bytes)
We pass deep=True because the memory footprint of object-dtype columns is ignored by default, and we don't want those columns to be overlooked here.
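To see the difference, you can compare the shallow and deep figures. The exact numbers depend on your dataset, but the object column ocean_proximity is the one whose real size only shows up with deep=True:

df.memory_usage()           # object columns counted only as 8-byte pointers per row
df.memory_usage(deep=True)  # includes the actual string contents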

Checking the dtype of columns:

df.dtypes

(Screenshot: output of df.dtypes, showing ocean_proximity as object)
Notice the dtype of ocean_proximity: it is object, i.e. strings.
Always keep in mind that strings can take up a lot of memory compared to numbers, which are stored very compactly.
We will override the ocean_proximity datatype with the pandas-specific categorical datatype.

df["ocean_proximity"] = df["ocean_proximity"].astype("category")

This improves our memory usage. Let's check the memory again:

df.memory_usage(deep=True)

(Screenshot: per-column memory usage after the conversion)
Wow! You can see it has dropped by more than half.
This is a simple way to make your DataFrame more memory-efficient.
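If you want a single number instead of the per-column breakdown, you can sum the figures (a quick sketch; your exact totals will differ):

total_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"DataFrame now uses roughly {total_mb:.1f} MB")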
However, the limitation of this technique is that the data is still fully loaded into memory before we convert it, so the footprint during loading remains substantial.
Fortunately, we can also set the datatype during the loading process:

df = pd.read_csv("data/housing.csv", dtype={"ocean_proximity": "category"})

Here we pass a dictionary whose keys are column names and whose values are datatypes, so you can convert as many columns as you like.
Pandas applies the datatypes while loading, which keeps the DataFrame's memory footprint down from the start.
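For example, sketching with the columns of this housing dataset, you could convert the string column and downcast the numeric columns in the same call:

df = pd.read_csv(
    "data/housing.csv",
    dtype={
        "ocean_proximity": "category",  # strings -> categorical codes
        "longitude": "float32",         # downcast from the default float64
        "latitude": "float32",
    },
)
df.memory_usage(deep=True)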

We might also not need every column. Instead of importing the whole dataset, we will construct a new dataframe and load the data as usual, but this time we will specify only the columns we need:

df_columns = pd.read_csv("data/housing.csv", usecols=["longitude", "latitude", "ocean_proximity"])

This is another excellent way to save memory when loading the data.
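The two tricks combine nicely; as a sketch, you can pass both usecols and the dtype dictionary in a single read_csv call:

df_small = pd.read_csv(
    "data/housing.csv",
    usecols=["longitude", "latitude", "ocean_proximity"],
    dtype={"ocean_proximity": "category"},
)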

Sometimes the issue isn't just with loading the data; sometimes it's with the computation itself, because we have a costly function. In these cases we need to sample our data, which pandas makes easy: every DataFrame has a sample method.

The random_state argument is really crucial if you want to replicate your analysis or hand it to a coworker or another data scientist. It is a really good habit to get into:

df_columns.sample(100, random_state=42)

If you want to repeat the sampling later, make sure your random process is reproducible by keeping the seed in one place:

random_state = 42
df_columns.sample(100, random_state=random_state)
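If you would rather sample a proportion of the rows than a fixed count, sample also accepts a frac argument (the 1% here is an arbitrary choice for illustration):

df_columns.sample(frac=0.01, random_state=random_state)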

I hope you now understand how to load data more efficiently and with a smaller memory footprint.
See you next time!
