TL;DR: read a sample of the data, infer the data types, then read the full data with the inferred data types.
- Context
- Steps
  1. Read a sample of the data
     - Use the `nrows` parameter
     - Use the `usecols` parameter
     - Use the `nrows` and `usecols` parameters together
  2. Inspect Data Types
     - Use the `dtypes` attribute
  3. Define data types manually
  4. Read the full data with the manually-defined data types
  5. Use the `chunksize` parameter if the data is too large to fit in memory
## Context

Reading large (and sometimes not-so-large) CSV files in Pandas is a common problem. It is not uncommon to find CSV files with millions of rows and hundreds of columns, and sometimes the data types of the columns are unknown.
## Steps
### 1. Read a sample of the data

#### Use the `nrows` parameter

The first step is to read a sample of the data. This is done with the `nrows` parameter of the `read_csv` function, and it helps you understand the data structure without loading the entire file into memory.
```python
import pandas as pd

df = pd.read_csv('data.csv', nrows=1000)
```
The `nrows` parameter can be set to any number, but it should be large enough to give a representative picture of the data structure, yet small enough not to use too much memory.
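One way to sanity-check your sample size is to measure the sample's memory footprint and extrapolate to the full file. A rough sketch (the row counts are illustrative):

```python
import pandas as pd

sample = pd.read_csv('data.csv', nrows=1000)

# Memory used by the 1000-row sample, in megabytes
sample_mb = sample.memory_usage(deep=True).sum() / 1024 ** 2

# Rough extrapolation: a file with ~1,000,000 rows would need
# roughly 1000x the sample's footprint when fully loaded.
print(f"Sample: {sample_mb:.2f} MB, estimated full load: {sample_mb * 1000:.0f} MB")
```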
#### Use the `usecols` parameter

You can also use less memory by reading only a subset of the columns with the `usecols` parameter, if you already know which columns you need.
```python
import pandas as pd

df = pd.read_csv('data.csv', usecols=['col1', 'col2'])
```
#### Use the `nrows` and `usecols` parameters together

You can use both parameters together to read only a subset of the columns from a sample of the rows.
```python
import pandas as pd

df = pd.read_csv('data.csv', nrows=1000, usecols=['col1', 'col2'])
```
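If you are not sure which columns the file contains, you can read just the header by setting `nrows=0` and use the result to decide on `usecols`:

```python
import pandas as pd

# nrows=0 reads only the header row, returning an empty DataFrame
# whose columns tell you what is available for usecols.
header_only = pd.read_csv('data.csv', nrows=0)
print(list(header_only.columns))
```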
### 2. Inspect Data Types

#### Use the `dtypes` attribute

The next step is to inspect the data types of the columns. This is done with the `dtypes` attribute of the DataFrame.
```python
print(df.dtypes)
```
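The `dtypes` output alone will not tell you which `object` columns deserve the `category` dtype; counting unique values in the sample helps with that. A small sketch, assuming `df` is the sample from step 1:

```python
# Columns with few unique values relative to the row count are
# good candidates for the 'category' dtype.
for col in df.select_dtypes(include='object').columns:
    print(f"{col}: {df[col].nunique()} unique values in {len(df)} rows")
```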
### 3. Define data types manually

Based on the analysis of the data sample, you can define the data types of the columns manually.
```python
dtypes_dict = {
    'col_str': str,              # String
    'col_int': int,              # Integer
    'col_float': float,          # Float
    'col_bool': bool,            # Boolean
    'col_category': 'category',  # Category
    'col_complex': complex,      # Complex number
    'col_bytes': bytes,          # Bytes
    'col_object': object,        # Generic Python object
}

# Note: datetime and timedelta columns cannot be cast through the
# dtype parameter of read_csv; handle datetimes with parse_dates
# (see step 4) and convert timedeltas after loading.
```
Here are the Pandas dtypes for reference.
Some of the not-so-common data types are:

- `category`: a pandas `Categorical` type, useful for columns with a limited number of unique values (e.g. `gender`, `country`, `city`). Check Categorical data for more information.
- `complex`: a complex number. Check Complex Numbers for more information.
- `bytes`: a sequence of bytes. Check Bytes and Bytearray for more information.
- `object`: a generic Python object. Check Object for more information.
- `datetime`: a single object containing all the information from a date object and a time object. Check datetime for more information.
- `timedelta`: a duration, the difference between two dates or times. Check timedelta for more information.
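If you would rather not write the dictionary by hand, you can derive a starting point from the dtypes pandas inferred on the sample and then override individual columns. A minimal sketch (the overridden column name is illustrative):

```python
# Start from the dtypes pandas inferred on the sample from step 1...
dtypes_dict = {col: dtype.name for col, dtype in df.dtypes.items()}

# ...then override the columns where you know better, e.g. a
# low-cardinality string column that should be categorical.
dtypes_dict['col_category'] = 'category'
print(dtypes_dict)
```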
### 4. Read the full data with the manually-defined data types

The next step is to load the data with the manually-defined data types from the earlier step. For this, you can use the `dtype` parameter of the `read_csv` function.
```python
import pandas as pd

df = pd.read_csv('data.csv', dtype=dtypes_dict)
```
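Because datetime and timedelta columns cannot be cast through `dtype`, a common pattern is to pass the datetime columns through `parse_dates` instead; the column name below is a placeholder:

```python
import pandas as pd

# dtype casts the regular columns; parse_dates handles the datetime
# column, which read_csv cannot cast via the dtype parameter.
df = pd.read_csv(
    'data.csv',
    dtype=dtypes_dict,
    parse_dates=['col_datetime'],  # placeholder column name
)
```

Timedelta columns can then be converted after loading with `pd.to_timedelta`.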
**Important**

If you get a `ValueError`, it means the data types you defined are not compatible with all of the data in the file; check them again using more rows or other chunks of the data. For that you can also use the `skiprows` parameter.
```python
import pandas as pd

# skiprows=range(1, 1001) skips the first 1000 data rows while
# keeping the header row, so column names are preserved.
df = pd.read_csv('data.csv', skiprows=range(1, 1001), nrows=1000)
```
### 5. Use the `chunksize` parameter if the data is too large to fit in memory

If the options above are not enough to load the data into memory, you can use the `chunksize` parameter, which returns an iterator for reading the data in chunks.
```python
import pandas as pd

df_iterator = pd.read_csv('data.csv', chunksize=1000)

for df in df_iterator:
    # do something with each chunk
    print(df.shape)
```
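A common pattern is to reduce each chunk as it arrives and combine only the results, so the full file never sits in memory at once. A minimal sketch (the filter on `col_int` is illustrative):

```python
import pandas as pd

chunks = []
for chunk in pd.read_csv('data.csv', dtype=dtypes_dict, chunksize=100_000):
    # Keep only the rows needed from each chunk (illustrative filter)
    chunks.append(chunk[chunk['col_int'] > 0])

# Combine the filtered chunks into one much smaller DataFrame
df = pd.concat(chunks, ignore_index=True)
```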