
wrighter

Posted on • Originally published at wrighters.io on

Converting types in Pandas

Pandas is great for dealing with both numerical and text data. In most projects you’ll need to clean up and verify your data before analysing or using it for anything useful. Data might be delivered in databases, CSV or other data file formats, web scraping results, or even entered manually. Once you have loaded data into pandas, you’ll likely need to convert it to a type that makes the most sense for what you are trying to accomplish. In this post, I’m going to review the basic datatypes in pandas and how to safely and accurately convert data.

DataFrame and Series

First, let’s review the basic container types in pandas, Series and DataFrame. A Series is a one dimensional labeled array of data, backed by a NumPy array. A DataFrame is a two-dimensional structure that consists of multiple Series columns that share an index. A Series has a data type, referenced as dtype, and all elements in that Series will share the same type.
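To make this concrete, here’s a minimal sketch (the names prices and df are just for illustration):

```python
import pandas as pd

# a Series is a labeled one-dimensional array with a single dtype
prices = pd.Series([1.5, 2.0, 3.25], name="price")
print(prices.dtype)  # float64

# a DataFrame is a set of Series columns sharing one index;
# each column has its own dtype
df = pd.DataFrame({"price": [1.5, 2.0], "count": [3, 4]})
print(df.dtypes)
```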

But what types?

The data type can be a core NumPy datatype, which means it could be a numerical type, or Python object. But the type can also be a pandas extension type, known as an ExtensionDtype. Without getting into too much detail, just know two very common examples are the CategoricalDtype and, in pandas 1.0+, the StringDtype. For now, what’s important to remember is that all elements in a Series share the same type.
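A quick way to see the difference (this sketch assumes pandas 1.0+ for the string dtype):

```python
import pandas as pd

# a core NumPy-backed dtype
numeric = pd.Series([1, 2, 3])
print(numeric.dtype)  # int64

# pandas extension dtypes
strings = pd.Series(["a", "b"], dtype="string")  # StringDtype, pandas 1.0+
print(strings.dtype)  # string

colors = pd.Series(["Black", "Red"], dtype="category")  # CategoricalDtype
print(colors.dtype)  # category
```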

What’s important to realize is that when constructing a Series or a DataFrame, pandas will pick the datatype that can represent all values in the Series (or DataFrame). Let’s look at an example to make this clearer. Note, this example was run using pandas version 1.1.4.

>>> import pandas as pd
>>> s = pd.Series([1.0, 'N/A', 2])
>>> s
0 1
1 N/A
2 2
dtype: object

As you can see, pandas has chosen the object dtype for my Series, since that can represent values that are floating point numbers, strings, and integers. The individual items in this Series are each of a different type in this case, but all can be represented as objects.

>>> print(type(s[0]))
<class 'float'>
>>> print(type(s[1]))
<class 'str'>
>>> print(type(s[2]))
<class 'int'>

So, what’s the problem?

The problem with using object for everything is that you rarely want to work with your data this way. Looking at this first example, if you had imported this data from a text file you’d most likely want it to be treated as numerical, and perhaps calculate some statistical values from it.

>>> try:
... s.mean()
... except Exception as ex:
... print(ex)
...
unsupported operand type(s) for +: 'float' and 'str'

It’s clear here that the mean function fails because it’s trying to add up the values in the Series and cannot add the ‘N/A’ to the running sum of values.

So how do we fix this?

Well, we could inspect the values and convert them by hand or using some other logic, but luckily pandas gives us a few options to do this in a sensible way. Let’s go through them all.
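For reference, a hand-rolled version of that manual approach might look like this sketch, using a hypothetical to_float_or_nan helper:

```python
import pandas as pd

s = pd.Series([1.0, 'N/A', 2])

def to_float_or_nan(value):
    """Hypothetical helper: parse as float, treat anything else as missing."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float('nan')

cleaned = s.map(to_float_or_nan)
print(cleaned.mean())  # 1.5 -- mean() skips NaN by default
```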

astype

First, you can try to use astype to convert values. astype is limited, however, because if it cannot convert a value it will either raise an error or return the original value. Because of this, it cannot completely help us in this situation.

>>> try:
... s.astype('float')
... except Exception as ex:
... print(ex)
...
could not convert string to float: 'N/A'

But astype is very useful, so before moving on, let’s look at a few examples where you would use it. First, if your data was all convertible between types, it would do just what you want.

>>> s2 = pd.Series([1, "2", "3.4", 5.5])
>>> print(s2)
0 1
1 2
2 3.4
3 5.5
dtype: object
>>> print(s2.astype('float'))
0 1.0
1 2.0
2 3.4
3 5.5
dtype: float64

Second, astype is useful for saving space in Series and DataFrames, especially when you have repeated values that can be expressed as categoricals. Categoricals can save memory and also make data a little more readable during analysis, since they tell you all the possible values. For example:

>>> s3 = pd.Series(["Black", "Red"] * 1000)
>>>
>>> s3.astype('category')
0 Black
1 Red
2 Black
3 Red
4 Black
        ...
1995 Red
1996 Black
1997 Red
1998 Black
1999 Red
Length: 2000, dtype: category
Categories (2, object): ['Black', 'Red']
>>>
>>> print("String:", s3.memory_usage())
String: 16128
>>> print("Category:", s3.astype('category').memory_usage())
Category: 2224
>>>

You can also save space by using smaller NumPy types.

>>> s4 = pd.Series([22000, 3, 1, 9])
>>> s4.memory_usage()
160
>>> s4.astype('int8').memory_usage()
132

But note there is an error above! astype will happily convert numbers that don’t fit in the new type without reporting the error to you.

>>> s4.astype('int8')
0 -16
1 3
2 1
3 9
dtype: int8
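One way to guard against this silent overflow, sketched here using NumPy’s iinfo bounds, is to check that every value fits before casting:

```python
import numpy as np
import pandas as pd

s4 = pd.Series([22000, 3, 1, 9])

# int8 can only hold -128..127, so 22000 would silently wrap around
info = np.iinfo(np.int8)
fits = s4.between(info.min, info.max).all()
print(fits)  # False, so an int8 cast would corrupt the data
```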

Note that you can also use astype on DataFrames, even specifying a different dtype for each column:

>>> df = pd.DataFrame({'a': [1,2,3.3, 4], 'b': [4, 5, 2, 3], 'c': ["4", 5.5, "7.09", 1]})
>>> df.astype('float')
     a b c
0 1.0 4.0 4.00
1 2.0 5.0 5.50
2 3.3 2.0 7.09
3 4.0 3.0 1.00
>>> df.astype({'a': 'uint', 'b': 'float16'})
   a b c
0 1 4.0 4
1 2 5.0 5.5
2 3 2.0 7.09
3 4 3.0 1

to_numeric (or to_datetime or to_timedelta)

There are a few better options available in pandas for converting one-dimensional data (i.e. one Series at a time). These methods provide better error handling than astype through the optional errors and downcast parameters. Take a look at how to_numeric deals with the first Series created in this post. Passing errors='coerce' will turn any conversion failure into NaN. Passing errors='ignore' will get the same behavior we had available in astype, returning our original input. Likewise, passing errors='raise' (the default) will raise an exception.

>>> pd.to_numeric(s, errors='coerce')
0 1.0
1 NaN
2 2.0
dtype: float64
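For comparison, here’s a small sketch of coerce next to the default raise behavior:

```python
import pandas as pd

s = pd.Series([1.0, 'N/A', 2])

# errors='coerce': unparseable values become NaN
coerced = pd.to_numeric(s, errors='coerce')
print(coerced.tolist())  # [1.0, nan, 2.0]

# errors='raise' (the default): the first bad value raises a ValueError
try:
    pd.to_numeric(s)
except ValueError as ex:
    print(ex)
```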

And if we want to save some space, we can safely downcast to the minimum size that will hold our data without errors (here we get int16, whereas without downcasting we’d get int64).

>>> pd.to_numeric(s4, downcast='integer')
0 22000
1 3
2 1
3 9
dtype: int16
>>> pd.to_numeric(s4).dtype
dtype('int64')

The to_datetime and to_timedelta methods will behave similarly, but for dates and timedeltas.

>>> pd.to_timedelta(['2 days', '5 min', '-3s', '4M', '1 parsec'], errors='coerce')
TimedeltaIndex([ '2 days 00:00:00', '0 days 00:05:00', '-1 days +23:59:57',
                  '0 days 00:04:00', NaT],
               dtype='timedelta64[ns]', freq=None)
>>> pd.to_datetime(['11/1/2020', 'Jan 4th 1919', '20200930 08:00:31'])
DatetimeIndex(['2020-11-01 00:00:00', '1919-01-04 00:00:00',
               '2020-09-30 08:00:31'],
              dtype='datetime64[ns]', freq=None)
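Both functions accept the same errors parameter, and to_datetime also takes an explicit format string when the input is ambiguous; a small sketch:

```python
import pandas as pd

# errors='coerce' turns unparseable dates into NaT
dates = pd.to_datetime(['11/1/2020', 'not a date'], errors='coerce')
print(dates[1])  # NaT

# an explicit format removes day-first/month-first ambiguity
parsed = pd.to_datetime('01/11/2020', format='%d/%m/%Y')
print(parsed)  # 2020-11-01 00:00:00
```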

Since these functions are all for 1-dimensional data, you will need to use apply on a DataFrame. For instance, to downcast all the values to the smallest possible floating point size, use the downcast parameter.

>>> from functools import partial
>>> df.apply(partial(pd.to_numeric, downcast='float')).dtypes
a float32
b float32
c float32
dtype: object

infer_objects

If you happen to have a pandas object that consists of objects that haven’t been converted yet, both Series and DataFrame have a method that will attempt to convert those objects to the most sensible type. To see this, you have to use a somewhat contrived example, because pandas will attempt to convert objects when you create them. For example:

>>> pd.Series([1, 2, 3, 4], dtype='object').infer_objects().dtype
int64
>>> pd.Series([1, 2, 3, '4'], dtype='object').infer_objects().dtype
object
>>> pd.Series([1, 2, 3, 4]).dtype
int64

You can see here that if the Series happens to have all numerical types (in this case integers) but they are stored as objects, it can figure out how to convert these to integers. But it doesn’t know how to convert the ‘4’ to an integer. For that, you need to use one of the techniques from above.
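One way to handle that mixed case is to fall back to to_numeric, which does parse the string:

```python
import pandas as pd

s = pd.Series([1, 2, 3, '4'], dtype='object')
print(s.infer_objects().dtype)  # object -- the '4' blocks inference
print(pd.to_numeric(s).dtype)   # int64 -- to_numeric parses the '4'
```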

convert_dtypes

This method is new in pandas 1.0, and can convert to the best possible dtype that supports pd.NA. Note that this will be the pandas dtype versus the NumPy dtype (i.e. Int64 instead of int64).

>>> pd.Series([1, 2, 3, 4], dtype='object').convert_dtypes().dtype
Int64
>>> pd.Series([1, 2, 3, '4'], dtype='object').convert_dtypes().dtype
object
>>> pd.Series([1, 2, 3, 4]).convert_dtypes().dtype
Int64

What should you use most often then?

What I recommend doing is looking at your raw data once it is imported. Depending on your data source, it may already be in the dtype that you want. But once you need to convert it, you have all the tools you need to do this correctly. For numeric types, the pd.to_numeric method is best suited for doing this conversion in a safe way, and with wise use of the downcast parameter, you can also save space. Consider using astype("category") when you have repeated data to save some space as well. The convert_dtypes and infer_objects methods are not going to be that helpful in most cases, unless you somehow have data stored as objects that is readily convertible to another type. Remember, there’s no magic function in pandas that will ensure you have the best data type for every case; you need to examine and understand your own data to use or analyze it correctly. But knowing the best way to do that conversion is a great start.
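Putting that together, a typical first pass after import might look like this sketch (df here is an illustrative frame, not real data):

```python
import pandas as pd

# illustrative frame: everything arrives as strings (object dtype)
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['x', 'y', 'x']})
print(df.dtypes)  # both columns are object

# numbers: to_numeric with downcast picks the smallest safe integer type
df['a'] = pd.to_numeric(df['a'], downcast='integer')
# repeated labels: a category saves space
df['b'] = df['b'].astype('category')
print(df.dtypes)  # a is int8, b is category
```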
