Test Parquet float16 Support in Pandas

#python #pandas #datascience

Background

As I am on leave, I was looking for open source projects to contribute to, and while checking the Pandas issue tracker I found this open issue. It got me exploring what the root cause might be and whether it can be fixed at some level in Pandas itself. The fix is still pending, but I found the likely cause and a possible workaround.

Overview

In this example I test Pandas support for Parquet, and in particular the bug reported in https://github.com/pandas-dev/pandas/issues/44846

Issue

BUG: Parquet format does not support saving float16 columns

Reproducible Example

import pandas as pd
import numpy as np

data = np.arange(2, 10, dtype=np.float16)
df = pd.DataFrame(data=data, columns=['fp16'])
df.to_parquet('./fp16.parquet')

Issue Description

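Pandas does not check for float16 columns in a DataFrame before writing, even though the Parquet format does not support saving float16 values. The write therefore fails deep inside PyArrow rather than with a clear error from Pandas.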
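
To make the missing validation concrete, here is a minimal sketch of the kind of check one could run before calling to_parquet. The helper name find_float16_columns is hypothetical and is not part of Pandas; it only inspects the dtypes and writes nothing.

```python
import numpy as np
import pandas as pd


def find_float16_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is float16 (hypothetical helper)."""
    return [col for col, dtype in df.dtypes.items() if dtype == np.float16]


# Example frame with one float16 column and one float32 column
df = pd.DataFrame({
    "fp16": np.arange(2, 10, dtype=np.float16),
    "fp32": np.arange(2, 10, dtype=np.float32),
})

bad_cols = find_float16_columns(df)
if bad_cols:
    # Report the columns that are likely to trigger the halffloat error
    print(f"float16 columns that may fail in to_parquet: {bad_cols}")
```

A check like this only reports the offending columns; what to do with them is left to the caller.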

Sample exception

Traceback (most recent call last):
  File "test_parquet_float16.py", line 6, in <module>
    df.to_parquet('./fp16.parquet')
  File "/home/priyab/.conda/envs/airflow/lib/python3.8/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/home/priyab/.conda/envs/airflow/lib/python3.8/site-packages/pandas/core/frame.py", line 2677, in to_parquet
    return to_parquet(
  File "/home/priyab/.conda/envs/airflow/lib/python3.8/site-packages/pandas/io/parquet.py", line 416, in to_parquet
    impl.write(
  File "/home/priyab/.conda/envs/airflow/lib/python3.8/site-packages/pandas/io/parquet.py", line 194, in write
    self.api.parquet.write_table(
  File "/home/priyab/.conda/envs/airflow/lib/python3.8/site-packages/pyarrow/parquet.py", line 1782, in write_table
    with ParquetWriter(
  File "/home/priyab/.conda/envs/airflow/lib/python3.8/site-packages/pyarrow/parquet.py", line 614, in __init__
    self.writer = _parquet.ParquetWriter(
  File "pyarrow/_parquet.pyx", line 1385, in pyarrow._parquet.ParquetWriter.__cinit__
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: halffloat

Reason

Tickets are already open in the PyArrow and Parquet projects to support halffloat (float16), as the issues below show:

https://github.com/apache/arrow/issues/2691
https://issues.apache.org/jira/browse/ARROW-7242
https://issues.apache.org/jira/browse/PARQUET-1647

From what I can see, however, nothing has been done on the Parquet issue to address this, and so the limitation bubbles up to PyArrow and Pandas.

Workaround

For now, the workaround is to cast any float16 column to float32 before writing.

Code Example

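# Collect the names of all float16 columns
float16_cols = list(df.select_dtypes(include=['float16']).columns)
# Map each float16 column to float32, the type the workaround targets
new_types = {col: 'float32' for col in float16_cols}
df = df.astype(new_types)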
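
The same workaround can be wrapped in a small helper so it is easy to apply right before a write. This is only a sketch: the name upcast_float16 is my own and not a Pandas API, and the actual to_parquet call is left as a comment because it needs a Parquet engine such as pyarrow installed.

```python
import numpy as np
import pandas as pd


def upcast_float16(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every float16 column cast to float32 (hypothetical helper)."""
    float16_cols = df.select_dtypes(include=["float16"]).columns
    # Other columns are left untouched; an empty mapping simply yields a copy
    return df.astype({col: "float32" for col in float16_cols})


df = pd.DataFrame({
    "fp16": np.arange(2, 10, dtype=np.float16),
    "ints": np.arange(8),
})

safe_df = upcast_float16(df)
print(safe_df.dtypes)

# With a Parquet engine installed, the converted frame can then be written:
# safe_df.to_parquet("./fp32.parquet")
```

Every float16 value is exactly representable as float32, so the cast loses no precision; the cost is that the column takes twice the memory and is read back as float32.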

Motivation

More often than not I use Pandas to solve a quick data analysis problem. I know it's not perfect, but in many cases it gets the job done. This particular issue can also show up when the dtype is inferred while reading data and ends up as float16. The workaround is only suitable when you don't need the values stored as float16 in the Parquet file.

Conclusion

The only reason I did not look at fixing the issue in Parquet or in the PyArrow implementation is the complexity of touching the base implementation to add support for a new data type. Since a workaround exists, I would rather live with it at the API consumption layer for now.