3 Ways to Handle non UTF-8 Characters in Pandas

So we've all gotten that error: you download a CSV from the web, or your manager emails it to you wanting analysis done ASAP, and you find a card in your Kanban labelled URGENT AFF, so you open up VSCode, import Pandas, and type the following: pd.read_csv('some_important_file.csv').

Now, instead of the actual import happening, you get the following, nearly uninterpretable stacktrace:

[Image: an unintelligible stacktrace]

What does that even mean?! And what the heck is UTF-8? As a brief primer/crash course: your computer (like all computers) stores everything as bits (series of ones and zeros). In order to represent human-readable things (think letters) with ones and zeros, the American Standards Association came up with the ASCII mappings. These map binary patterns to codes (in base-10, so numbers) which represent various characters. For example, 00111111 is the binary for 63, which is the code for ?.
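
You can verify this mapping yourself in Python with the built-in ord, chr and format functions (a quick sketch):

# '?' sits at code 63 in the ASCII table
print(ord('?'))           # 63
print(chr(63))            # ?
print(format(63, '08b'))  # 00111111, the byte from above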

These characters then come together to form words, which form sentences. The number of unique characters that ASCII can handle is limited by the number of unique byte values available; to summarize: 8 bits allow for only 256 unique characters, which is nowhere close to handling every single character from every single language. This is where Unicode comes in; Unicode assigns a "code point" in hexadecimal to each character. For example, U+1F602 maps to 😂. This way there are over a million possible code points, far broader than the original ASCII.
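
The same built-ins work for any Unicode code point, not just ASCII:

# U+1F602 is the hexadecimal code point for the emoji above
print(chr(0x1F602))    # 😂
print(hex(ord('😂')))  # 0x1f602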

UTF-8

UTF-8 translates Unicode characters to a unique binary string, and vice versa. However, UTF-8, as its name suggests, uses 8-bit words (similar to ASCII), and a variable number of them per character, to save memory. This is similar to a technique known as Huffman coding, which represents the most-used characters or tokens with the shortest words. It is intuitive in the sense that we can afford to assign the least-used characters to longer byte sequences, since they turn up less often. If every character were encoded in 4 bytes instead, every text file you have would take up to four times the space.
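
You can see this variable-width behaviour directly by encoding a few characters and counting the bytes each one takes:

# common ASCII characters fit in a single byte...
print(len('a'.encode('utf-8')))   # 1
# ...while rarer characters take progressively more
print(len('é'.encode('utf-8')))   # 2
print(len('€'.encode('utf-8')))   # 3
print(len('😂'.encode('utf-8')))  # 4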

Caveat

However, not every file you receive is UTF-8: there are other encodings out there (such as UTF-16, Latin-1 and a zoo of legacy code pages), and this raises a key limitation, especially in the field of data science: sometimes we either don't need the non-UTF-8 characters, can't process them, or need to save on space. Therefore, here are three ways I handle non-UTF-8 characters for reading into a Pandas dataframe:

Find the Correct Encoding Using Python

Pandas, by default, assumes UTF-8 encoding every time you call pandas.read_csv, and figuring out the correct encoding can feel like staring into a crystal ball. Your first bet is to use vanilla Python:

# open the file and print the wrapper object; its repr includes
# the encoding Python is using to read the file
with open('file_name.csv') as f:
    print(f)

Most of the time, the output resembles the following:

<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf16'>
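
(Strictly speaking, the encoding shown in this repr is the default Python picked, usually from your locale, rather than something sniffed from the file itself, so treat it as a hint.) If you'd rather brute-force the answer, a minimal sketch is to try a handful of candidate encodings until one parses; the candidate list below is purely an assumption, so add whichever encodings you suspect:

import pandas as pd

# candidate encodings to try; an assumption, extend as needed
candidates = ['utf-8', 'utf-16', 'cp1252', 'latin-1']

for enc in candidates:
    try:
        df = pd.read_csv('file_name.csv', encoding=enc)
        print(f'parsed successfully with {enc}')
        break
    except (UnicodeDecodeError, UnicodeError):
        print(f'{enc} failed')

# note: latin-1 accepts any byte sequence, so it always "succeeds"
# (possibly producing mojibake); keep it last in the list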


If that fails, we can move on to the second option.

Find the Encoding Using Python's chardet

chardet is a character-encoding detection library; once installed (pip install chardet), you can use the following to determine a file's encoding:

import chardet

# chardet works on raw bytes, so open the file in binary mode
with open('file_name.csv', 'rb') as f:
    print(chardet.detect(f.read()))

The output should resemble the following:

{'encoding': 'EUC-JP', 'confidence': 0.99}
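
Whatever chardet reports can be handed straight to pandas (a quick sketch; the file name is just a placeholder):

import chardet
import pandas as pd

# detect on the raw bytes, then pass the detected encoding to read_csv
with open('file_name.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = pd.read_csv('file_name.csv', encoding=result['encoding'])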

Finally

The last option is using the Linux CLI (fine, I lied when I said three methods using Pandas):

iconv -f utf-8 -t utf-8 -c filepath -o CLEAN_FILE
  1. The first utf-8 (after -f) defines what we think the original file's encoding is
  2. -t is the target encoding we wish to convert to (in this case utf-8)
  3. -c skips invalid sequences
  4. -o writes the fixed file to an actual filepath (instead of the terminal)
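
If you're not on Linux, a rough Python equivalent of this cleanup (a sketch that, like -c, silently drops the invalid bytes) is:

# read the raw bytes, drop anything that isn't valid UTF-8, rewrite
with open('filepath', 'rb') as src:
    text = src.read().decode('utf-8', errors='ignore')

with open('CLEAN_FILE', 'w', encoding='utf-8') as dst:
    dst.write(text)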

Now that you have your encoding, you can go on to read your CSV file successfully by specifying it in your read_csv call, like so:

pd.read_csv("some_csv.txt", encoding="utf-16")  # swap in whatever encoding you found
