Identifying and Removing Hidden Characters

#learning #elixir #thisweekilearned #unicode

I came across an interesting problem this week. There was a hidden character in a CSV preventing me from successfully performing a batch data insert in the Elixir/Phoenix project I've been working on.

The TL;DR: You can use the tool xxd to help identify hidden characters. You can then use the pbcopy and pbpaste shell commands to eliminate that hidden character.

As I said, I was trying to ingest a CSV and batch insert that block of data into a table. You could say CSV imports and exports are the backbone of this project, so we have some pretty established patterns established. The brunt of the import work is handled with the NimbleCSV library for Elixir. When it's configured property, NimbleCSV will parse out the CSV, and you can then pipe that into an Enum.each, and essentially insert each line of the CSV as a full changeset over and over. The whole function looks like this:

When I tried to execute this data insertion I ran into this error:

The error stated that I couldn't insert that record because the vendor_id attribute could not be blank. And vendor_id was blank because when it was being extracted as the key from the CSV it was coming in with extra characters prepended to it: \uFEFFvendor_id. The keys of the param struct were defined by the headers of the CSV, so that meant, at some point those characters had been included to the vendor_id column title. It just so happened that vendor_id was the first column and header of the file, so if something like this was going to happen, I'm not surprised it was happening here.

First thing I did was look at the CSV in VSCode, and it looked fine there. Then I opened it in Excel, but again, it looked perfectly fine. I figured I should probably see exactly what these extra characters were, since it was obviously not going to be a simple fix. A detailed answer can be found here, but it's basically a Unicode Byte Order Mark, which is included at the beginning of a file and tells GUIs how to interpret the coding of the rest of the data. This makes sense as to why I was not seeing it as an extra character in any of the software I was using, but also a bit of a head scratcher. How could I remove a Unicode character if I couldn't see it in any GUIs I was using?

Luckily I have awesome colleagues, and one suggested some tools via the terminal. We started with xxd, which gives a hexdump of a given file. I executed this command:

And got a rather large output, but this was the first line:

side note: You can execute xxd more simply, like this:

But that outputs the hexdump of the entire file. By using head -n 4 and piping the file path to xxd I was telling it to only return the first 4 lines of the file

Finally, I was able to see it! It showed as efbb and the ... in the readout to the right (which wasn't showing in any GUI I had been using). Now, how to remove it? It turned out that OS X shell comes with pbcopy/pbpaste commands built in. In addition to being useful copy and paste tools in the command line, they also strip out this type of stuff. Unfortunately, I can't find the search result where we realized this, but I know it worked in my case. These are the two commands I executed:

Then I ran the file through xxd again to confirm, and this was the first line of the output:

The unicode value was indeed removed. I took that updated CSV, ran it through the batch insertion function again, and got no errors.

This post is part of an ongoing This Week I Learned series. I welcome any critique, feedback, or suggestions in the comments.

DEV Community

Identifying and Removing Hidden Characters

Top comments (0)

Read next

What is Docker? Advantages, Disadvantages, and its Role in AI Applications

Why You Shouldn’t Use Libraries Blindly: A Lesson in Bundle Size

Why your brain isn't broken (but documentation might be) - a tab hoarder's musings on learning

LINUX GUIDE USING UBUNTU