The joys of loading CSV files

There is no shortage of CSV files. Reading, parsing and loading these files into databases can be a full-time job in some cases, and there are a number of both open-source and SaaS solutions for loading CSV files into a database. But not all solutions are the same, and the devil is often in the details. Let's take a look at some of the common and less common aspects you need to handle.

  • File encoding
  • File- and folder patterns
  • Separator
  • Headers
  • Extra headers
  • Null values
  • Column type inference
  • Blank lines
  • Uniqueness
  • File name reuse
  • Quoting
  • Extra columns
  • Line breaks

Some of these are obvious to many, others are subtle. Let's go through them one by one.

File encoding

Most of us have argued with encodings, but we are also generally quite lucky nowadays to use UTF-8 most of the time. While some tools are able to guess the encoding of a file, the encoding is not authoritatively declared anywhere. Many Extract/Load tools either assume that the encoding is UTF-8, or require the encoding to be specified when setting up the job.

And of course, there are many interesting encodings. While UTF-8 and miscellaneous Latin-1 encodings are common in Europe, and I won't even guess about encodings common in Asia, there are also things like windows-1252, and utf-8-sig, which apparently is UTF-8 with a byte order mark.
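
As a quick illustration, here's a minimal Python/pandas sketch (the file name is made up) that reads with an explicit encoding, with a small fallback chain instead of guessing:

```python
# A minimal sketch, assuming pandas and a hypothetical "export.csv".
# "utf-8-sig" strips the byte-order-mark that would otherwise end up
# glued to the first column name.
import pandas as pd

df = pd.read_csv("export.csv", encoding="utf-8-sig")

# If the encoding is unknown, an explicit fallback chain beats guessing.
# Note that latin-1 decodes any byte sequence, so it acts as a catch-all.
for enc in ("utf-8-sig", "windows-1252", "latin-1"):
    try:
        df = pd.read_csv("export.csv", encoding=enc)
        break
    except UnicodeDecodeError:
        continue
```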

File- and folder patterns

One folder on one SFTP server (or wherever you may read from) might contain a bunch of different files that should be loaded into different tables, typically based on the file prefix. Being able to use a regex to select only certain files in the folder might be vital, especially if not all the files in there are CSV files. Imagine trying to parse some random PDF that made its way in there.
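
A rough Python sketch of that kind of filtering (the file names and prefix are made up):

```python
# Sketch: filter a folder listing with a regex so only the CSVs we expect
# (here, files starting with "orders_") get picked up. The listing itself
# could come from SFTP, S3, local disk or anywhere else.
import re

filenames = ["orders_2024-01-01.csv", "customers.csv", "random_report.pdf"]
pattern = re.compile(r"^orders_\d{4}-\d{2}-\d{2}\.csv$")

to_load = [name for name in filenames if pattern.match(name)]
# -> ['orders_2024-01-01.csv']
```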

Separator

Common field separators are comma (,), semicolon (;), tab (\t) and pipe (|). I haven't seen all that many uncommon field separators, but colon (:) is one. Specifying the correct separator probably isn't difficult, and some tools are even able to guess. But sometimes, there is a very limited list to choose from. And you might be out of luck.
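
If your tool can't guess, Python's csv.Sniffer can take a stab at it. A rough sketch, with a hypothetical file name and fallback:

```python
# Sketch: let csv.Sniffer guess the delimiter from a sample of the file,
# and fall back to an explicit separator if it can't.
import csv

with open("data.csv", newline="") as f:
    sample = f.read(4096)
try:
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    sep = dialect.delimiter
except csv.Error:
    sep = ";"  # whatever the producer claims to use
```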

Headers

Some files have a header row, others don't. And some files have many, as we'll get into. When reading CSV files into databases, the header usually becomes the column names. Actually, most tools require a header row and have no ability to just name columns from c1 to c100.
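
If you're stuck with a headerless file and a flexible enough loader, a small pre-processing step can generate the names. A sketch with pandas, assuming you know (or can probe) the column count:

```python
# Sketch: read a file without a header row and generate column names
# c1..cN ourselves. The file name and column count are made up.
import pandas as pd

n_cols = 100
df = pd.read_csv(
    "no_header.csv",
    header=None,
    names=[f"c{i}" for i in range(1, n_cols + 1)],
)
```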

Another issue with headers is special characters. Special characters often aren't valid in column names, so they need to be cleaned up. Even worse, if you are truly unlucky, the cleaned-up name clashes with another column name. Say, for instance, there are two columns that differ only in their special characters. Replacing special characters with underscores (_), which is common, would then lead to duplicate column names.

Related to both headers and separators, there might also be padding at the start and end of each field. This is fortunately rare, but it can make column names weird if the padding (usually spaces) isn't trimmed away. Dealing with extra spaces in the data itself is also annoying, and may cause a lot of fields to be cast as strings instead of numbers or other data types.
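
A rough sketch of the kind of header cleanup this implies (the example headers are made up):

```python
# Sketch: trim padding, replace special characters with underscores, and
# de-duplicate names that collapse into each other after cleaning.
import re

def clean_headers(raw_headers):
    cleaned, seen = [], {}
    for name in raw_headers:
        col = re.sub(r"[^0-9a-zA-Z_]", "_", name.strip())
        if col in seen:
            seen[col] += 1
            col = f"{col}_{seen[col]}"
        else:
            seen[col] = 0
        cleaned.append(col)
    return cleaned

clean_headers([" order id", "order-id ", "id"])
# -> ['order_id', 'order_id_1', 'id']
```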

Extra headers

Some files have extra lines with random data before the actual CSV file begins. This might be metadata in the form of extra column references, or perhaps a manifest specifying where and how the file was produced. In any case, before you can parse the file as a CSV, you have to skip a number of lines. Hopefully, you know how many lines beforehand, and the tool you use lets you specify a number of lines to skip.
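
With pandas, for instance, skipping a known preamble is a single parameter (file name and line count are made up):

```python
# Sketch: a file with three lines of manifest/metadata before the real
# header row. skiprows jumps past them before parsing starts.
import pandas as pd

df = pd.read_csv("report_with_preamble.csv", skiprows=3)
```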

Null values

Most sane CSV files simply don't write anything for null values - they are marked only by two separator signs in a row. But sometimes, nulls get written out as null or NULL or N/A. It might not be the end of the world that these get parsed as strings, but it creates a lot of cleanup.
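
A sketch of declaring those markers up front with pandas (the file name is made up):

```python
# Sketch: tell the reader which literal strings should be treated as
# nulls, so "NULL" and "N/A" don't survive as plain strings.
# (pandas already treats several of these as missing by default;
# na_values adds to that default list.)
import pandas as pd

df = pd.read_csv("data.csv", na_values=["null", "NULL", "N/A"])
```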

Column type inference

Most CSV readers try to infer the data type by sampling. Sometimes, though, the sampling doesn't catch the full variation of values, and the reader ends up declaring numbers where it should have chosen strings, or integers where it should have chosen decimals. This is practically bound to fail downstream.

Most systems do type inference by default, but some have the option to either turn it off, specify the schema manually, or review and correct the result of the auto-inference.

In extreme cases, automatic column type inference can cause data loss. The string 123456789.0123456789 would lose precision if cast as a floating point number, and is likely to turn up as roughly 123456789.01234567 in the target database due to floating point rounding. That's a couple of characters out the window, never to be seen or heard from again.
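
One way to sidestep this, sketched here with pandas and a hypothetical "amount" column, is to read everything as text and cast explicitly:

```python
# Sketch: skip inference entirely by reading every column as a string,
# then cast only the columns you actually understand.
import pandas as pd
from decimal import Decimal

df = pd.read_csv("data.csv", dtype=str)
# Decimal keeps all the digits of 123456789.0123456789, unlike float64.
df["amount"] = df["amount"].map(Decimal)
```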

Blank lines

There are two types of blank lines. The first originates in the source system and keeps the field separators - the entire line is just a bunch of commas or similar. This might be a problem because the key column is null, which is hardly compatible with uniqueness (if that is configured).

The other type is the entirely blank line - these might come from processes that append lines to existing files, leaving additional blank lines at the end. CSV parsers might have trouble with these files, and your loading process might fail.
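
For what it's worth, the two cases can be handled separately; a pandas sketch:

```python
# Sketch: pandas drops fully blank lines by default (skip_blank_lines=True),
# but lines that are just separators (",,,,") come through as all-null rows
# and have to be filtered out explicitly.
import pandas as pd

df = pd.read_csv("data.csv")   # entirely blank lines are skipped here
df = df.dropna(how="all")      # drop the ",,,," rows afterwards
```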

Uniqueness

You may or may not care about uniqueness, but in either case you need to be aware of it. Some systems explicitly require some definition of uniqueness, others use file name + line number as the unique key, and still others allow you to forego uniqueness altogether.

If your files are written once and never updated thereafter, uniqueness might not matter much. You read any files that have been added and append them to your target database, pretty much no matter how the key is specified (as long as you make sure you don't specify a key that isn't actually unique).

Often, the unique key is only unique to that particular file. In order to get global uniqueness, you need to specify the file name to be part of the key. The file name is often included as a meta-column, so being able to refer to this as part of the unique key is probably important.

Using file name + line number as the unique key might seem scary, but it is actually a very common approach, and there is only one edge case where it might fail: if the file is practically rewritten - updated in place with rows removed - you may end up with duplicate values on business keys.
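
Here's a sketch of stamping that composite key onto the rows while loading (the file name is made up):

```python
# Sketch: build a composite key from the file name and the line number,
# so rows stay unique across files even when the business key repeats.
import pandas as pd

filename = "orders_2024-01-01.csv"
df = pd.read_csv(filename)
df["_source_file"] = filename
df["_line_number"] = df.index + 2  # +2: 1-based, plus the header row
# (_source_file, _line_number) is now a workable unique key
```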

File name reuse

Sometimes, the file to load doesn't change name, only content. Think of something like a file named data.csv which gets replaced once a day. In that case, you probably want to append to the target table. This works nicely if you are able to specify that there is no unique key, or to specify explicitly that your table should only be appended to.
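
A sketch of that append-only pattern with pandas and SQLAlchemy (connection string and table name are made up):

```python
# Sketch: append each day's data.csv to the same table instead of
# replacing the table's contents.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/warehouse")
df = pd.read_csv("data.csv")
df.to_sql("daily_data", engine, if_exists="append", index=False)
```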

Quoting

Text fields should be quoted to make sure commas, semicolons or other signs are not interpreted as field delimiters. But when fields get quoted, any quote signs inside the text should be escaped so that they aren't interpreted as the end of the text field. Whatever system produces the CSV file should be able to quote and escape it appropriately. If not, it's an error waiting to happen.
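
For reference, this is roughly what well-behaved quoting and escaping looks like with Python's csv module:

```python
# Sketch: quote every non-numeric field and escape embedded quotes by
# doubling them (the csv module's default escaping behaviour).
import csv

rows = [["id", "comment"], [1, 'She said "hello", then left']]
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)
# -> 1,"She said ""hello"", then left"
```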

Extra columns

The number of columns in the header and the number of columns in the rows should match, but there are no guarantees. What should happen if there are extra columns in some rows? Situations like this might happen if a text field isn't quoted appropriately, and a comma in a free-text field gets interpreted as a field separator. Or, in Europe, if commas are used both as field separator and a decimal mark on floating point numbers.

This is a bad situation no matter what, but you should be aware of it and have thoughts about what you would want to happen. Maybe you actually prefer that the pipeline fails in those cases. Or maybe the priority is to capture all the data and try to figure out the mess later.
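
With pandas, for example, that decision is a single parameter; a sketch, assuming pandas 1.3 or newer:

```python
# Sketch: decide up front what should happen to rows with too many fields.
import pandas as pd

df = pd.read_csv("data.csv", on_bad_lines="error")  # fail loudly (the default)
# on_bad_lines="warn" reports the offending rows and drops them
# on_bad_lines="skip" drops them silently
```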

Line breaks

Free-text fields especially might contain line breaks, which is probably OK if the field is quoted - but check to make sure. If the field isn't quoted, those line breaks are interpreted as new rows, which will also mess with the column count. No matter how the loader handles line breaks in unquoted fields, they will be a problem.

As long as the fields are quoted and the loader can handle it, the only thing you need to keep in mind is that this can mess with row counts. The number of rows in the file will be higher than the number of records in the resulting table, and that's how it should be.
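
A tiny sketch to illustrate why the counts differ:

```python
# Sketch: a quoted field containing a line break parses as one record,
# so the file has more lines than the resulting table has rows.
import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(data.splitlines()))  # 3 lines in the file
print(len(rows) - 1)           # 1 record (excluding the header)
```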

Conclusion

Hopefully you won't stumble into all of these issues at once, but when evaluating an extract/load system for robustness, these are some of the issues you should be aware of and test for.

And I promise this isn't an exhaustive list.
