We were debugging an issue with a client’s CSV data dump, when we encountered a problem. This data dump contains bibliographic metadata, and those of you who work with bibliographic metadata will know, you can find pretty much anything in there: special characters, strange markup, html, you name it. This post tells you about an issue we found with the number of ways you can say "newline", and a quick pragmatic solution to mitigating that in your CSVs.
We had written our data out to CSV, and were now reading it back in again. Strangely, some of the lines were malformed - they had been truncated part way through without any obvious reason.
Looking at the file in LibreOffice didn’t help - it looked just fine. So we whittled the data down to one of the problem rows, and looking at it on our python command line could see a possible culprit:
u'Publications: \u2029A Case Report'
Nestled in the middle there is
\u2029 a unicode character that you’re unlikely to be familiar with. It’s the character for a Paragraph Separator, and it’s not entirely insane to find that in an article title (we found about 10 per 100,000 titles during this process). You can even see what it means; it’s intended to provide layout for that title, so it looks like this:
Publications: A Case Study
We had written the data using the python csv writer like this:
f = open(file, "wb", "utf-8") writer = UnicodeWriter(f) for article in articles: writer.writerow([article.id, article.title])
Here UnicodeWriter is a basic wrapper for the csv writer which ensures you are writing everything out encoded correctly.
Next, we’re reading the data like this:
f = open(file, "rb", "utf-8") reader = UnicodeReader(f) for row in reader: # do stuff
(UnicodeReader is just the inverse of the UnicodeWriter above, see the same link for details)
When we look at the rows inside the loop, we see some of them contain truncated content, and some contain the truncated part of the row above. It’s obviously truncating at our Paragraph Separator, but why? Newlines are common enough in CSV cells, and they are handled seamlessly, why is this one special?
Well, there are a ton of ways to say “newline”. If you thought it was just
\r\n you’d be wrong.
\u2029 is a way of saying newline. So is
\u2028 . So is
\u0085. In fact, there are enough ways that we guessed the csv library bundled with python doesn’t know them all, and indeed it doesn’t.
The csv reader cares about newlines, but it relies entirely on your operating system to handle them. That is, it just uses
readline to get the next line in your file when reading it in. And your operating system really does understand all the newline options, and it will terminate the line and hand it back as soon as it hits one.
The csv reader is clever enough to understand that data in the cells might legitimately contain newlines, it doesn’t just assume we’re finished each time readline gives it a line. Because the reader is a state machine, it knows whether we’re inside a cell and if the cell contains a newline it just adds that to the cell’s value. It does that by looking for the delimiter at the start of the cell. If one exists, it won’t give up on reading new lines until it reaches the closing delimiter. If no delimiter exists, though, newline is what it looks for to terminate the cell and the row.
So if your CSV looks like this:
cell 1,cell 2,"a cell\n with newlines"
Your CSV reader will understand that this is one row. But if it looks like this:
cell 1,cell2,a cell\n with newlines
It will interpret it as two.
Our real problem is that the csv writer doesn’t detect these newlines (it’ll get
\n because that’s obvious, but not the others), and by default it doesn’t quote the cell contents unless it spots a special character. It’ll happily write your Paragraph Separator newline character into the file in an unquoted string, and your equivalent reader will not be able to read the row back. That is, by default the Python CSV reader is not symmetrical; it cannot necessarily read data that it wrote!
So, this CSV would work:
article_id,article_title 123456790,"This title has a\u2029 Paragraph separator"
While this would not:
article_id,article_title 123456790,This title has a\u2029 Paragraph separator
As the csv writer does not recognise
\u2029 as a newline character, it does not wrap the quotes around the string like it would if it contained
The fix, once you know this, is easy; when constructing your writer in python you should do it like this:
writer = UnicodeWriter(f, quoting=csv.QUOTE_ALL)
This ensures that your csv cells are ALL quoted, so whatever weird characters wind up in there, you won’t break the file structure. This is generally a good idea even if you are not using Python.
There are just a couple of down-sides, which might matter depending on your context:
This is not the default behaviour for the csv writer, so you have to remember to do it every time you think you’re going to deal with this kind of messy data.
It means everything is quoted, which adds unnecessary size to your file.
When we encountered this problem, there was really no discussion of it on any forums that we could find, so hopefully this will help you if you encounter the same issue.
This post was originally published on Medium in June 2018