We were debugging an issue with a clientβs CSV data dump, when we encountered a problem. This data dump contains bibliographic metadata, and those of you who work with bibliographic metadata will know, you can find pretty much anything in there: special characters, strange markup, html, you name it. This post tells you about an issue we found with the number of ways you can say "newline", and a quick pragmatic solution to mitigating that in your CSVs.
We had written our data out to CSV, and were now reading it back in again. Strangely, some of the lines were malformed - they had been truncated part way through without any obvious reason.
Looking at the file in LibreOffice didnβt help - it looked just fine. So we whittled the data down to one of the problem rows, and looking at it on our python command line could see a possible culprit:
u'Publications: \u2029A Case Report'
Nestled in the middle there is \u2029
a unicode character that youβre unlikely to be familiar with. Itβs the character for a Paragraph Separator, and itβs not entirely insane to find that in an article title (we found about 10 per 100,000 titles during this process). You can even see what it means; itβs intended to provide layout for that title, so it looks like this:
Publications:
A Case Study
We had written the data using the python csv writer like this:
f = open(file, "wb", "utf-8")
writer = UnicodeWriter(f)
for article in articles:
writer.writerow([article.id, article.title])
Here UnicodeWriter is a basic wrapper for the csv writer which ensures you are writing everything out encoded correctly.
Next, weβre reading the data like this:
f = open(file, "rb", "utf-8")
reader = UnicodeReader(f)
for row in reader:
# do stuff
(UnicodeReader is just the inverse of the UnicodeWriter above, see the same link for details)
When we look at the rows inside the loop, we see some of them contain truncated content, and some contain the truncated part of the row above. Itβs obviously truncating at our Paragraph Separator, but why? Newlines are common enough in CSV cells, and they are handled seamlessly, why is this one special?
Well, there are a ton of ways to say βnewlineβ. If you thought it was just \n
or \r\n
youβd be wrong. \u2029
is a way of saying newline. So is \u2028
. So is \u0085
. In fact, there are enough ways that we guessed the csv library bundled with python doesnβt know them all, and indeed it doesnβt.
The csv reader cares about newlines, but it relies entirely on your operating system to handle them. That is, it just uses readline
to get the next line in your file when reading it in. And your operating system really does understand all the newline options, and it will terminate the line and hand it back as soon as it hits one.
The csv reader is clever enough to understand that data in the cells might legitimately contain newlines, it doesnβt just assume weβre finished each time readline gives it a line. Because the reader is a state machine, it knows whether weβre inside a cell and if the cell contains a newline it just adds that to the cellβs value. It does that by looking for the delimiter at the start of the cell. If one exists, it wonβt give up on reading new lines until it reaches the closing delimiter. If no delimiter exists, though, newline is what it looks for to terminate the cell and the row.
So if your CSV looks like this:
cell 1,cell 2,"a cell\n
with newlines"
Your CSV reader will understand that this is one row. But if it looks like this:
cell 1,cell2,a cell\n
with newlines
It will interpret it as two.
Our real problem is that the csv writer doesnβt detect these newlines (itβll get \n
because thatβs obvious, but not the others), and by default it doesnβt quote the cell contents unless it spots a special character. Itβll happily write your Paragraph Separator newline character into the file in an unquoted string, and your equivalent reader will not be able to read the row back. That is, by default the Python CSV reader is not symmetrical; it cannot necessarily read data that it wrote!
So, this CSV would work:
article_id,article_title
123456790,"This title has a\u2029 Paragraph separator"
While this would not:
article_id,article_title
123456790,This title has a\u2029 Paragraph separator
As the csv writer does not recognise \u2029
as a newline character, it does not wrap the quotes around the string like it would if it contained \n
.
The fix, once you know this, is easy; when constructing your writer in python you should do it like this:
writer = UnicodeWriter(f, quoting=csv.QUOTE_ALL)
This ensures that your csv cells are ALL quoted, so whatever weird characters wind up in there, you wonβt break the file structure. This is generally a good idea even if you are not using Python.
There are just a couple of down-sides, which might matter depending on your context:
This is not the default behaviour for the csv writer, so you have to remember to do it every time you think youβre going to deal with this kind of messy data.
It means everything is quoted, which adds unnecessary size to your file.
When we encountered this problem, there was really no discussion of it on any forums that we could find, so hopefully this will help you if you encounter the same issue.
Richard is a Founder at Cottage Labs, a software development consultancy specialising in all aspects of the data lifecycle. Heβs occasionally on the Twitters at @richard_d_jones
This post was originally published on Medium in June 2018
Top comments (0)