Did you know that there is a standard for csv files? The standard is RFC 4180. I didn't know until last week. Before that, I simply assumed that comma-separated values were literally just that: values separated by commas (or some other separator character). This is a short personal story about how I messed up and how international standards could've reduced the impact.
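To see what the standard actually adds on top of "values separated by commas", here's a small illustration using Python's standard csv module, which follows RFC 4180's quoting rules (the data is made up):

```python
import csv
import io

# Hypothetical data: one field contains a comma, another contains a quote.
rows = [["name", "note"], ["Smith, John", 'said "hi"']]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
# Per RFC 4180, fields containing commas (or quotes) are wrapped in
# double quotes, and embedded double quotes are doubled:
#   name,note
#   "Smith, John","said ""hi"""
```

A reader that just splits on commas would turn "Smith, John" into two columns, which is exactly the class of problem the rest of this story runs into, only with pipes instead of commas.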
I work on a subset of systems within a large software infrastructure. One of those systems is a process that runs every night and synchronizes loads of data from and to external systems. One of the things it does is read a few database tables into csv files for another external system to pick up and process. The client had a relatively simple wish: they wanted a button so we could restart this particular part of the process if anything went wrong.
Since this part of the routine was hardcoded into the process, I couldn't just add a button, send a signal and be done with it; I had to move this part of the process to a different application in the infrastructure. Here's where the trouble started: the original application just used a standard database connection, took a list of table names, queried for the columns of each table, started a reading stream and concatenated the string values together into a file. The new application used Entity Framework, so copy-paste was not an option.
It's just a csv file, right? What could go wrong? Pretty much everything, it turns out. To write the tables to files with Entity Framework, I used CsvHelper: a package that provides easy reading and writing of csv files using strongly typed models. We let it run and checked the output: all the files were there, they were csv files, and they contained the expected data. Production ready. The release went well and we didn't hear anything until a few days later: the csv files couldn't be processed, because the format had changed.
CsvHelper implements the RFC 4180 standard; the old implementation (unsurprisingly) did not. On top of that, the issues seemed to indicate that the consumer of the csv files had also DIY'd their csv reader, so this was a recipe for disaster. We identified several issues:
The separators were supposed to be pipes (|). Although slightly unconventional, it's not unheard of. Simply changing the separator in the CsvHelper configuration fixed that.
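In CsvHelper this is the Delimiter property on the configuration object. The same idea, sketched with Python's built-in csv module purely for illustration:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="|")  # pipe instead of the default comma
writer.writerow(["id", "name"])
writer.writerow(["1", "Alice"])
print(buf.getvalue())
# id|name
# 1|Alice
```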
The values contain unexpected double quotes. This suggested to me that the consumer had built their own simplified csv reader. Had they used a well-supported library, it would likely have recognized and handled the double quotes correctly. The next issue strengthened my suspicions:
The data rows contain more columns than the header row. If there's anything CsvHelper doesn't do, it's make blatant mistakes like this. It turned out that some of the values contained pipes. This confused me: why use pipes as separators when you know the data also contains pipes? Regardless, CsvHelper had correctly identified the values containing pipes and put them between quotes. The consumer, however, assumed every pipe was a separator and ignored the quotes. The original code had simply replaced all pipes with spaces. To reproduce that behavior, I had to disable quoting and escaping and write a custom string converter that replaces pipes with spaces, just like the old code did.
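To make the column-count mismatch concrete, here's a sketch in Python (with made-up data) of what happens when a value contains the separator. A compliant writer quotes the value and a compliant reader recovers the original columns, but a naive split on pipes sees an extra column; the legacy replace-pipes-with-spaces workaround avoids quoting altogether:

```python
import csv
import io

# Hypothetical row: one value contains the pipe separator.
row = ["1", "parts|labor", "2024"]

buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(row)
line = buf.getvalue().strip()
print(line)  # 1|"parts|labor"|2024  (the offending value gets quoted)

# A compliant reader recovers the original 3 columns...
assert next(csv.reader(io.StringIO(line), delimiter="|")) == row

# ...but a naive split treats every pipe as a separator: 4 columns.
print(len(line.split("|")))  # 4

# The legacy workaround: strip pipes from the data before writing,
# so no quoting is ever needed.
sanitized = [v.replace("|", " ") for v in row]
print("|".join(sanitized))  # 1|parts labor|2024
```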
Date values are in the wrong format. I should've expected this, to be fair. CsvHelper by default uses the format bound to the CultureInfo that you pass to the constructor, the invariant culture in my case. A good standard to use for datetimes would've been ISO 8601, but the invariant culture doesn't use that format by default, and neither did the consumer. To fix this, I explicitly set the datetime format in the CsvHelper configuration.
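For comparison, this is what ISO 8601 output looks like, and what pinning an explicit format does (Python's datetime for illustration, with a made-up timestamp; on the CsvHelper side the fix was setting the format in the configuration):

```python
from datetime import datetime

dt = datetime(2024, 3, 1, 13, 30, 0)  # hypothetical timestamp

# ISO 8601: unambiguous, sortable as a string, culture-independent.
print(dt.isoformat())                 # 2024-03-01T13:30:00

# An explicit format string pins the output down regardless of the
# culture in effect, which is what the fix amounted to.
print(dt.strftime("%d-%m-%Y %H:%M"))  # 01-03-2024 13:30
```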
Some columns are missing. This one has nothing to do with standards: the columns in the database simply didn't match the properties of the model we used in Entity Framework.
I messed up pretty badly. Most, if not all, of these issues could've been prevented if I had simply compared the output of my code with the original. This failure was on me: I was careless with the output of an existing feature, and I should've delivered it based on the existing specs.
I also learned that there are standards out there for a lot of things, including things I hadn't even thought about. For these standards, there are also libraries that make it easy to follow them. Had the original feature used such a library, and by extension an international standard, most of these issues could've been prevented.
From now on, if I get to build a brand new feature, I will make sure that I'm aware of all the relevant international standards and try to stick to them using libraries that support them. This will help my fellow developers avoid finding themselves in a situation like mine. And if I have to refactor an existing feature, I'll make sure to compare the new output to the original before delivering.
So how about you? Have you ever made a big mess of your data? How have international standards helped you and do you think everyone should adhere to them? Let me know with a comment!
For now I'll say: thanks for reading and I hope to see you in my next blog! 😊