This post is for developers, who just got a huge spreadsheet of data and have to import them to the application. Recently I was in that position and I have a couple of takeaways how to make the process easier, faster and less painful.
1. Make a copy of a spreadsheet and fix it
Non-technical people in general see data much different from a developer. You will have to remove duplicates, standardize fields, maybe add additional fields and probably remove a lot of useless data. This step is to make the parsing easier and remember, that most of the time it's easier to change data in the document, than making sophisticated mechanisms in the code.
- For duplicated rows, try adding just another column as a flag field (e.g. products duplicated in different packets - add fields/columns
IsPacketA
,IsPacketB
, etc). - Avoid lots of data in one column - split everything that you think should be different property, but don't overdo -you are not creating a relational database here.
- If some data duplicates are necessary (e.g. product variants - color, size, etc.), make sure that data is consistent. Look for typos and similarities (e.g. XL and big - use one standard for size) and always split such data in the same way (comma
,
worked best for me). - Check floating point character - using comma may cause incorrect parsing, so the best idea would be to use en-Us regionals in the document.
2. Export data to CSV
Working with plain text is much easier, than parsing rows and cells. In C# there is a super simple NuGet called CsvHelper that will load all data in a few lines of code. Make sure that delimiter may conflict with your data - for me, semicolon worked perfectly. I tried working with TSV but in my case it was missing new lines and some empty columns.
3. Start coding with difficult data
Now the data is ready to be loaded in the code. Before developing anything, watch for new lines and special characters - they might be parsed incorrectly and/or cause the application to crash. Maybe you will have to go back to fix the data in the spreadsheet but it much more time-consuming now, than at the later stages.
4. ID's, keys
Store id's/keys in the spreadsheet from step 1. This will help work with the changes at later stages - it will be easier to locate row by key in the database to update it. Guid generator will help you with generating unique values.
5. Explain the "fixed spreadsheet" to the author
Having two sources of data is a nightmare. You will encounter situations, where you will have to investigate, which document is newer, merge changes, resolve conflicts... To avoid this, explain your "fixed document" (step 1) to people who created the source document.
6. Finally, track changes in the document
Google Sheets makes it easy with version history. Maybe think about automating the process.
Top comments (0)