If you have used include? on an array while evaluating comparisons then you may want to look at sets instead!
Before we set off on our quest for knowledge we have to know why this quest is so important. We are not going to be saving a princess from a burning building, but instead 17 million princesses from a select 500 thousand burning buildings. (Truth be told there is no fire in this, but I wanted to add a bit of excitement to the story)
I was recently tasked with parsing a CSV file that had over 17 million records in the file and was roughly 8GB. When given the params I initially said "no problem, that would be a welcomed 2-hour break from my other project". I would soon regret those words when I downloaded the files to see that I needed to combine fields from file A which contained roughly 17.9 million records and file B which contained the IDs for 500 thousand records along with some other data that had already been manually by someone.
My first attempt, was not pretty because of my prior experience with modifying CSVs. I had not had to modify anything larger thank 50k items at most, which is 1/10 of the smallest dataset in this task.
My second attempt was feeling around how to parse the data properly and then trying to parse the biggest files. Again it proved to be a failure, but lead me to the correct way of parsing files, just not the correct way of dealing with that much data.
My third attempt was a step in a positive direction, giving me the appearance that I was doing things in a better way, however, looking back at it I added un-needed complexities that hurt the performance rather than boosted it. The concept was simple, create a list of item ID's then be sure there are no duplicates, and finally iterate through the largest file only performing modifications on data that was needed in the final iteration.
My fourth attempt was another positive step because it de-dupped the number of records in the largest file, only processing the ones that need to be processed, but does that inherently add performance.
Looking at both the third and fourth attempts I did fall into a Ruby trap here, instead of thinking in terms of data I was thinking in terms of ruby weaponry. Knowing what I could accomplish easily with Arrays and not looking at that as a potential downfall.
My fifth attempt, well there were many more than 5 attempts, but let us assume for this article that I was smart enough to only do it in 5 attempts with the help of StackOverflow. Basically I got smarter in handling my hashes, writing CSV, and switching to sets for comparisons as recommended from StackOverflow.
This attempt is the attempt that saved all the princesses and put out the fires without even blinking. Before attempt 5 the maximum amount of time I let the script run was about 5 days because of a long weekend. That is an enormous amount of time, but after attempt 5 reprocessing took 7-10 minutes. It made me feel ill, but also very happy.
My first mistake was unfamiliarity with large datasets leading me down the road to being decapitated by the big O notation dragon who guarded the princesses. Moreover, it reveals that when programming I need to be more cognizant of more performant processes and not just default to things I've done in the past.
Ruby is like having a locker of ready to go weapons that can accomplish tasks quickly, but instead of saving the princesses with my awesome sword of justice, I cut my arm off. It wasn't me who realized this though, it was a user(Casper) on StackOverflow who pointed out my mistake.
My approach was to use an array of items to check for inclusion which in my given subset of objects was going to be a minimum of 500,000 each time I went over a single record in the 17 million records, but in fact I was doing this twice with processed items, so that was growing the number of checks which at its smallest amount was 500,000 and largest was 1 million. This means that 17 million records made a million checks on objects before it ever did any actual processing. Benchmark testing on these items showed me the error of my ways and showed me that the script as I had it in iteration 4 would have finished in 156 days.
I can't blame Ruby, it was doing exactly what I told it to do. Instead I need to blame myself for not thinking of performance in such a simple little script. I can blame in-experience, but I think the biggest thing to look at is the ruts of everyday programming. As a junior developer, you get things working and just try to get by figuring things out, as a senior level developer you need to look more at performance, maintainability, and reliability. In that same retrospect you need to get out of the ruts you may have developed through your working life.
I wanted to see what this comparison of sets vs arrays in comparing simple numbers such as an ID. I wrote a simple benchmark test and ran it similarly to running the other script just to see the output or performance of set vs array. In comparisons sets are 14 times faster than arrays and I find that quite amazing!