Question: What can I learn about data structures and efficiency by looking at Ruby’s data types—starting with arrays?
This attitude surprised me! Doesn’t ALL computing value efficiency? Isn’t sacrificing efficiency for ease-of-use…well, only a priority for beginners like me?
(I'm very curious about your thoughts on this point, especially if you've been coding a long time—please feel free to comment below!)
So, with my mind on optimization, I wanted to try to use my Ruby knowledge to explore data structures and algorithms. I figured that looking “under the hood” of Ruby would expose me to how it implements data structures, and also expose me to lower-level factors of efficiency. With some guidance from an actual computer scientist (thanks Steve!), I decided to start my investigation by looking at arrays, both the data structure and the Ruby object.
Reading up on arrays, it initially seems like the Ruby array is the exact same thing as the data structure: an ordered collection, indexed with an integer. But a few details started creeping in that helped me see the difference. Specifically, I was floored to find out that the data structure array is a “static” array, meaning its capacity (the maximum number of elements it can hold) is fixed—the overall size of an array cannot change! This is unthinkable in Ruby, because we’re constantly modifying arrays (and often their capacities) without thinking about it!
The reason arrays are "static" this way is because of memory allocation. An array of a certain capacity will need to occupy a single chunk of memory that fits that capacity. (This is why empty arrays still take up memory!) We can’t always guarantee there will be more memory nearby to just “add on” to the original chunk. So, the array’s capacity is static to ensure that you won’t get stuck needing more memory and not being able to get it.
Unlike static arrays, Ruby’s data type array is a “dynamic” array—its capacity can change. So how does Ruby’s array accomplish this?
Thanks to this StackOverflow discussion on arrays and data types in Ruby, I learned that the Ruby source code has functionality built into its arrays to automatically reallocate memory space for a differently-sized array. (I believe it happens in the code right around here, but I can’t say for sure since I haven’t learned C.) Further, Ruby is optimized to give you more array capacity than you probably need so that you don’t need to constantly reallocate memory for it.
As a quirky aside, here’s a funny little fact about these capacity changes from Pat Shaughnessy’s excellent article on data structures in Ruby:
As you add more and more elements, Ruby repeatedly increases the size of the array’s capacity in chunks whenever you cross certain array sizes: 3, 20, 37, 56, 85 elements, etc. The exact thresholds are determined by the ary_double_capa C function inside of array.c. More or less the array capacity increases by 50% each time, although the first increase, from 3 to 20, is a special case. You pay an extra performance penalty each time this happens.
I didn’t expect to find that “under the hood” of Ruby’s arrays!
Takeaway #1: Data structures have strict rules, and the flexibility offered by higher-level languages has to be explicitly built in. For arrays, one of these rules is capacity, because that determines how much memory needs to be allocated for the array. Ruby builds in flexibility for its arrays because you’ll probably want to resize them eventually anyway, and it’s easier to proceed with writing Ruby code without worrying about methods to change array capacities. Great! That explains at least one aspect of the efficiency tradeoff for Ruby’s arrays.
One aspect of arrays (both the Ruby and data structure kinds) that I wanted to learn more about was their efficiency at different tasks. From my initial research, I understood that all data structures offer advantages and disadvantages for different actions (lookup, append, insert, delete)—but since I just came from Ruby-land and was working primarily on small-scale code (no arrays with a million elements for me!), the time differences were negligible.
This summary of the array data structure and its tradeoffs helped me understand the basic strengths and weaknesses I expected to see in Ruby’s arrays—fast lookups and appends, slow inserts and deletes. But up to this point, I hadn’t taken the time to actually measure how fast Ruby’s array methods were running. (It’s really hard to conceptualize Big O notation and its usefulness if you don’t have real examples and performance numbers to look at!)
So with a little help from my Flatiron instructors, I got some code that compares the amount of time to add items to the end of a Ruby array (append, using Ruby’s << method) versus adding items to the middle (insert, using Ruby’s .insert method):
require 'benchmark' ITERATIONS = 100 def add_to_end_of_list(array) array << 99 end def add_to_middle_of_list(array) array.insert((array.length / 2).floor, 99) end factor = 1 while factor < 1_000_000 loops = ITERATIONS * factor factor = factor * 10 array1 =  array2 =  # benchmark adding to end of array start = Time.now loops.times do add_to_end_of_list(array1) end delta_append = Time.now - start puts delta_append * 1000.0 # benchmark inserting into middle of array start = Time.now loops.times do add_to_middle_of_list(array2) end delta_insert_in_middle = Time.now - start puts delta_insert_in_middle * 1000.0 puts end
The program simply loops each action a certain number of times (10, 100, 1000, up to 1_000_000) and prints out how many milliseconds it took to complete. Here’s what I got:
Pretty much as expected—appending stayed fast as the number of iterations grew, and inserting slowed down considerably. The time of appending grew at a relatively constant rate (10 times the items meant roughly 10 times the milliseconds needed), while the time of inserting was growing exponentially. (I eventually gave up waiting for the 1_000_000 insertions to run.)
The reason appending is so much more efficient for arrays is because adding a new element to the end of an array only really changes one thing in the array; none of the elements before the new one have to do anything. However, when inserting an element into the middle, all the following elements have to increase their index number by 1—and as the array grows, the number of elements that need to change their index grows too, hence the exponential jumps in milliseconds.
Here's a handy illustration of the steps needed for inserting into an array, courtesy of InterviewCake:
Takeaway #2: Efficiency for data structures can be measured by scaling up the number of times an action is performed (like appending or inserting), and measuring how the time needed grows. The shape of this growth is what Big O notation refers to! Seeing these numbers and knowing the actions that led to them definitely makes Big O notation and efficiency measurements more tangible and understandable. Plus, it helps to get some practice writing these benchmark tests and understanding their results on a more intuitive level—the next I come across a gigantic array and am asked to to insert something in the middle, I’m sure I’ll remember the boredom and frustration of waiting for the 1_000_000 insertions to run!
Learning about data structures outside of formal computer science education is really daunting, but I’m glad to see that working with Ruby has given me a window into some of its foundational pieces! Obviously, I still have much to learn and clarify—here are a few of the areas I expect to continue exploring going forward:
Linked Lists - I know that Ruby doesn’t explicitly have them, but linked lists are often described in contrast to arrays, so I think I have enough context to start understanding the benefits and costs of using linked lists in other languages.
Memory - I know I’m totally misusing terminology for memory in this article, specifically "chunk"! Chunks and heaps and stacks all have very specific meanings, and I want to understand more about them and how they relate to specific data structures.
Other Optimizations in Ruby - Pat Shaughnessy’s article also touched on “copy on write” optimization for Ruby’s arrays, wherein the same array in memory is shared for different mutations of that array. I vaguely understand how that both uses less memory and makes lookup in memory easier, but as I said above: I know I have lots more to learn about how memory works, and optimization illustrates many of its core principles!
Measuring Benchmarks - I want to deliberately build awareness of efficiency in my code, and getting comfortable writing code to measure specific benchmarks is a big part of that. I’m looking forward to developing that spidey-sense of knowing what to measure, and how, and why, and ultimately what to do/refactor in response.
Big O Notation - now that I have some code illustrating what O(1) and O(n) look like, I can start to see the bigger picture of what Big O notation communicates about efficiency!
Special thanks to Steve Geluso at Flatiron Seattle (the actual computer scientist I mentioned) for providing guidance on learning data structures, as well as the Ruby benchmark code.