Edison Yap

Posted on Feb 12, 2019

How arrays work the way arrays work

#ruby #computerscience #c

In computer science, there is the concept of linear data structure, which means that the datas are structured in a linear way in which order matters. There are Array and Linked Lists, but today I'll be talking mostly about arrays only, and a little about linked lists.

Most object-orientated languages come with Arrays, whereas most functional languages comes with Linked Lists (see why in another one of my article, mentioned at the bottom of the article).

There's a good reason for this differentiation which we will dive into later, but for now let's just have a quick revision about what are the differences between the two data structures, and to know this, we'll need to take a trip down memory lane.

Rewind Time

Objects and functions and everything that we know about computers, are fundamentally stored in bits and bytes in computer.

In languages like Java and C, you have to explicitly declare the size of an array beforehand.

Hold up, but Ruby does not do that?

In Ruby, we use Array for our linear data structure, we can add seemingly infinite stuffs into an Ruby array and it would not matter as far as we are concerned.



array = []
while true
  array << 0
  puts "#{array.length}"
end

That's great isn't it?! That means arrays are infinitely large right? And that Ruby is the superior language? Lucky us!

But not so fast. *pops your bubble*

There are no infinite size array, what you see in Ruby is what we call a Dynamic Array, and it comes with a cost.

To understand what dynamic arrays are, let's first take a look at how arrays are represented in memory. Since MRI Ruby (Matz' Ruby Interpreter) compiles down to C code, we'll look at how arrays are represented in C.

C-ing is believing

We'll dive into a little bit of C code to help us C a little better.

In lower level languages like C, you have to deal with pointers and memory allocation yourself, and even if you had not dealt with C before (Disclaimer, neither have I), you may have C-een one of the most (in)famous example below:



/* Example */
ptr = (cast-type*) malloc(byte-size)

/* Usage */
ptr = (int*) malloc(100 * sizeof(int));

taken from https://www.programiz.com/c-programming/c-dynamic-memory-allocation

Let's break down this bits of code:

malloc doesn't have any magic meaning behind it, it literally stands for memory allocation
malloc returns a pointer
malloc takes in an argument, which is the size the memory you want the program to allocate for you.
100 * sizeof(int) tells the program that we want to store 100 integers, so allocate us 100 * the size of what each integer would occupy.
ptr/pointer stores reference to the memory address.

Timmy stores luggage!

If the above example did not really make sense, try this analogy. Think of memory allocation as like, a luggage concierge. It works like this:

Timmy goes to the counter, tells the concierge that he have 2 pieces of luggages, about this big, and that that he'd like to store it in the storeroom.
The concierge takes a look over the store room and go "Yes, we've got some room at the designated B4 area and will allocate that space to store your luggages".
They hand Timmy a pick-up card with the designated area on it, B4 in our case.
Timmy is happy, goes around doing whatever, and when he wants to pick-up his luggages, he goes back to the counter and show them his pick-up card. "Have you C-een my luggages?"

In our example, Timmy's luggage is the data, the pick-up card is the pointer (it states where Timmy's bag is stored). The place that the concierge is storing Timmy's luggage is the memory block, and the counter is the program.

By showing the counter (the program) Timmy's card (pointer/memory address), Timmy can retrieve his luggage (data). Bonus? Because they know exactly where Timmy's bag is stored -- B4, this means that they can retrieve all Timmy's luggages relatively quickly!

Also, ever wondered why you access elements in array with index, like so?



array[0] # => first element
array[5] # => sixth element
array[99] # => hundredth element

This is because array holds the references to the memory block, and the index is telling it the offset.

An analogy of that is if I ask you to look for Timmy in a queue of 20 people, you would logically have to ask each of them if they were Timmy -- But, if I told you Timmy is the 6th (index) from the first guy (your original pointer), you know exactly where to look.

Retrieving elements in arrays are fast exactly because of this - the program doesn't have to look through all 100 elements to find what you're looking for - if you've got the index, it just has to add the offset to the original memory address, and the droid that you were looking for will be right there!

What are dynamic arrays then?

So I've told you a little about how array is represented in memory, but now it's time to talk about some cons.

Remember how you have to explicitly declare the amount of memory you need? This means that the array will find a spot that will fit exactly your size; there's no guarantee that it will be able to fit more than what you have (because the memory block right behind it might be occupied).

Back to our luggage analogy, think of it as if Timmy was to store 2 pieces of luggages, and B4can store exactly 2 luggages, so they allocate that to Timmy. Now for some reason Timmy wants to store another piece of luggage, but B4 can't store 3 luggages, only 2, so what do they do?

They take all of his existing luggages, move it into a new spot that can fit more than 3 luggages, and then store all of them together.

That is an expensive operation but it's exactly how memory works too!

In Ruby, you don't have to declare a specific size before hand, but that's because Ruby handles it for you automagically through dynamic array.

What dynamic array does is that if the array is nearing its full capacity, it'll automatically declare a new, bigger array and move all existing elements into it, and the old array is then garbage collected. How much bigger? The growth factor is usually 2; double the size of the current array.

In fact, don't take my word for it.

Ruby has a ObjectSpace module that allows us to interact with current living object in memory. We can use this module to take a peek at the memory usage of our dynamic array, sounds exactly like what we want!

I have written a small Ruby script that calculates the growth factor of the dynamic array. Feel free to take a look at it here, and if you do, you can see that Ruby arrays have a 1.5x growth factor (that is, they make an array that's 50% bigger on every copy)

I know what arrays are, what are linked lists?

Keep in mind that although arrays and linked lists are both considered as linear data structures, they have one big difference amongst them.

Elements in an array are stored literally right next to each other in memory (so we can have index for fast lookups), but nodes in linked lists have no such restriction (which is why there's no index lookup for linked lists) - each and every item can be stored anywhere on the memory block and it'd be fine.

It's almost as if Timmy's trying to store 5 luggages, and the concierge does not have a space and decides to leave 5 of them all over the place. Sounds unorganized?

Also if they are stored in different places, how do you know which bags are Timmy's? Hint: Just keep track of the next node/bag! In our case, the concirege keeps them separately but with a tag on each of them that points to the next bag.

A node in linked list consists of two parts - the data part and a pointer to the next node. This is how they're able to maintain the linear part of it -- they still have the concept of order, they just don't have to be stored in order literally!

node = [ data | pointer ]

For example, given the following example stored in memory:

[C | D] [A | B] [B | C] [D | nil]

This bits look like it's out of order - but if I had told you that the first element is A, you would be able to tell me the exact order of the list:

list = [A -> B -> C -> D -> nil]

There's a lot of interesting things that you can do with linked lists which I'm not going to dive into (also a lot on Big O that I didn't talk about), there are already a tonne of good articles out there on data structures. If you made it here, I suggest you to read from Ali's blogpost here.

thank u, next: an introduction to linked lists

Ali Spittel ・ Dec 5 '18

#datastructures #beginners #programming #javascript

You can also read more about List/Cons on Wiki here

Closing Note

I initially wrote this article for a slightly different topic - [ Elixir | Why Linked List?], but found it took way too long to explain how arrays work before I was able to explain explore why Elixir uses linked lists, hence I have separated them into two separate articles.

In that article I talk about why functional langauges use linked lists as their linear data structure. Do check it out!

[ Elixir | Why Linked Lists? ]

Edison Yap ・ Feb 12 '19

#elixir #computerscience #functional

Sources

1) https://medium.com/@rebo_dood/ruby-has-a-memory-problem-part-1-7887bbacc579 - This is where I found out about additional ObjectSpace methods by requiring it

Top comments (7)

Cheah Jer Fei • Mar 25 '19

Thanks for this nice article! I finally know how Ruby implements dynamic array 😂

But does object works the same way in Ruby? I know array is also an object but I mean instance of a self defined class,. Cause object is also dynamic in Ruby, how do they manage when it grows and takes up too much of memories?

Edison Yap • Mar 25 '19

Thanks for the nice words!

I don't think you have to treat them any different - array or object or something else, they're still represented as just values on memory. So same thing would happen with objects.

I believe if you happen to obtain too much memory (either due to inefficient GC handling or you're working with large datasets), the process will just crash.

Fun Fact: Everything in Ruby is initialized with memory size of 40.

require 'objspace'

Foo = Class.new

ObjectSpace.memsize_of(Foo.new)
# => 40

ObjectSpace.memsize_of('a')
# => 40

ObjectSpace.memsize_of([])
# => 40

ObjectSpace.memsize_of({})
# => 40

A quick dig around shows that it's because one slot on Ruby heap == 40 bytes, that's where our magic 40 comes from :)

Cheah Jer Fei • Mar 25 '19

Yeah I understand that every object in Ruby takes up a fixed size of 40 bytes, was wondering is it possible for it to grow beyond 40 bytes, like too much fields etc 🤔If so, what's gonna happen...

Btw, I saw your snippet gist.github.com/edisonywh/c61b3ab5... , you divided memory by 8 to find memory_in_bytes, but isn't memsize_of returning memory size in bytes?

Edison Yap • Mar 25 '19 • Edited

I took to your question and tried to look into it a little, and found some interesting stuffs!

require 'objspace'

class Foo
  attr_reader :string
  def initialize
    @string = 'a' * 10_000
  end
end


instance = Foo.new
string = instance.string

puts "Instance size: #{ObjectSpace.memsize_of(instance)}"
#=> 40
puts "String size: #{ObjectSpace.memsize_of(string)}"
#=> 10041

I don't think the value here matters, but what's really interesting to me is that it looks like the memory allocation for the object and its attributes are separated - I think the object instance would just hold reference to its attribute, thus not necessarily bloating the object itself.

Now I have no idea how this works, because this doesn't seem very efficient for garbage collecting.

P.S: Take what I said with a grain of salt as I don't have much a clue!

And you're right! Looks like I made a mistake in my gist, thanks for pointing out!

Edit:

I was thining, if Ruby objects really only store a reference to another part of the memory (as demonstated with the snippet above), then what happens when the pointers itself grows?

To do that I tried something else, here's the snippet.

require 'objspace'

class Foo
  attr_reader :string
  def initialize
    1000.times do |i|
      instance_variable_set("@a#{i}", i)
    end
  end
end

instance = Foo.new
puts "Foo size: #{ObjectSpace.memsize_of(instance)}"
#=> Foo size: 8904

Surprise surprise, the instance itself is now 8904 bytes! (again, I'm not sure how much the value itself matter here), so it looks like the object does get bloat if your have too much fields/attributes.

Well, thanks for asking great questions! I don't think I have all the full answers, but I had a lot of fun trying :) Hope they answered some of you questions as they did mine too!

Cheah Jer Fei • Mar 26 '19

Woah that's cool! So object does get expanded when it has too much fields, I wonder why would they still set the 40 bytes upfront 🧐

Thanks for taking your time trying to answer my questions, if you don't mind, maybe we can talk more privately? I always wanted to learn from good people 😝

Edison Yap • Mar 26 '19

Sure! It's always nice to see local talents online, so would definitely love to connect :)

Cheah Jer Fei • Mar 27 '19

Just added you on fb, or would you prefer other way to connect?

DEV Community