loading...
Cover image for Ruby Compact Internals

Ruby Compact Internals

oryanmoshe profile image Oryan Moshe ・4 min read

I encountered a weird phenomenon regarding ruby's compact method, and I thought I'd share my findings with you.
When working on "The Grid" dashboard here at monday.com, I got to a point where I had an array of hashes, and some of the hashes were empty.
I needed to get rid of those.


Bottom line first

To get the result I wanted I just did a.reject { |v| v.blank? } instead of a.compact.


Expectation VS Reality.

When I used compact on the array I was a bit surprised by the result.

a = [{ key: 'value' }, { key: 'value2' }, {}]

puts a.compact
#[{ key: 'value' }, { key: 'value2' }, {}]

I expected the compact method to just go over each element in the array, check if it's present? and if it is return it.
Something like this:

def compact
  select { |value| value.present? }
end

So when I had an array with empty hashes that wouldn't go away I had to find the explanation.


The implementations behind the scenes

As usual, I dug deeper for the implementations.

From past research I know ruby's Hash also has a compact method and I know it's pretty straightforward, for each key in the hash we check if it's nil, if it's not we return it.
Super easy. Similar to what I suggested above, easy to understand, although it wouldn't solve the issue (because we use nil? instead of blank? or !present?)*

activesupport/lib/active_support/core_ext/hash/compact.rb, line 12

def compact
  select { |_, value| !value.nil? }
end

In arrays however the implementation is more complex. It is implemented purely in C.

static VALUE
rb_ary_compact(VALUE ary)
{
  ary = rb_ary_dup(ary);
  rb_ary_compact_bang(ary);
  return ary;
}

C is love, C is life

If you don't have any background in C, this next part might be less interesting to you. I'll try to make it as understandable as possible, but I highly recommend learning C if you have the time!

Pointers background

In case you don't know, the way we access our program's memory (the place where the variables are stored) is by using pointers.
These pointers are basically just an address in the memory where we can find the beginning of our variable (each variable takes a different amount of memory, we just get the required number of bytes starting at the pointer's position)

Arrays work similarly (actually, an array is literally a pointer), let's say we have an array of 3 characters, each character consists of one byte, all 3 bytes are stored one after the other in memory.
We can go to the first address in the array (or the pointer to that array), and take 1 byte, that's the first character. Then we take another, that's the second, and then the third.

char arr[3] = {'m', 'i', 'p'}
/*
Behind the scenes it's stored like so:
... 6d 69 70 ...
    ^ arr[0] == *arr
... 6d 69 70 ...
       ^ arr[1] == *(arr + 1)
... 6d 69 70 ...
          ^ arr[2] == *(arr + 2)
*/

Another thing to note: in C, if we have a pointer stored in a variable, let's say ptr, this variable will contain the address to the value. To get to the value itself we have to use the "dereference" operator (an asterisk before the variable name) like so *ptr.

Back to code

Above we saw the C implementation of ruby's array compact, we can see a function that receives an array, duplicates it (using the rb_ary_dup function), compacts it using the rb_ary_compact_bang function ("bang" is the same as the "!" symbol, so it basically runs compact!), and then returns the duplicated, compacted array.

To find that rb_ary_compact_bang function I went to ruby's source code:

ruby/array.c, line 5022

static VALUE
rb_ary_compact_bang(VALUE ary)
{
  VALUE *p, *t, *end;
  long n;

  rb_ary_modify(ary);
  p = t = (VALUE *)RARRAY_CONST_PTR_TRANSIENT(ary); /* WB: no new reference */
  end = p + RARRAY_LEN(ary);

  while (t < end) {
    if (NIL_P(*t)) t++;
    else *p++ = *t++;
  }
  n = p - RARRAY_CONST_PTR_TRANSIENT(ary);
  if (RARRAY_LEN(ary) == n) {
    return Qnil;
  }
  ary_resize_smaller(ary, n);

  return ary;
}

Don't close the post! This code isn't friendly for people with no experience in C, but most of it is memory management, some games of allocating memory correctly, resizing arrays to prevent memory leaks etc. I can do a deeper dive in a seperate post if anyone wants.

The interesting part is this one:

while (t < end) {
  if (NIL_P(*t)) t++;
  else *p++ = *t++;
}

Line by line

As I explained above, an array is just a set of variables stored one after the other in our program's memory, so we can get a pointer to the first variable, and then just increase it by 1 every time to get to the next one (t++ increases t by 1)

What we have here is a loop that goes from the beginning of the array, with our current position in the t variable (a pointer to the array), and we iterate over every element until we reach the end of the array (the address of the beginning of the array + the length of the array)

Every time, we check whether the current value is nil (using the NIL_P macro defined in include/ruby/ruby.h line 482)

If it is nil we just increase the pointer, advancing one spot in our array.
If it's not nil we put the current value from the original t (remember, the * dereferences the pointer) into another array we keep stored in the pointer p, and then advance the pointer one spot.

Bottom line

In a nutshell, it's the same as doing a.reject { |v| v.nil? } but more memory aware and efficient.

Just to show the difference in efficiency I benchmarked both ways against an array of 20,000,000 elements with alternating integers and nils.

As you can see, array's compact method using C null pointer elimination is 6.6 times faster than the reject method!

I hope you enjoyed learning a bit about ruby's internal mechanisms!

* This is the activesupport implementation of Hash compact, if you'd like to see the C implementation you can go to ruby/hash.c line 4110, but it's pretty similar!

Posted on by:

Discussion

markdown guide