RUM Conjecture - Reasoning About Data Access

Frank Rosner on April 16, 2018

Motivation In one of my previous posts I was talking about choosing the right data model in order to avoid invalid state and to make c... [Read Full]
markdown guide
 

Surprised this hasn't gotten more comments, it's a novel way of reasoning about data structure choice. Big O is fine for measuring run times, but it does little to show the relationship between these three operations. 2 question: 1) how did you come across this? a quick google shows little writing on the subject 2) can you give an example of all 3 data structures you described? I am struggling to understand this MO = 1 structure.

 

Hi Daniel,

MO = 1 by definition can only be reached if the amount of auxiliary data = 0 (except when using approximate data structures like Bloom filters). A way to achieve optimal memory overhead for our integer example is to store all integers in a dense array of exactly the correct size. No pointers, no additional data storage involved, only the raw numbers in memory.

This post contains an example implementation for read, update, and memory optimal solutions of the integer array. The following posts (still working on the last one) focus a bit on more realistic implementations.

I stumbled upon the RUM paper when digging into database internals. There is an amazing post series On Disk IO (Pt. 4) by Alex Petrov which is where he mentions it. I also liked it as a meta view on the tradeoffs you get when designing data structures and access methods.

Glad you liked it! Let me know what you think about my other posts from this series.

 

My first thought was the Bloom Filter as well, but I was wondering what the implementation of a MO = 1 deterministic structure would look like. The rest of the series is in my reading queue :D, looking forward to getting to them. Excellent post!

Got it! Was my explanation of the deterministic MO = 1 structure understandable?

I think so, I will look for the practical one in the next piece though to make sure I get it. My sticking point is if you need to store N numbers in a set (or record that they are present in the set), how can the structure be always the same size without losing fidelity like the Bloom Filter does?

Ahhh now I see where the issue is. Please take a closer look at the definition of MO. It is not equal to the space consumed by the data structure but the ratio between space consumed and actual data being stored.

So if for each 32 bit number (base data) we store we also had to store a 32 bit pointer (auxiliary data), our MO = (32*n + 32*n) / (32*n) = (32 + 32) / 32 = 64 / 32 = 2.

Having said that it should be clear that in linear structures like lists or arrays the MO should always be constant and it also makes sense as for each element we add we have the same overhead in terms of space (e.g. a new pointer).

If you look at two dimensional data structures like trees you will see that the overhead grows with n. In a tree, adding more elements requires a growing amount of auxiliary data to be added.

As a rule of thumb you can take the asymptotic space requirement, e.g. O(n), and devide it by the number of elements in the data structure to get the asymptotic MO requirement, e.g. O(1).

Makes sense? Thanks for asking I think this part was not clear for many people, including me when I first read the paper.

Please note that in a log, i.e. the update optimal solution, we also have a linear structure but non-linear memory overhead. This is because our logical structure is a set and not a list and all the "outdated" entries in the log must be considered as they consume space although they are not really part of the logical set anymore. Do you know what I mean?

Man, totally, I completely like... read the MO definition, grokked it, then promptly forgot to think about it! Thank you, Frank!

 

Hello,
Nice post.
So you say you are using this formula on your post:
UO = (BU + AU) / BU + RO

But i do not see it in action. With said formula all the UO should not be at least 2 ?

Thanks

 

Ok I read the post again and I'm not sure if we have UO = 2 somewhere.

  • In the read optimal case, we do not have any RO associated with an update as we know the position by index without reading the data.
  • In the update optimal case, we also do not have any RO as we simply append to the log.
  • In the memory optimal case we can have UO > N, as we need to perform N reads in case the element to be inserted is not part of the set.

Does it make sense?

 

Hmmm,
For the first case, for us to know the position by index shouldn't need to read the position first ?

For the third case i did not understand what are you saying.

Thanks,

For the first case, for us to know the position by index shouldn't need to read the position first ?

What do you mean by reading the position? We can simple insert or delete the value without knowing whether it was there already.

For the third case i did not understand what are you saying.

What is it that you don't understand?

 

Usually the metric that is widly used to express the efficency of search, insert and delete operations alongside with space usage is the Big-O notation. It is more common in that regard.

 

Agreed. The RUM conjecture is more about the trade-off you get when trying to optimize read and write performance + space overhead of a datastructure. Big-O is about asymptotic complexity and can be applied to the RUM overheads. You can express them in Big-O as well, as I am doing in the second post.

I think it's not so much about the actual way to measure the overheads rather than understanding the trade-off that is formulated in the conjecture.

code of conduct - report abuse