Discussion on: RUM Conjecture - Reasoning About Data Access

View post

Surprised this hasn't gotten more comments, it's a novel way of reasoning about data structure choice. Big O is fine for measuring run times, but it does little to show the relationship between these three operations. 2 question: 1) how did you come across this? a quick google shows little writing on the subject 2) can you give an example of all 3 data structures you described? I am struggling to understand this MO = 1 structure.

Frank Rosner • Jun 5 '18

Hi Daniel,

MO = 1 by definition can only be reached if the amount of auxiliary data = 0 (except when using approximate data structures like Bloom filters). A way to achieve optimal memory overhead for our integer example is to store all integers in a dense array of exactly the correct size. No pointers, no additional data storage involved, only the raw numbers in memory.

This post contains an example implementation for read, update, and memory optimal solutions of the integer array. The following posts (still working on the last one) focus a bit on more realistic implementations.

I stumbled upon the RUM paper when digging into database internals. There is an amazing post series On Disk IO (Pt. 4) by Alex Petrov which is where he mentions it. I also liked it as a meta view on the tradeoffs you get when designing data structures and access methods.

Glad you liked it! Let me know what you think about my other posts from this series.

Daniel Golant • Jun 5 '18

My first thought was the Bloom Filter as well, but I was wondering what the implementation of a MO = 1 deterministic structure would look like. The rest of the series is in my reading queue :D, looking forward to getting to them. Excellent post!

Frank Rosner • Jun 5 '18

Got it! Was my explanation of the deterministic MO = 1 structure understandable?

Daniel Golant • Jun 5 '18

I think so, I will look for the practical one in the next piece though to make sure I get it. My sticking point is if you need to store N numbers in a set (or record that they are present in the set), how can the structure be always the same size without losing fidelity like the Bloom Filter does?

Frank Rosner • Jun 6 '18 • Edited

Ahhh now I see where the issue is. Please take a closer look at the definition of MO. It is not equal to the space consumed by the data structure but the ratio between space consumed and actual data being stored.

So if for each 32 bit number (base data) we store we also had to store a 32 bit pointer (auxiliary data), our MO = (32*n + 32*n) / (32*n) = (32 + 32) / 32 = 64 / 32 = 2.

Having said that it should be clear that in linear structures like lists or arrays the MO should always be constant and it also makes sense as for each element we add we have the same overhead in terms of space (e.g. a new pointer).

If you look at two dimensional data structures like trees you will see that the overhead grows with n. In a tree, adding more elements requires a growing amount of auxiliary data to be added.

As a rule of thumb you can take the asymptotic space requirement, e.g. O(n), and devide it by the number of elements in the data structure to get the asymptotic MO requirement, e.g. O(1).

Makes sense? Thanks for asking I think this part was not clear for many people, including me when I first read the paper.

Frank Rosner • Jun 6 '18

Please note that in a log, i.e. the update optimal solution, we also have a linear structure but non-linear memory overhead. This is because our logical structure is a set and not a list and all the "outdated" entries in the log must be considered as they consume space although they are not really part of the logical set anymore. Do you know what I mean?

Daniel Golant • Jun 6 '18

Man, totally, I completely like... read the MO definition, grokked it, then promptly forgot to think about it! Thank you, Frank!