Motivation
In one of my previous posts I was talking about choosing the right data model in order to avoid invalid state and to make code more understandable. In addition to choosing the right data model it is also important to choose the right implementation. You might have to decide whether to use a hash or a tree based implementation for your set. For persisting data inside a database, you also have different options, e.g. SQLite, PostgreSQL, or Cassandra. How to decide which one to use?
Data systems researchers developed a theoretical model to reason about, and compare different data structures in terms of their overhead when reading (R), updating (U), and storing the data in memory (M). They formulated the results in the so called RUM conjecture [1]. In this blog post we are going to look at the RUM overheads and the trade-offs associated with them. We are also taking a look at the conjecture and its implications for us as developers.
This blog post is the first part of a series about data structures focusing on the access overheads:
- RUM Conjecture - Reasoning About Data Access
- Read Efficient Data Structures
- Update Efficient Data Structures
- Memory Efficient Data Structures
The post is structured as follows. The first section formally defines RUM overheads. The next section introduces the RUM conjecture, which explains how the three overheads are connected and influencing each other. We are going to look at some example implementations as well. Afterwards there will be a discussion about the implications of the RUM conjecture for us as developers. We are closing the blog post with an outlook for future posts.
RUM Overheads
Given base data as the main data that is stored in the data management system, and auxiliary data as additional data such as indexes to improve access performance, we can define the following access metrics:
- BR as the amount of base data read
- AR as the amount of auxiliary data read
- RR as the amount of data returned by a read query
- BU as the amount of base data updated
- AU as the amount of auxiliary data updated
- BM as the amount of space required to store the base data
- AM as the amount of space required to store the auxiliary data
The authors in [1] define the RUM overheads as follows:
- Read Overhead RO = (BR + AR) / RR
- Update Overhead UO = (BU + AU) / BU
- Memory Overhead MO = (BM + AM) / BM
While this definition seems intuitive in my opinion there is one problem with it. When performing an update you might need to perform read operations first (e.g. to find the value to update), depending on the access pattern and data structure you are using. This is not captured by the update overhead described in the paper.
In my opinion it makes sense to include the read overhead in the update overhead calculation if an update requires a read. In a later section we are going to see a simple example and also why it matters. For the rest of this blog post we are thus going to use the following definitions:
- Read Overhead RO = (BR + AR) / RR
- Update Overhead UO = (BU + AU) / BU + RO
- Memory Overhead MO = (BM + AM) / BM
Now let's take a look on how the different overheads relate to each other when it comes to choosing the right access method for your data.
RUM Conjecture
Definition
The RUM Conjecture. An access method that can set an upper bound for two out of the read, update, and memory overheads, also sets a lower bound for the third overhead. [1]
In other words choosing the right access method becomes an optimization problem. E.g. if your data structure should support fast reads and consume little space you have to accept higher update overhead. To visualize this effect we are going to take a look at an example which varies a bit from the one in the paper.
Minimizing RUM Overheads
Consider the following problem: We want to store a set of N integers (i.e. no duplicates allowed). To get an intuition of the trade-off between the different overheads we are going to look at three different implementations, each one minimizing one of the three overheads. For simplicity reasons let's assume that the numbers we can store are between 0 and 1000.
The given examples are very simple solutions. You are probably aware of other data structures that would be more suitable (e.g. a hash or tree based implementation). Looking at the following implementations, however, gives a good intuition on how minimizing one overhead affects the others.
Minimizing RO
To minimize RO we can store the integers in a boolean array a
of size 1001
. The array elements are initialized with false
. To mark an integer i
as member of the set, we set a[i] := true
. A point query can now look up the membership directly by accessing the array at the corresponding index. Updates are performed the same way and don't require an additional read as we can access the element by index. The space required for this data structure depends on the possible values that can be stored in the set and it theoretically infinite.
When choosing this implementation, we derive the following RUM overheads of the RO-minimizing implementation:
- RO = 1
- UO = 1
- MO → ∞
Minimizing UO
To minimize UO we store each insert and delete operation in a sequence which supports constant time appends (e.g. a list). Updates correspond to appending the value and the operation (whether it is inserted or deleted) to the update log. A point query requires us to walk through the log starting from the latest update. If we find an insert operation of the value we are looking for it is a member of the set. If we find a delete operation or reach the end of the log the value is not a member. The space required for this data structure grows with every update operation.
We can derive the following RUM overheads for the UO-minimizing implementation:
- RO → ∞
- UO = 1
- MO → ∞
Minimizing MO
To minimize MO we store all values of the set in a dense array. A point query now requires a full scan of the data. Update operations also require a full scan: An insert needs to check if the value is already there to avoid duplicates and a delete needs to remove the value from the sequence. Also on every delete we have to fill the gap. The space required however is minimal, as only the integers are stored and no auxiliary data is required.
This gives us the following RUM overheads for the MO-minimizing implementation:
- RO ≤ N
- UO ≤ N + 2
- MO = 1
Implications
When choosing a physical representation of your data (or choosing a database which uses a specific representation), there will always be a trade-off between read, update, and memory overhead. Taking the memory hierarchy (see my previous blog post) into account the overheads have different impact, depending on whether the data is accessed within the CPU cache, the main memory, or a persistent storage for example.
Having this in mind it makes sense to first understand what the data you are going to store looks like and which are the typical access patterns going to be. Will there be many writes? What is the required response time of a read? How much space do you have available?
Outlook
Going forward I want to shed more light on some common data structures focusing on low read (e.g. B-Trees [2] and Skip Lists [3]), write (e.g. Log-structured Merge Trees [4]), and space (e.g. Bloom Filters [5] and ZoneMaps [6]) overhead. There is also research about data structures that can adapt based on the access patterns (e.g. database cracking [7] and adaptive indexing [8]). So stay tuned for the upcoming posts.
Did you ever take the RUM overheads into account when selecting a data structure or data base? Are you aware of the internals of the databases that you are using every day? Do you agree with my change to the update overhead, incorporating also the read overhead? Let me know your thoughts in the comments.
References
- [1] Athanassoulis, M., Kester, M.S., Maas, L.M., Stoica, R., Idreos, S., Ailamaki, A. and Callaghan, M., 2016. Designing Access Methods: The RUM Conjecture. In EDBT (Vol. 2016, pp. 461-466). PDF
- [2] G. Graefe. Modern B-Tree Techniques. Found. Trends Databases, 3(4):203–402, 2011.
- [3] W. Pugh. Skip lists: a probabilistic alternative to balanced trees. CACM, 33(6):668–676, 1990.
- [4] P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
- [5] B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. CACM, 13(7):422–426, 1970.
- [6] P. Francisco. The Netezza Data Appliance Architecture: A Platform for High Performance Data Warehousing and Analytics. IBM Redbooks, 2011.
- [7] S. Idreos, M. L. Kersten, and S. Manegold. Database Cracking. In CIDR, 2007.
- [8] G. Graefe, F. Halim, S. Idreos, H. Kuno, and S. Manegold. Concurrency control for adaptive indexing. PVLDB, 5(7):656–667, 2012.
- Cover Image by O'Dea at WikiCommons, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=12469449
Posted on by:
Frank Rosner
My professional interests are cloud and big data technologies, machine learning, and software development. I like to read source code and research papers to understand how stuff works.
Discussion
Surprised this hasn't gotten more comments, it's a novel way of reasoning about data structure choice. Big O is fine for measuring run times, but it does little to show the relationship between these three operations. 2 question: 1) how did you come across this? a quick google shows little writing on the subject 2) can you give an example of all 3 data structures you described? I am struggling to understand this
MO = 1
structure.Hi Daniel,
MO = 1 by definition can only be reached if the amount of auxiliary data = 0 (except when using approximate data structures like Bloom filters). A way to achieve optimal memory overhead for our integer example is to store all integers in a dense array of exactly the correct size. No pointers, no additional data storage involved, only the raw numbers in memory.
This post contains an example implementation for read, update, and memory optimal solutions of the integer array. The following posts (still working on the last one) focus a bit on more realistic implementations.
I stumbled upon the RUM paper when digging into database internals. There is an amazing post series On Disk IO (Pt. 4) by Alex Petrov which is where he mentions it. I also liked it as a meta view on the tradeoffs you get when designing data structures and access methods.
Glad you liked it! Let me know what you think about my other posts from this series.
My first thought was the Bloom Filter as well, but I was wondering what the implementation of a MO = 1 deterministic structure would look like. The rest of the series is in my reading queue :D, looking forward to getting to them. Excellent post!
Got it! Was my explanation of the deterministic MO = 1 structure understandable?
I think so, I will look for the practical one in the next piece though to make sure I get it. My sticking point is if you need to store N numbers in a set (or record that they are present in the set), how can the structure be always the same size without losing fidelity like the Bloom Filter does?
Ahhh now I see where the issue is. Please take a closer look at the definition of MO. It is not equal to the space consumed by the data structure but the ratio between space consumed and actual data being stored.
So if for each 32 bit number (base data) we store we also had to store a 32 bit pointer (auxiliary data), our MO = (32*n + 32*n) / (32*n) = (32 + 32) / 32 = 64 / 32 = 2.
Having said that it should be clear that in linear structures like lists or arrays the MO should always be constant and it also makes sense as for each element we add we have the same overhead in terms of space (e.g. a new pointer).
If you look at two dimensional data structures like trees you will see that the overhead grows with n. In a tree, adding more elements requires a growing amount of auxiliary data to be added.
As a rule of thumb you can take the asymptotic space requirement, e.g. O(n), and devide it by the number of elements in the data structure to get the asymptotic MO requirement, e.g. O(1).
Makes sense? Thanks for asking I think this part was not clear for many people, including me when I first read the paper.
Please note that in a log, i.e. the update optimal solution, we also have a linear structure but non-linear memory overhead. This is because our logical structure is a set and not a list and all the "outdated" entries in the log must be considered as they consume space although they are not really part of the logical set anymore. Do you know what I mean?
Man, totally, I completely like... read the MO definition, grokked it, then promptly forgot to think about it! Thank you, Frank!
Hello,
Nice post.
So you say you are using this formula on your post:
UO = (BU + AU) / BU + RO
But i do not see it in action. With said formula all the UO should not be at least 2 ?
Thanks
Ok I read the post again and I'm not sure if we have UO = 2 somewhere.
Does it make sense?
Hmmm,
For the first case, for us to know the position by index shouldn't need to read the position first ?
For the third case i did not understand what are you saying.
Thanks,
What do you mean by reading the position? We can simple insert or delete the value without knowing whether it was there already.
What is it that you don't understand?
Usually the metric that is widly used to express the efficency of search, insert and delete operations alongside with space usage is the Big-O notation. It is more common in that regard.
Agreed. The RUM conjecture is more about the trade-off you get when trying to optimize read and write performance + space overhead of a datastructure. Big-O is about asymptotic complexity and can be applied to the RUM overheads. You can express them in Big-O as well, as I am doing in the second post.
I think it's not so much about the actual way to measure the overheads rather than understanding the trade-off that is formulated in the conjecture.