Thanks for this well written and comprehensive post. How do these data structures fare in terms of cache locality? I have heard that many libraries prefer radix trees as compared to B-trees because of the cost of L1-cache misses.
Thanks for the question! I am not that familiar with tries or radix trees. I know that are used as prefix trees for tasks like auto-completion.
But from what I can tell plain radix trees don't seem to be very cache-friendly. In a B-tree you have an array of values in each node, enabling prefetching for faster comparisons. But it seems that there are HAT-tries, an extension of radix trees, which have the same idea of storing an array of values in each node.
Do you have any link / reference to look at regarding the preference of radix trees over B-trees due to the cache behaviour? I'm curios and would love to read more :)
Regarding the cache-friendliness of the data structures presented here it depends on what you are doing. To me it seems that there is no single definition of "cache-friendly". An array is more cache-friendly than a linked list if you iterate. However when concurrently updating the data structure then a linked list allows for local modifications, reducing the amount of synchronization needed also in the cache.
Having said that, if you implement the log / journal in-memory as a linked list it will behave like a linked list. However in practice it is mostly persisted on disk, written once, such that cache friendliness is not that relevant. For LSM trees it depends on which data structures you use on each level.
Maybe someone else can also comment. Also please correct me if I'm wrong :)
I read it while looking up Judy Arrays. Their documentation is a little dated (circa 2000) but quite detailed. They mention specifically that 256-ary tree is very good in terms of cache-locality. Judy arrays are recommended to be used as sparse hash-tables though, so the use case is different from the one in your article.
What you said about cache-friendliness is true. I was thinking in terms of iteration.
Thank you for the link. I also found the post Hashtables vs Judy Arrays, Round 1 and it shows different scenarios in which either data structure excels.
It is safe to say that optimizing algorithms and data structures for the memory hierarchy, esp. CPU caches, is something where you can get a lot of performance boost. When I read Cache-Oblivious Algorithms and Data Structures I realized that there is a whole area of research around that and I know almost nothing :D
The link you shared is great. That will keep me busy for a while, thanks.
I think Data Engineering and my future are related, so I'll keep your series of articles in bookmark.
I found all the concepts separately, but not in relationship to the "real world", as in current products that use them, and not all technologies have white papers.
Glad to hear that you like Data Engineering and also that my posts are useful to you :)
I agree with you that it's sometimes difficult to figure out what data structures are used under the hood of different software products. If it is open source you can at least dig into the code. But the official websites mostly contain buzzwords and marketing figures.
If you want to know more about a specific product it's also useful to see if there are mailing lists, chats, or forums you can join. When working on this article I also talked to the developer of SwayDB, clarifying a few questions I had about the internals.
And one more thing. In case you don't know it already, there is an amazing collection of articles about the correctness and robustness of different databases and distributed systems: aphyr.com/tags/Jepsen I like to read this to get a different view on the technologies.
Thanks for the post, i was waiting for it. I will have to read it again more carrefully. Very intresting, thanks again.
These posts take me like a week to get through (just dense and I have little time) but they are sooo worth it :D, thanks Frank!
We're a place where coders share, stay up-to-date and grow their careers.
We strive for transparency and don't collect excess data.