Memory Efficient Data Structures

Frank Rosner on June 06, 2018

Introduction The first blog post of the RUM series introduced the RUM conjecture. It describes the trade-off between read (RO), update (...
 
Jilles van Gurp

A few years ago I was doing some data processing that wasn't quite big data, but was uncomfortably large enough that I needed to worry about these things, and I learned quite a lot along the way. Lesson number one is that you can stand on the shoulders of giants: there are lots of great libraries out there, as well as algorithms that are well documented. So, reuse before you start having too much fun reinventing wheels poorly. Especially in the Java world there are some great libraries for things that need to be fast, compact, and scalable, and Java's collections framework is pretty advanced compared to what you get in most other languages.

I was processing OpenStreetMap data to get it into JSON form for indexing into Elasticsearch. Since OSM is a normalized DB dump, that takes a bit of processing, and I ended up writing a tool for doing huge joins without having to use a database or insane amounts of memory. So I was doing lots of parsing, juggling things around in memory, and trying to sort things.

It turns out that the memory overhead of small hash tables is substantial enough that you might want to consider something more compact than Java's default HashMap if you want to do things like cache millions of node objects in memory (out of billions).
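To put a rough number on that overhead, here is a back-of-envelope sketch of what a single `java.util.HashMap` entry costs, assuming a 64-bit HotSpot JVM with compressed oops (the exact numbers vary per JVM and layout, so treat this as an illustration rather than a measurement):

```java
public class HashMapOverheadEstimate {
    public static void main(String[] args) {
        // Each HashMap.Node carries an int hash plus references to key, value and next:
        // ~12-byte object header + 4 + 3 * 4 = 28 bytes, padded to 32.
        int nodeBytes = 32;
        // Each entry also needs a slot in the bucket array; with the default load
        // factor of 0.75 that is roughly 4 / 0.75, call it ~6 bytes per entry amortized.
        int tableBytes = 6;
        // A boxed key such as a Long adds another ~24 bytes unless it comes from the cache.
        int boxedKeyBytes = 24;
        System.out.printf("~%d bytes per entry, before counting the value itself%n",
                nodeBytes + tableBytes + boxedKeyBytes);
    }
}
```

At around 60 bytes of bookkeeping per entry, millions of cached nodes burn hundreds of megabytes before the node data itself is even counted.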

For example, one thing you may want to consider is how many keys you store per object. If you have fewer than 20 keys most of the time, a double-array-list-based implementation (one list for keys, one for values) is actually faster and more compact than a typical hash table implementation. Also, if you are dealing with hundreds of thousands or millions of keys, you need to think about your hash function and bucket sizes; hash tables are not great for storing that many keys if each key ends up in its own bucket. I've done things like taking a fast hash function and simply applying a modulo 10000 to get predictable performance and memory usage. You trade lookup/insert time against memory: more buckets means more memory overhead per key but faster lookups.
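A minimal sketch of the double-array idea (a hypothetical illustration, not any particular library's implementation): keys and values live in two parallel arrays and lookups are a linear scan, which stays fast and cache friendly as long as the map stays small.

```java
import java.util.Arrays;

// Minimal map backed by two parallel arrays. Lookups are a linear scan,
// which is fine (and cache friendly) while the number of keys stays small.
// Null keys are not supported in this sketch.
public class SmallArrayMap<K, V> {
    private Object[] keys = new Object[8];
    private Object[] values = new Object[8];
    private int size;

    @SuppressWarnings("unchecked")
    public V get(K key) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                return (V) values[i];
            }
        }
        return null;
    }

    public void put(K key, V value) {
        for (int i = 0; i < size; i++) {
            if (keys[i].equals(key)) {
                values[i] = value; // overwrite existing key
                return;
            }
        }
        if (size == keys.length) { // grow both arrays together
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        keys[size] = key;
        values[size] = value;
        size++;
    }

    public int size() {
        return size;
    }
}
```

There are no entry objects and no half-empty bucket table, so for a handful of keys the footprint is close to the data itself; past a few dozen keys the linear scan starts to lose against hashing.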

I actually implemented a Jackson add-on library, jsonj, that includes some alternative hash table implementations, which allowed me to do this with reasonable heap sizes for different numbers of keys. That let me chunk the data into manageable portions (thousands of elements), each small enough that I could do in-memory joins and spit out GeoJSON shapes with the nodes organized into lines and polygons, in less than half the time it would have taken to even just import the data into PostgreSQL.

If you need to do some expensive operation iteratively where the result is positive for only a minority of cases, Bloom filters are a great way to avoid unnecessary work in the majority of cases (web service calls, DB lookups, anything involving I/O or excessive computation), and they are way more memory efficient than a cache. You simply fill the Bloom filter with the cases you know need the expensive operation and then run the big iterative operation. You get the occasional false positive, but every time the Bloom filter returns a negative you can safely skip the expensive work. Guava has a nice BloomFilter implementation; I used it a couple of times to avoid expensive lookups and cut an operation that used to take days down to hours.
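Here is a small sketch of that pattern using Guava's BloomFilter (the id sources and the expensiveLookup call are made-up placeholders): the filter is loaded with the ids known to need the expensive operation, and a negative answer lets you skip the work safely.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterGate {
    public static void main(String[] args) {
        // Expect ~1 million ids that need the expensive work, 1% false positive rate.
        BloomFilter<String> needsLookup = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        // Fill the filter with the ids known to require the expensive operation.
        for (String id : idsThatNeedWork()) {
            needsLookup.put(id);
        }

        // During the big iteration: a negative is guaranteed correct, so we can
        // safely skip; a positive might be a false positive, so we still check.
        for (String id : allIds()) {
            if (needsLookup.mightContain(id)) {
                expensiveLookup(id);
            }
        }
    }

    // Hypothetical stand-ins for the real data sources and the expensive call.
    private static Iterable<String> idsThatNeedWork() { return java.util.List.of("a", "b"); }
    private static Iterable<String> allIds() { return java.util.List.of("a", "b", "c"); }
    private static void expensiveLookup(String id) { /* web service call, db lookup, ... */ }
}
```

A negative from mightContain is guaranteed correct, so the only cost of the 1% false-positive rate is a little wasted work on the positive side, in exchange for a filter that fits in about a megabyte instead of a cache holding all the keys.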

Kyle Galbraith

Great write-up, Frank. This dives deep but also does a nice job of giving real examples.

Frank Rosner

Thank you so much, Kyle! Glad you liked it. It's hard to strike a balance between theory and practice, but as this is The Practical Dev after all, I always try to include a practical part :D