Michał Knasiecki

Posted on Dec 12, 2023

How to reduce MongoDB database storage size by 95% with no effort?

#mongodb #database #json #programming

One of the key things that I dedicate a lot of effort and attention to in my daily work is the size of my databases. Not everyone understands this, and some people may say I'm wasting time and money because, after all, disks are cheap. However, the truth is that I don't do it because of the disks, or at least not only. After all, keeping the size of the database under control has many other advantages:

the smaller the size of the data, the more of the so-called workingset will fit into the RAM memory and the more efficiently the database will work; in short: smaller databases work faster;
small databases backup faster, which is especially important if you manage hundreds of databases and for each of them you must guarantee a backup, let’s say, once every 24 hours or even more often;
in case of failure, a small database can be restored from the backup faster than a large one, so our system will return to operation more quickly;
and finally, of course, infrastructure costs depend on the size of the database, so smaller databases are cheaper. This applies not only to on-prem environments, but also to the cloud; for many providers, the cost of renting servers depends largely on the size of the disks.

So it's safe to say that the size of the database is one of the most important parameters we should keep an eye on. The developers of MongoDB also came out of a similar assumption by giving us a binary BSON, many different types of data, built-in compression, or dedicated time-series collections that can pack data even more densely.

However, it turns out that in special cases we can go one step further and use non-standard solutions, such as probabilistic data structures. They will allow even more efficient use of disk space, although there is a cost to be expected here, which we will discuss below.

Memory-efficient presence test

One such data structure is the Bloom Filter. It is an implementation of a memory-saving structure that is used in the so-called presence test.

To explain how it works, let's refer for a moment to one of the most popular data structures: a set. As we know, you can add elements to a set, you can remove elements from it, and you can also check whether an element is in the set.

Sets are very useful, but not perfect; one disadvantage of this structure is the linear dependence of the data size on the number of its elements: the more elements we put to the set, the more memory it will need to store it. This can be a big problem if we want to operate on a very large number of elements. Especially, if we are not really interested in the contents of the set, but only want to be able to check if some element is in it. The operation of checking whether an element is part of a set is called a presence test.

And this is where Bloom Filters prove to be helpful: we can use it to test whether elements have been placed in it, albeit without being able to extract those elements. Thus, we can't iterate through the elements of the filter, as we can with a set, we can only - knowing the element - check whether it has been placed in the Bloom Filter.

An interesting feature of Bloom Filters is their fixed size (up to a certain maximum number of elements the filter can handle, of course): there’s no linear dependence of the data size on the number of its elements.

Unfortunately, you have to pay a price for this fixed size: Bloom Filter can sometimes return false-positive results, i.e. state that a certain element is in it, while it was actually not added to it. On the other hand, a false-negative result won't happen, so if the filter claims that an element is absent, it definitely wasn’t added to it.

There are, of course, many fields in which a false positive result is unacceptable; one example would be medicine. However, there is quite a wide field of applications where we can accept a small number of false positive results in exchange for an efficient way of storing information.

Article recommendation system

Imagine building a social media type system in which some users create content for others. An obvious feature of such a system would probably be some sort of system for promoting posts. We would like to recommend to users new posts that match their interests. We can expect that our system will hopefully have so many posts to promote that we would like to present each suggestion to a given user only once. This means the need to save for each user the URL of posts whose suggestion has already been presented to them once. It will prevent us from suggesting content to users that has already been recommended before.

There are two ways to model the data described above.
It could be done in the analogous way in which we would approach this problem in the case of a relational database. So, we would keep our users in one collection and the information about the articles they visited in another. An additional collection mapping identifiers would allow this to model an M:N relationship.

This way, although possible, is not likely to be recommended for non-relational databases.
When modeling this data in MongoDB, we would probably use the ability to model complex documents. This would allow us to keep all the data in one collection, where each document would describe the user, and one of its fields would be a set containing the URLs of recommended articles.

One of the advantages of this method is simplicity: instead of two (or even three) collections, we only need one; we can also avoid using $lookup aggregator which might be expensive to use on a big database. On the other hand, however, the single-collection solution can also cause performance problems: the longer the list of articles, the more data we will have to load into memory when loading a user document. The relationship will be linear in this case.

If, in addition, we do not need to know the actual URL of the articles, but only want to be able to check whether an article has been recommended to the user, it is worth considering an alternative solution based on Bloom Filters.

This is because, instead of storing a set of recommended articles, you can store a serialized Bloom Filter in the document, which should be more compact than using the set. This means that the collection of users will be much smaller than if you store the URLs list directly.

Sounds promising, so let's see how it works in practice. And the best way to do it is to perform a simple experiment.

Time to do some experiments

There are many implementations of Bloom Filters, since I am a Java/Kotlin developer, I will use the guava library for my tests. This library has a nice additional feature: its Bloom Filter implementation includes a writeTo method that stores the filter in a very compact form, and this will be an additional benefit for us.

Note that in the examples below I will use annotations from the spring-data library, but its use is not necessary and the same program can be implemented without problems using a pure driver for MongoDB.

I have prepared two classes that map a user model to a MongoDB collection. The first one uses the classic approach: with a set containing article URLs:

@Document(collection = "users-plain")
data class PlainUser(
   @Id val id: ObjectId,
   val visitedUrls: Set<String>
)

While in the second, instead of a set, I use a Bloom Filter, which is serialized to a binary form, mapped to an array of bytes:

@Document(collection = "users-bf")
data class BloomFilterUser(@Id val id: ObjectId, 
    val visitedUrlsFilter: ByteArray)

I decided to start the tests with the assumption that the average length of URL is 50 characters (according to SEO recommendations, links with length of 50-60 characters are optimal), and each document has an average of 10 of them. I will perform the first test for 100,000 users.

The following code shows how data are being generated:

for (i in 1..usersCount) {

   //Generate URLs:
   val urls = (1..urlsCount).map { 
       RandomStringUtils.randomAlphanumeric(urlLength) 
   }.toSet()

   //Store a user with URLs in a set:
   val nextPlainUser = PlainUser(ObjectId.get(), urls)
   plainUserRepository.save(nextPlainUser)

   //Store a user with URLs as Bloom Filter:
   val bloomFilter = BloomFilter.create(
       Funnels.stringFunnel(
           Charset.defaultCharset()), urlsCount, fpp)

   urls.forEach{
       bloomFilter.put(it)
   }

   val out = ByteArrayOutputStream()
   bloomFilter.writeTo(out)

   val nextBFUser = BloomFilterUser(ObjectId.get(),
      out.toByteArray())

   bloomFilterUserRepository.save(nextBFUser)
}

Of course, the code can be written in a more clean (and efficient by using bulk inserts) way, but I wanted to keep it simple, so I decided to use a single loop with everything in it.

Variables usersCount, urlsCount and urlLength are obvious and define the parameters of the test which I mentioned above, however, the variable fpp needs clarification - this is the "false positive probability" which tells what is the probability that the filter will return true for an object that has not actually been put in to it. We can control this factor, but of course, the smaller the fpp, the less will be the profit from using the Bloom Filter because its storage size will get bigger.

Results

To compare the two methods, we will use the storageSize property that can be found in db.collection.stats() output, so it will reflect the compressed size.

Let's start with some low numbers: urlsCount = 10 so will assume that we can save only last 10 viewed articles.

The following table shows the size of the collections modeled by both methods (plain and using Bloom filters), the last column shows the size reduction we achieved:

Users count	Array [B]	Bloom Filter [B]	Diff [%]
100K	45633536	2420736	-94,7
500K	218083328	11304960	-94,8
1M	442531840	22593536	-94,8

You can see that by using filters we can achieve a size reduction of 95%. Great result, now let's see what happens when we set urlsCount = 50

Users count	Array [B]	Bloom Filter [B]	Diff [%]
100K	217296896	7933952	-96,4
500K	1068204032	38817792	-96,4
1M	2135564288	77930496	-96,4

The result has further improved, although only slightly.

Now let's see what effect the value of the fpp parameter has, we will perform a test for:urlsCount = 50 and users = 100K

fpp	Bloom Filter [B]	Diff [%]
0.001	11337728	+42,9
0.01	7933952	-
0.05	5005312	-36,9

According to the documentation, increasing the uncertainty of the result has the effect of further reducing the size and we confirmed it with the above test.

However, it should be noted here that 5% of false positives is already quite a lot, so it is worth thinking carefully before further increasing the value of this parameter.

Summary

MongoDB use very advanced tools to store data in the most compact way. It should be more than enough for the vast majority of our daily projects. In some very special cases, however, you can consider using even more advanced data structures that will allow us to reduce the size of the database to just a few percent of the initial size. We have demonstrated that using Bloom filters it is possible to achieve excellent results with a reasonable level of uncertainty about the stored data.

DEV Community