In this series of articles, I show how we can reduce database size and improve its performance just by using the right structures and proper data types. In the previous post, we looked at how the _id
field type affects the size of the database. The results were promising, so it's worth taking a broader look at the topic. So what else can we do to reduce the size of data on disk?
Simple and complex structures
Let's move on to more complex data. Our models are often built from smaller structures that group data, such as a user, an order or an offer. I decided to compare the size of documents storing such structures with their flat counterparts:
{"_id": 1, "user": {"name": "Jan", "surname": "Nowak"}}
and
{"_id": 1, "userName": "Jan", "userSurname": "Nowak"}
In both cases we store identical data; the documents differ only in their schema. Take a look at the result:
There is a slight reduction of 0.4MB. It may not seem like much compared to the effect we achieved for the field containing an ID. However, we have to bear in mind that in this case we were dealing with a more complex document. Besides the compared fields, it contained a numeric identifier that, as we remember from previous experiments, takes up about 5MB of disk space.
Taking this into account, the above results show a decrease from 3.4MB to 3MB. As a percentage it looks better: we saved about 12% of the space needed to store the personal data.
Let's go back to the discussed documents for a moment:
{"_id": 1, "user": {"name": "Jan", "surname": "Nowak"}}
and
{"_id": 1, "userName": "Jan", "userSurname": "Nowak"}
A watchful eye will notice that I used longer field names in the flattened document. Instead of user.name and user.surname I wrote userName and userSurname. I did it automatically, a bit unconsciously, to make the resulting JSON more readable. However, if we managed to reduce the size of the data by changing only the document schema from compound to flat, maybe it is worth going a step further and shortening the field names?
I decided to add a third document for comparison, flattened and with shorter field names:
{"_id": 1, "name": "Jan", "surname": "Nowak"}
The results are shown in the chart below:
The result is even better than flattening alone. Leaving aside the document key, we achieved a decrease from 3.4MB to 2MB. Why does this happen, even though we store exactly the same data?
The reason for the decrease lies in the nature of NoSQL databases, which, unlike relational ones, do not have a schema defined at the level of the entire collection. If someone is very stubborn, they can store user data, offers, orders and payments in one collection. It would still be possible to read, index and search that data. This is because the database, in addition to the data itself, stores its schema with each document. Thus, the total size of a document consists of its data and its schema. And that explains the whole puzzle. By reducing the size of the schema, we also reduce the size of each document, i.e. the size of the final file with the collection data. It is worth taking this into account when designing a collection schema, so as not to blow it up too much. Of course, you cannot go to extremes, because that would lead us to fields named a, b and c, which would make the collection completely illegible to a human.
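The per-document cost of field names can be sketched with a simplified BSON size calculator: in BSON, each field name is stored as a null-terminated cstring next to its value. The function below is only a rough illustration, handles only the types used in our example documents, and assumes the _id is stored as a 32-bit integer; real sizes may differ slightly depending on how the driver encodes numbers.

```python
# Simplified BSON size calculator; handles only int32, string, null
# and embedded documents, following the layout from the BSON spec.
def bson_size(doc):
    size = 4 + 1  # int32 length prefix + trailing 0x00 terminator
    for name, value in doc.items():
        # Each element carries a type byte plus the field name as a
        # null-terminated cstring: this is the per-document "schema".
        size += 1 + len(name.encode("utf-8")) + 1
        if isinstance(value, dict):
            size += bson_size(value)                    # embedded document
        elif isinstance(value, str):
            size += 4 + len(value.encode("utf-8")) + 1  # length-prefixed string
        elif isinstance(value, int):
            size += 4                                   # assuming int32
        elif value is None:
            pass                                        # null has no value bytes
        else:
            raise TypeError(f"unsupported type: {type(value)!r}")
    return size

nested = {"_id": 1, "user": {"name": "Jan", "surname": "Nowak"}}
flat = {"_id": 1, "userName": "Jan", "userSurname": "Nowak"}
short = {"_id": 1, "name": "Jan", "surname": "Nowak"}

for label, doc in (("nested", nested), ("flat", flat), ("short", short)):
    print(label, bson_size(doc), "bytes")  # nested 58, flat 55, short 47
```

Even on a single document you can see the same ordering as in the charts: the flat variant is a few bytes smaller than the nested one, and short field names save noticeably more.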
For very large collections, however, this approach is used; one example is the MongoDB operation log (oplog), which contains fields named:
- h
- op
- ns
- o
- o2
- b
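If you want the readability of full names in application code while storing abbreviated names on disk, one common approach is to translate at the persistence boundary. The alias map below is purely hypothetical; adapt the names to your own model.

```python
# Hypothetical alias map: long, readable names in application code,
# short names in the stored documents.
ALIASES = {"name": "n", "surname": "s", "phone": "p"}
REVERSE = {short_name: long_name for long_name, short_name in ALIASES.items()}

def to_storage(doc):
    # Rename fields before inserting; unknown keys pass through unchanged.
    return {ALIASES.get(key, key): value for key, value in doc.items()}

def from_storage(doc):
    # Restore readable names after reading the document back.
    return {REVERSE.get(key, key): value for key, value in doc.items()}

stored = to_storage({"_id": 1, "name": "Jan", "surname": "Nowak"})
print(stored)  # {'_id': 1, 'n': 'Jan', 's': 'Nowak'}
```

Some object-document mappers offer this kind of field aliasing out of the box, so it is worth checking your driver or ODM before rolling your own.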
Empty fields
While we are on the subject of the document schema, it is worth looking at the problem of empty fields. In the case of JSON, the absence of a value can be expressed in two ways: either directly, by writing null as the field's value, or by not serializing the field at all. I prepared a comparison of documents with the following structure:
{ "id" : 1 }
and
{ "id" : 1, "phone" : null }
The meaning of the data in both documents is identical: the user has no phone number. However, the schema of the second document differs from the first, because it contains two fields.
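Per the BSON specification, a field set to null still carries its type byte and its name; only the value part is empty. A rough sketch of the per-document overhead, using the phone field from the example above:

```python
def null_field_overhead(field_name):
    # 1 type byte (0x0A for null) + name as a cstring (bytes + 0x00);
    # the null value itself occupies zero bytes.
    return 1 + len(field_name.encode("utf-8")) + 1

per_doc = null_field_overhead("phone")
print(per_doc, "bytes per document")            # 7
print(per_doc * 1_000_000, "bytes per million")  # 7000000, before compression
```

Uncompressed, that is about 7MB per million documents; the storage engine's block compression shrinks it on disk, but as the measurement below shows, the nulls still cost real space.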
Here are the results:
The results are quite surprising: storing a million nulls is quite expensive, as it takes more than 1MB of disk space.
That's it for today. In the next entry we will take a look at enumerations and unused fields. If you find this topic interesting, please let me know with a like/unicorn, so I know it's worth continuing.