Laura for One Beyond

A MongoDB data storing refactor story

Over the last few months my team and I have been working on a microservice architecture for an e-learning platform. One of the services is in charge of translating packages (books) from a given XML DITA structure into a series of contents in our custom JSON format, and of sending the deltas of this content through a message broker, so that its current state is available on a content API, ready to be retrieved by the front-end.

To start, I’ll briefly explain the structure found on the packages we digest, as well as the requirements we have.

The package structure

A book (what we call a package) can contain the following contents:

  • Maps: structural information containing other maps and/or topics.
  • Topics: structural information containing one or more particles.
  • Particles: educational pills and learning assessments.

(JSON package tree structure)
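As a rough sketch of this model (the names below are illustrative, not our actual schema), the contents and their deltas could be represented like this:

```ts
// Hypothetical shapes for the package contents and their deltas.
type ContentType = "map" | "topic" | "particle";
type DeltaType = "CREATED" | "UPDATED" | "DELETED";

interface Content {
  id: string;            // content identifier inside the package
  packageId: string;     // the book this content belongs to
  type: ContentType;
  children?: string[];   // ids of nested maps/topics (maps and topics only)
  body?: unknown;        // the translated JSON payload (particles)
}

interface Delta {
  contentId: string;
  packageId: string;
  type: DeltaType;
  timestamp: Date;
  payload?: unknown;     // new state for CREATED/UPDATED deltas
}
```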

Every time a content changes, we must keep track of it. For this, we need to store three types of deltas: creations, deletions and updates.

The requirements

The service has to accomplish the following requirements:


  • 1. Import: New packages must be translated into JSON, and their deltas published.
  • 2. Reimport: Editors should be able to go back to any given version of a package.
  • 3. Reindex: We should keep track of all the deltas for each content, so that the content API can be repopulated if an inconsistency appears between the two services.

Note that we are using a MongoDB instance on Azure CosmosDB, which we found has some limitations when it comes to running updateMany or deleteMany queries, because of the way it shards the collections.
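To illustrate the kind of workaround this can force, a bulk update over a sharded collection may need to be broken into per-document operations, with the shard key included in every filter. A minimal sketch with the Node.js mongodb driver (collection, field and shard key names are assumptions):

```ts
import { Collection } from "mongodb";

// Instead of a single updateMany across shards, resolve the matching ids
// first and update them one by one through a bulkWrite.
async function markPackageDeleted(
  contents: Collection<{ _id: string; packageId: string; deleted?: boolean }>,
  packageId: string
): Promise<void> {
  const ids = await contents
    .find({ packageId }, { projection: { _id: 1 } })
    .map((doc) => doc._id)
    .toArray();

  await contents.bulkWrite(
    ids.map((_id) => ({
      updateOne: {
        filter: { _id, packageId }, // packageId doubles as the shard key here
        update: { $set: { deleted: true } },
      },
    }))
  );
}
```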

Knowing this, let’s go through the different approaches we implemented, and the problems we found along the way.

First attempt: all deltas in one content document

Our first approach was to create one document in the database collection for each content (map, topic or particle), and include an events array with the deltas of that content.
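In rough terms (again with illustrative names, not our actual schema), a content document and the write that appends a delta to it could look like this:

```ts
import { Collection } from "mongodb";

// Delta as sketched earlier in the post.
interface Delta {
  contentId: string;
  type: "CREATED" | "UPDATED" | "DELETED";
  timestamp: Date;
  payload?: unknown;
}

// One document per content, with every delta appended to an events array.
interface ContentDocument {
  _id: string;          // contentId
  packageId: string;
  events: Delta[];      // grows with every change to this content
}

async function appendDelta(
  contents: Collection<ContentDocument>,
  packageId: string,
  delta: Delta
): Promise<void> {
  await contents.updateOne(
    { _id: delta.contentId },
    { $setOnInsert: { packageId }, $push: { events: delta } },
    { upsert: true }
  );
}
```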

Adding a helper field

With this structure, searching for the last event of every content led to very slow queries. For this reason, we included a lastImport object on each content, containing a reference to the last event saved in the array, to speed up the queries that didn’t need the DELETED contents.
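With a helper field like that kept in sync on every write, reading the current state of a package no longer has to scan each events array. A sketch of what such a query could look like (field names are illustrative):

```ts
import { Collection, Document } from "mongodb";

// "Current state" reads become a flat query over the helper field instead
// of unwinding every content's events array.
async function findCurrentContents(
  contents: Collection<Document>,
  packageId: string
): Promise<Document[]> {
  return contents
    .find({ packageId, "lastImport.type": { $ne: "DELETED" } })
    .toArray();
}
```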

The problem we faced with this approach, apart from the long storing times, was that the events array was going to grow every time a change was applied to the content it referred to, so the document could eventually reach MongoDB’s 16 MB document limit.

Second attempt: one document per event

We had to solve the problem of the growing events array, so we decided to switch to storing one document per event for each content.

This fixed the document size limit issue, but we still had to solve the slow queries when inserting and retrieving deltas.
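For context on why retrieval was slow: getting the latest delta of every content in a package from a pure event collection typically means sorting all its events and keeping the first one per content, which is expensive without a supporting index. A sketch of such a query (names are illustrative):

```ts
import { Collection, Document } from "mongodb";

// One document per delta: the latest event per content is obtained by
// sorting on timestamp and grouping by contentId.
async function latestEventPerContent(
  events: Collection<Document>,
  packageId: string
): Promise<Document[]> {
  return events
    .aggregate([
      { $match: { packageId } },
      { $sort: { timestamp: -1 } },
      { $group: { _id: "$contentId", lastEvent: { $first: "$$ROOT" } } },
    ])
    .toArray();
}
```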

Time improvements via indexing

To speed up the process we decided to investigate how useful it would be to index different fields of the collection. We triggered a reindex and a reimport against four collections (each with a different indexed field) and got these results:

(Time for the reindex and reimport processes with collections with different indexes)

Looking at the results, we decided to add the timestamp index, as we saw a significant reduction in the time spent on the reindex, and no difference in the reimport time.
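Creating that index is a one-liner with the Node.js driver (the descending direction here simply matches the sort in the sketch above; for a single-field index MongoDB can traverse it either way):

```ts
import { Collection, Document } from "mongodb";

// Index the field the reindex queries sort on.
async function ensureTimestampIndex(
  events: Collection<Document>
): Promise<string> {
  return events.createIndex({ timestamp: -1 });
}
```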

Third attempt: storing the translations, not the deltas

Despite this small time improvement, we were still unsatisfied with the results. We wanted to significantly reduce the time spent on imports, as the service was expected to process 50 products a day.

To solve this, we completely changed the storing and processing paradigm: we now translate and store each incoming package as a whole, and let the service calculate and publish the deltas for each package on the fly.

This way, we significantly reduce the storing time, as no deltas are stored, only the package translation. At the same time, we still keep the full translation history, so we can go back and restore a previous version, calculating the deltas on the fly whenever we want (reimport).
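Conceptually, the deltas can be derived by diffing two consecutive translations. A simplified sketch of the idea (not our actual diff, and assuming each translation is a map of content id to translated JSON):

```ts
type DeltaType = "CREATED" | "UPDATED" | "DELETED";

interface Delta {
  contentId: string;
  type: DeltaType;
  payload?: unknown;
}

// Diff two translations into the deltas to publish.
function calculateDeltas(
  previous: Record<string, unknown>,
  next: Record<string, unknown>
): Delta[] {
  const deltas: Delta[] = [];

  for (const [contentId, payload] of Object.entries(next)) {
    if (!(contentId in previous)) {
      deltas.push({ contentId, type: "CREATED", payload });
    } else if (JSON.stringify(previous[contentId]) !== JSON.stringify(payload)) {
      deltas.push({ contentId, type: "UPDATED", payload });
    }
  }

  for (const contentId of Object.keys(previous)) {
    if (!(contentId in next)) {
      deltas.push({ contentId, type: "DELETED" });
    }
  }

  return deltas;
}
```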

We only store translations: what about the reindex?

The only loose end at this point was the reindex, since we would have to calculate the deltas for all the events that had occurred since the package was created.

To solve this, every time a translation was published we calculated and stored a complete history of the deltas (the completeDeltas field), so we could easily trigger the reindex by searching for the last publication of that package and publishing those completeDeltas.
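One way to maintain such a history is to fold the newly calculated deltas into the previous completeDeltas on every publication, so the latest snapshot alone is enough for a reindex. A sketch of that idea, keeping only the latest delta per content (one possible interpretation; the names reuse the Delta shape sketched above):

```ts
interface Delta {
  contentId: string;
  type: "CREATED" | "UPDATED" | "DELETED";
  payload?: unknown;
}

// Merge the deltas of this publication into the accumulated history,
// keeping only the most recent delta per content.
function mergeCompleteDeltas(completeDeltas: Delta[], newDeltas: Delta[]): Delta[] {
  const byContent = new Map<string, Delta>();
  for (const delta of [...completeDeltas, ...newDeltas]) {
    byContent.set(delta.contentId, delta);
  }
  return [...byContent.values()];
}
```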

Mongo limits trouble again: Azure Blobs to the rescue

While testing the new implementation with a set of real packages, we ran into an old problem: documents were reaching Mongo’s 16 MB limit, not only when storing the completeDeltas, but also with just the translation of some big packages.

We realised we wouldn’t be able to store the translations if we kept using Mongo, so we had two options: switch to a relational DB, where the limit for a field is around 1 GB, and hope that no package ever reaches that size, or change where we store the contents and the completeDeltas.

We are now storing the translations in Azure Blob Storage, and referencing the JSON translation URL in the package translations collection, together with the path of the original XML content.
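With the @azure/storage-blob SDK, storing a translation and keeping only its URL in Mongo could look roughly like this (container, path and field names are illustrative):

```ts
import { BlobServiceClient } from "@azure/storage-blob";
import { Collection, Document } from "mongodb";

async function storeTranslation(
  blobService: BlobServiceClient,
  packages: Collection<Document>,
  packageId: string,
  version: number,
  translation: unknown,
  originalXmlPath: string
): Promise<void> {
  const container = blobService.getContainerClient("translations");
  const blob = container.getBlockBlobClient(
    `${packageId}/${version}/translation.json`
  );

  // Upload the translated JSON to the blob.
  const body = JSON.stringify(translation);
  await blob.upload(body, Buffer.byteLength(body), {
    blobHTTPHeaders: { blobContentType: "application/json" },
  });

  // The Mongo document only references the blob, so it stays far below 16 MB.
  await packages.insertOne({
    packageId,
    version,
    translationUrl: blob.url,
    originalXmlPath,
    createdAt: new Date(),
  });
}
```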

Also, the latest completeDeltas array is stored in the blob, and we overwrite the old version with the new one each time we publish the package, since we only need the last version for the reindex. The blob is organised as follows:

(Blob storage organisation)

With this new approach, translations take less than a minute and publications no longer than 5 minutes, while we can ensure that every incoming XML version is translated and stored without overloading the process.
