Hey guys 👋, I'm going to explain MongoDB data modelling theory, practices and ideas in this post. The next post will probably be about actually implementing this in real code.
For the purpose of this post, we're modelling a database for an e-commerce app. We have users, products, reviews and orders.
Data modelling, in simple terms, is creating a model or structure for data to be stored in a database. It is a representation of data and its associations or relations.
Data modelling is the process of taking unstructured data generated by a real-world scenario and structuring it into a logical data model.
Through data modelling, we structure the data. Keep in mind that there can be many ways to do this and there is not 1 correct solution.
- Identify different types of relationships between data
There are 3 different types -
A 1:1 relationship means that 1 field can only have 1 value. This type is not very important and is fairly straightforward. For example, in our app, 1 user can have only 1 name and email.
1:MANY is the most important category. It is so important that it has 3 sub-categories in MongoDB data modelling -
1:FEW means that 1 field can have a few values. For example, a user can have a few orders.
1:MANY means that 1 field can have many values. For example, a product can have many reviews.
1:TON means that 1 field can have thousands of values. An easy way to identify fields in this category is to check whether the field has the potential to scale to infinity. For example, say we have implemented logging for various features so we can analyse them. These logs can really pile up as the website scales.
It is often difficult to decide which category a field falls into (especially between 1:FEW and 1:MANY), and there is no well-defined way to do this. E.g. a product may have few or many reviews, depending on things like its price and popularity. The reviews can be categorised as either 1:FEW or 1:MANY, but you would not say 1:TON because it's quite rare for a product to collect that many reviews (even on a popular website like Amazon, most products never get more than 1,000 reviews). The goal here is to generalise as much as possible.
In a MANY:MANY relationship, both sides can be related to many of the other. For example, an actor can act in many movies and a movie can have many actors. When I was learning this, I struggled to visualise it. So, a quick diagram -
- Referencing VS Embedding (Datasets)
When we have two related datasets, we can either reference (normalise) or embed (denormalise) them. I'm going to use our previous example of movies and actors.
Let's assume that we reference these collections. The actors & the movies would be two separate collections. Every actor will have a reference to the movies he/she has worked in and every movie will have a reference to its actors. Here's a diagram -
As you can see, every actor has an array of IDs of movies that he/she has worked in and each movie has an array of IDs of actors that have acted in it.
The advantage of this is that the same actors/movies collections can be used by other collections. However, every time we query one of the collections, we only get IDs back for the other one, so we need an extra query to the other collection to fetch the referenced documents.
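To make the referenced layout a bit more concrete, here is a minimal sketch. Plain Python lists and dicts stand in for the `actors` and `movies` collections (no MongoDB driver or server involved). The field names, the sample data and the `find_by_ids` helper are all made up for illustration.

```python
# Plain dicts stand in for MongoDB documents (no driver or server involved).
actors = [
    {"_id": "a1", "name": "Actor One", "movie_ids": ["m1", "m2"]},
    {"_id": "a2", "name": "Actor Two", "movie_ids": ["m1"]},
]
movies = [
    {"_id": "m1", "title": "Movie One", "actor_ids": ["a1", "a2"]},
    {"_id": "m2", "title": "Movie Two", "actor_ids": ["a1"]},
]


def find_by_ids(collection, ids):
    """Hypothetical helper: return the documents whose _id is in ids."""
    wanted = set(ids)
    return [doc for doc in collection if doc["_id"] in wanted]


# Lookup 1: fetch the movie itself. It only carries actor IDs.
movie = find_by_ids(movies, ["m1"])[0]
# Lookup 2: resolve those IDs against the other collection.
cast = find_by_ids(actors, movie["actor_ids"])
cast_names = [actor["name"] for actor in cast]
print(cast_names)
```

Notice that getting the cast of one movie took two lookups. That is the cost of referencing.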
Let's say that we embed these collections instead. We have a single collection of movies, and we put the actors inside each movie. A diagram -
If we embed this, we only have to make 1 query to the database and our application will be faster. However, these actors can only be reached through the movies. No other collection can use them.
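Here is the embedded version as a rough sketch, again with plain Python dicts standing in for documents. The field names (`title`, `actors`, `role`) and the sample data are assumptions for illustration only.

```python
# One collection only: each movie document carries its actors inside it.
movies = [
    {
        "_id": "m1",
        "title": "Movie One",
        "actors": [
            {"name": "Actor One", "role": "Lead"},
            {"name": "Actor Two", "role": "Support"},
        ],
    },
]

# A single lookup returns the movie together with its whole cast.
movie = next(doc for doc in movies if doc["_id"] == "m1")
cast_names = [actor["name"] for actor in movie["actors"]]
print(cast_names)
```

There is no separate place to look up an actor here, which is exactly the trade-off described above.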
We use 3 criteria to make this decision, and we should combine all 3 to decide -
1) Relationship type (how the datasets are related)
Embed when there is a 1:FEW or 1:MANY relationship.
Reference when there is a 1:MANY, 1:TON or MANY:MANY relationship. This is because there is a lot of data and the maximum size per document (16 MB) in MongoDB might be exceeded.
NOTE:- Where the same relationship type appears in both lists (1:MANY), use the other criteria to decide.
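To get a feel for the 16 MB limit, here is some back-of-the-envelope arithmetic in Python. It assumes the IDs are 12-byte ObjectIds stored in a BSON array (where, as far as I understand the BSON layout, each element costs a type byte, its index as a string key and the value), and it ignores every other field in the document. The function names are made up for this sketch.

```python
MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's 16 MB limit per document


def element_bytes(index):
    """One array element: type byte + index as a C string + 12-byte ObjectId."""
    return 1 + len(str(index)) + 1 + 12


def id_array_bytes(count):
    """BSON size in bytes of an array holding `count` ObjectIds."""
    # 4-byte length prefix + elements + trailing null byte
    return 4 + sum(element_bytes(i) for i in range(count)) + 1


def max_ids_in_one_document():
    """Count how many ObjectIds fit before the array alone passes the limit."""
    size, count = 5, 0
    while size + element_bytes(count) <= MAX_DOC_BYTES:
        size += element_bytes(count)
        count += 1
    return count


print(id_array_bytes(1000))       # an array of 1,000 IDs
print(max_ids_in_one_document())  # where an ever-growing array stops fitting
```

Under these assumptions an array of 1,000 IDs is tiny compared to the limit, and it takes hundreds of thousands of IDs to hit it. So the limit is mostly a worry for 1:TON data, or when we embed whole documents instead of just IDs.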
2) Data access pattern (how often the data is accessed)
- Embedding
Embed when data is mostly read and not changed frequently. If we embed data that is read a lot, we only have to make 1 query, which improves performance, whereas with referencing we have to make 2 queries. We should embed only when we know the document will be used in just 1 place.
- Referencing
Reference when data is updated a lot. This is because it is more work for the database to update embedded data than to update a standalone document. We should also reference documents when the same document has to be used across various collections.
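The update cost is easier to see with a small sketch. Below, plain Python dicts stand in for documents, and the same two reviews are stored once with a referenced user and once with an embedded copy of the user. All names and sample data here (`users`, `reviews_referenced`, `reviews_embedded` and the two rename helpers) are hypothetical.

```python
# Referenced: the user lives in one place, reviews only store its ID.
users = {"u1": {"name": "Sam"}}
reviews_referenced = [
    {"_id": "r1", "user_id": "u1", "text": "Great"},
    {"_id": "r2", "user_id": "u1", "text": "Okay"},
]

# Embedded: every review carries its own copy of the user.
reviews_embedded = [
    {"_id": "r1", "user": {"name": "Sam"}, "text": "Great"},
    {"_id": "r2", "user": {"name": "Sam"}, "text": "Okay"},
]


def rename_referenced(user_id, new_name):
    """Update the single user document; returns how many documents changed."""
    users[user_id]["name"] = new_name
    return 1


def rename_embedded(old_name, new_name):
    """Update every embedded copy; returns how many documents changed."""
    changed = 0
    for review in reviews_embedded:
        if review["user"]["name"] == old_name:
            review["user"]["name"] = new_name
            changed += 1
    return changed


print(rename_referenced("u1", "Samantha"))
print(rename_embedded("Sam", "Samantha"))
```

With referencing, one write is enough no matter how many reviews exist. With embedding, the number of writes grows with the number of copies.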
3) Data closeness (how "much" data is accessed, how we want to query)
Embed when data really belongs together. For example, take guides and tours: a guide belongs to a tour.
Reference when both datasets are frequently queried on their own. For example, products and users. A user can buy a product but can also post a review. If we embed the user in a product, the reviews collection cannot access it.
When we do reference, there are 3 ways to do it -
1) Child referencing
The parent has an array of IDs of all its children, which are used to reference them. This is good for 1:FEW relationships. However, it is not good for 1:MANY and 1:TON relationships, as each time a new child is created the parent also needs to be updated with the ID of the new child. Also, we might hit MongoDB's 16 MB limit. Let's look at the next type, which solves this issue.
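A minimal sketch of child referencing, using the user and orders example from earlier (a 1:FEW relationship). Plain Python dicts stand in for documents, and the field names and the `add_order` helper are assumptions for illustration.

```python
# The parent (user) keeps an array with the IDs of its children (orders).
user = {"_id": "u1", "name": "Sam", "order_ids": ["o1", "o2"]}
orders = [
    {"_id": "o1", "total": 25},
    {"_id": "o2", "total": 40},
    {"_id": "o3", "total": 10},  # belongs to some other user
]


def add_order(parent, collection, order):
    """Hypothetical helper: creating a child means two writes."""
    collection.append(order)                   # write 1: the new child
    parent["order_ids"].append(order["_id"])   # write 2: the parent's array


add_order(user, orders, {"_id": "o4", "total": 15})
user_orders = [o for o in orders if o["_id"] in user["order_ids"]]
print([o["_id"] for o in user_orders])
```

The second write inside `add_order` is the extra update on the parent mentioned above, and `order_ids` is the array that keeps growing.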
2) Parent referencing
Each child has an ID that points to the parent. In this case, the parent knows nothing about its children.
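Here is the same idea flipped around, as a rough sketch with a product and its reviews. Plain Python dicts stand in for documents, and the field names (`product_id`, `rating`) and the `reviews_for` helper are made up for this example.

```python
# The parent (product) stores nothing about its children.
product = {"_id": "p1", "name": "Keyboard"}

# Each child (review) points back at its parent.
reviews = [
    {"_id": "r1", "product_id": "p1", "rating": 5},
    {"_id": "r2", "product_id": "p1", "rating": 3},
    {"_id": "r3", "product_id": "p2", "rating": 4},
]


def reviews_for(product_id):
    """Find the children by filtering on the parent's ID."""
    return [r for r in reviews if r["product_id"] == product_id]


# Adding a child is a single write; the parent is never touched.
reviews.append({"_id": "r4", "product_id": "p1", "rating": 4})
print([r["_id"] for r in reviews_for("p1")])
```

Because the parent document never grows, this layout keeps working when the number of children gets very large.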
3) Cross referencing
This is used in MANY:MANY relationships. Each side keeps an array of IDs of the other side, e.g. every actor has the IDs of their movies and every movie has the IDs of its actors.
Here's a diagram to visualise this -
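Cross referencing for a MANY:MANY relationship can be sketched like this, going back to the actors and movies example where both sides hold an array of the other side's IDs. Plain Python dicts stand in for the two collections, and the `link` helper is hypothetical.

```python
actors = {
    "a1": {"name": "Actor One", "movie_ids": []},
    "a2": {"name": "Actor Two", "movie_ids": []},
}
movies = {
    "m1": {"title": "Movie One", "actor_ids": []},
    "m2": {"title": "Movie Two", "actor_ids": []},
}


def link(actor_id, movie_id):
    """Hypothetical helper: record the relationship on both documents."""
    actor, movie = actors[actor_id], movies[movie_id]
    if movie_id not in actor["movie_ids"]:
        actor["movie_ids"].append(movie_id)
    if actor_id not in movie["actor_ids"]:
        movie["actor_ids"].append(actor_id)


link("a1", "m1")
link("a1", "m2")
link("a2", "m1")
print(actors["a1"]["movie_ids"], movies["m1"]["actor_ids"])
```

Both arrays have to be kept in sync on every change, which is why the helper always writes to both sides.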
Oohf, that's quite a long post 😅. Sorry for making it so lengthy but I wanted to cover all information. Hope you liked it!
Tip of the day - Frontity (a React framework) connects React to WordPress.