ahmed khaled

Best Practices in NoSQL Database Design

NoSQL database designers employ a distinct approach to database design compared to traditional relational database designers. When opting for a document database, designers and application developers prioritize scalability and flexibility. While ensuring data consistency remains important, they willingly accept additional responsibilities to prevent data anomalies in exchange for these benefits. For instance, if there are redundant copies of customer addresses in the database, an application developer might implement a customer address update function that updates all instances of an address. Consequently, developers are inclined to write more code in order to avoid anomalies in a document database, reducing the need for extensive database tuning and query optimization in the future.

To enhance performance in document data modeling and application development, minimizing the reliance on joins becomes paramount. This optimization technique is commonly referred to as denormalization. The underlying concept involves storing data that is frequently accessed together within a single data structure, such as a table in a relational database or a document in a document database.

Denormalization

To illustrate the benefits of denormalization, let's consider a simple example involving order items and products. In the original design, the Order_Items entity has attributes such as order_item_ID, order_id, quantity, cost_per_unit, and product_id. The Products entity, on the other hand, includes attributes like product_ID, product_description, product_name, product_category, and list_price.

Here is an example of an order item document:

{
  "order_item_ID": 834838,
  "order_ID": 8827,
  "quantity": 3,
  "cost_per_unit": 8.50,
  "product_ID": 3648
}


And here is an example of a product document:

{
  "product_ID": 3648,
  "product_description": "1 package laser printer paper. 100% recycled.",
  "product_name": "Eco-friendly Printer Paper",
  "product_category": "office supplies",
  "list_price": 9.00
}


If you implemented two separate collections and maintained these distinct documents, you would need to query the order items collection to retrieve the desired order item, and then perform another query on the products collection to obtain information about the product with product_ID 3648. This approach would involve two lookup operations to gather the necessary details for a single order item.
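
To make the cost concrete, here is a minimal sketch of those two lookups in application code, assuming a MongoDB database accessed through PyMongo; the collection names order_items and products and the connection details are illustrative, not prescribed by the design above:

from pymongo import MongoClient

# Illustrative connection; database and collection names are assumptions.
db = MongoClient("mongodb://localhost:27017")["store"]

# First lookup: fetch the order item document.
order_item = db.order_items.find_one({"order_item_ID": 834838})

# Second lookup: fetch the product it references by product_ID.
product = db.products.find_one({"product_ID": order_item["product_ID"]})

print(order_item["quantity"], product["product_name"], product["list_price"])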

By denormalizing the design, you can create a collection of documents that require only one lookup operation. A denormalized version of the order item collection could be structured as follows:

{
  "order_item_ID": 834838,
  "order_ID": 8827,
  "quantity": 3,
  "cost_per_unit": 8.50,
  "product": {
    "product_description": "1 package laser printer paper. 100% recycled.",
    "product_name": "Eco-friendly Printer Paper",
    "product_category": "office supplies",
    "list_price": 9.00
  }
}


By incorporating the product details directly within the order item document, you eliminate the need for an additional lookup. This denormalized approach streamlines the retrieval process, resulting in improved efficiency and reduced query complexity.
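
With the embedded product sub-document, the same sketch needs only a single query:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["store"]  # illustrative connection

# Single lookup: the product details arrive embedded in the order item.
item = db.order_items.find_one({"order_item_ID": 834838})

print(item["quantity"], item["product"]["product_name"], item["product"]["list_price"])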

Avoid Overusing Denormalization

While denormalization can offer performance benefits, it should be used judiciously to avoid excessive redundancy and the inclusion of extraneous information in denormalized collections. The primary objective is to store data that is frequently accessed together within a document, enabling the database to minimize the frequency of reads from persistent storage, which can be relatively slow even with SSDs.

However, it is essential to strike a balance and avoid including unnecessary or irrelevant data in denormalized collections. Including extraneous information can lead to increased storage requirements, decreased query performance, and potential inconsistencies if the denormalized data is not properly maintained. Therefore, careful consideration should be given to determine which data elements are truly essential for efficient retrieval and meet the specific needs of the application.

By keeping denormalized collections focused on the relevant data that is frequently accessed together, developers can maximize the benefits of denormalization while avoiding the pitfalls of excessive redundancy and unnecessary data inclusion. This approach ensures optimized performance, reduced storage overhead, and consistent data integrity within the document database.


How Much Denormalization Is Too Much?

When designing a document database, it is crucial to consider the specific queries the application will issue. In this scenario, we have identified two types of queries: generating invoices and packing slips for customers (constituting 95% of queries) and generating management reports (constituting 5% of queries).

For invoices and packing slips, certain fields are necessary, such as order_ID, quantity, cost_per_unit, and product_name. However, product description, list price, and product category are not needed for these queries. Therefore, it would be more efficient to exclude these fields from the Order_Items collection. The revised version of the Order_Items document would appear as follows:

{
  "order_item_ID": 834838,
  "order_ID": 8827,
  "quantity": 3,
  "cost_per_unit": 8.50,
  "product_name": "Eco-friendly Printer Paper"
}


To retain the relevant product details, a separate Products collection can be maintained. Here is an example of a document in the Products collection:

{
  "product_ID": 3648,
  "product_description": "1 package laser printer paper. 100% recycled.",
  "product_name": "Eco-friendly Printer Paper",
  "product_category": "office supplies",
  "list_price": 9.00
}


Although the product_name field appears redundantly in both the Order_Items collection and the Products collection, this design choice enables application developers to retrieve the required information for the majority of their queries with a single lookup operation. While this approach may slightly increase storage usage, it optimizes query performance and enhances the efficiency of retrieving data for invoicing and packing slip generation.
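
To illustrate, an invoice or packing-slip line can now be served by one lookup against Order_Items that projects only the fields it needs, while the occasional management report still reads the Products collection (a sketch, using the same hypothetical PyMongo setup as above):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["store"]  # illustrative connection

# 95% case: one lookup, projecting only the invoice/packing-slip fields.
invoice_line = db.order_items.find_one(
    {"order_ID": 8827},
    {"_id": 0, "quantity": 1, "cost_per_unit": 1, "product_name": 1},
)

# 5% case: management reports read list_price and product_category
# from the separate products collection.
report_row = db.products.find_one({"product_ID": 3648})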

Say No to Joins

While best practices, guidelines, and design patterns provide valuable guidance for building scalable and maintainable NoSQL applications, it is important not to adhere to them dogmatically. It is essential to consider the specific requirements and characteristics of your application. If breaking established best practices can offer improved performance, increased functionality, or better maintainability, it may be worth considering alternative design choices.

If storing related information in multiple collections is the best fit for your application, you can implement joins in your application code. However, it is crucial to be aware of the potential performance implications, especially when dealing with large collections. Joining two large collections with a naive nested loop can lead to significant execution times: if the first collection contains 100,000 documents and the second contains 500,000 documents, the inner comparison executes 50,000,000,000 times.
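
A minimal sketch of such a nested-loop join in application code, assuming both collections have already been read into Python lists:

def nested_loop_join(order_items, products):
    # order_items and products are lists of dicts read from the database.
    joined = []
    for item in order_items:            # outer loop: N documents
        for product in products:        # inner loop: M documents
            if item["product_ID"] == product["product_ID"]:
                joined.append({**item, "product": product})
    return joined                       # roughly N * M comparisons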

To optimize joins and reduce the overall number of operations performed, various techniques can be employed. These include utilizing indexes, filtering, and sorting. By leveraging indexes, you can speed up data retrieval by efficiently narrowing down the relevant documents. Filtering can help further refine the data set, and sorting can improve the join process in specific scenarios.
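
One way to apply that idea in application code, sketched below, is to build an in-memory lookup table keyed on the join field; it plays much the same role as an index, replacing the inner scan with a single dictionary lookup per order item:

def hash_join(order_items, products):
    # Index the products by product_ID once, up front.
    products_by_id = {p["product_ID"]: p for p in products}

    joined = []
    for item in order_items:
        product = products_by_id.get(item["product_ID"])  # O(1) lookup
        if product is not None:                           # filter out misses
            joined.append({**item, "product": product})
    return joined   # roughly N + M operations instead of N * M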

In summary, while breaking established best practices should be done with caution, it is essential to prioritize the specific requirements and characteristics of your application. If alternative approaches can provide superior performance, functionality, or maintainability, it may be appropriate to deviate from traditional design patterns and leverage optimization techniques such as indexes, filtering, and sorting to improve the efficiency of joins and reduce the number of overall operations performed.


Normalization is a valuable technique for reducing the risk of data anomalies, while denormalization serves a different purpose, primarily focused on improving query performance. In the context of document databases, denormalization is frequently used by data modelers and developers, much as relational data modelers employ normalization in their designs.

Another crucial consideration when designing documents and collections is the potential for document size to change over time. Documents that are prone to size changes are referred to as mutable documents. This aspect is worth noting because changes in document size can impact storage utilization, query performance, and overall system efficiency.

When mutable documents undergo frequent updates that modify their size, several factors come into play. These include the need to allocate additional storage space for the updated document, the potential fragmentation of data within the collection, and the impact on disk I/O and memory consumption during read and write operations.

To address the challenges associated with mutable documents, it is important to consider strategies such as document versioning, efficient update operations, and potential data reorganization techniques. Document versioning allows for tracking and managing changes to documents, enabling historical analysis and ensuring data integrity. Efficient update operations involve optimizing the way document updates are performed to minimize the impact on storage and performance. Data reorganization techniques, such as compaction or defragmentation, can be employed periodically to reclaim wasted space and improve overall storage efficiency.
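
As a small illustration of document versioning, each update can set the new field values and increment a version counter in the same operation (a sketch, assuming the hypothetical PyMongo setup from earlier and a customers collection holding the redundant address copies mentioned at the start):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["store"]  # illustrative connection

# Update the address and bump the version counter atomically, so readers
# and background maintenance jobs can detect that the document changed.
db.customers.update_one(
    {"customer_ID": 1012},
    {
        "$set": {"address": "42 Main Street, Springfield"},
        "$inc": {"version": 1},
    },
)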

By considering the potential for document size changes and implementing appropriate strategies, developers can mitigate the challenges associated with mutable documents in document database designs, ensuring optimal performance and efficient resource utilization.

Mutable Documents

Things change. Things have been changing since the Big Bang. Things will most likely continue to change. It helps to keep these facts in mind when designing databases.

Some documents will change frequently, and others will change infrequently. A document that keeps a counter of the number of times a web page is viewed could change hundreds of times per minute. A table that stores server event log data may only change when there is an error in the load process that copies event data from a server to the document database. When designing a document database, consider not just how frequently a document will change, but also how the size of the document may change. Incrementing a counter or correcting an error in a field will not significantly change the size of a document. However, consider the following scenarios:

• Trucks in a company fleet transmit location, fuel consumption, and other operating metrics every three minutes to a fleet management database.
• The price of every stock traded on every exchange in the world is checked every minute. If there is a change since the last check, the new price information is written to the database.
• A stream of social networking posts is streamed to an application, which summarizes the number of posts; the overall sentiment of the posts; and the names of any companies, celebrities, public officials, or organizations mentioned. The database is continuously updated with this information.

Over time, the number of data sets written to the database increases. How should an application designer structure the documents to handle such input streams? One option is to create a new document for each new set of data. In the case of the trucks transmitting operational data, this would include a truck ID, time, location data, and so on:

{
  "truck_id": "T87V12",
  "time": "08:10:00",
  "date": "27-May-2015",
  "driver_name": "Jane Washington",
  "fuel_consumption_rate": "14.8 mpg",
  ...
}

Each truck would transmit 20 data sets per hour, or, assuming a 10-hour operations day, 200 data sets per day. The truck_id, date, and driver_name would be the same for all 200 documents. This looks like an obvious candidate for embedding a document with the operational data in a document about the truck used on a particular day. This could be done with an array holding the operational data documents:

{
  "truck_id": "T87V12",
  "date": "27-May-2015",
  "driver_name": "Jane Washington",
  "operational_data": [
    {
      "time": "00:01",
      "fuel_consumption_rate": "14.8 mpg"
    },
    {
      "time": "00:04",
      "fuel_consumption_rate": "12.2 mpg"
    },
    {
      "time": "00:07",
      "fuel_consumption_rate": "15.1 mpg"
    },
    ...
  ]
}

The document would start with a single operational record in the array, and at the end of the 10-hour shift, it would have 200 entries in the array.

From a logical modeling perspective, this is a perfectly fine way to structure the document, assuming this approach fits your query requirements. From a physical model perspective, however, there is a potential performance problem. When a document is created, the database management system allocates a certain amount of space for the document. This is usually enough to fit the document as it exists plus some room for growth. If the document grows larger than the size allocated for it, the document may be moved to another location. This requires the database management system to read the existing document, copy it to another location, and free the previously used storage space.
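
The growth pattern that triggers this is the natural one: appending each new reading to the embedded array as it arrives, so the document keeps getting larger over the shift (a sketch, again assuming a PyMongo setup and a hypothetical truck_operations collection):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["store"]  # illustrative connection

# Every transmission appends one more reading, growing the document.
db.truck_operations.update_one(
    {"truck_id": "T87V12", "date": "27-May-2015"},
    {"$push": {"operational_data": {"time": "08:13:00",
                                    "fuel_consumption_rate": "14.6 mpg"}}},
)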

Avoid Moving Oversized Documents

One way to avoid this problem of moving oversized documents is to allocate sufficient space for the document at the time it is created. In the case of the truck operations document, you could create the document with an array of 200 embedded documents, with the time and other fields set to default values. When the actual data is transmitted to the database, the corresponding array entry is updated with the actual values. Consider the life cycle of a document and, when possible, plan for anticipated growth. Creating a document with sufficient space for the full life of the document can help to avoid I/O overhead.
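
A sketch of that pre-allocation approach, using the same assumed setup and collection name: the document is inserted at the start of the shift with 200 placeholder readings, and each transmission then overwrites one slot in place, so the document's size stays roughly constant.

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["store"]  # illustrative connection

# Pre-allocate the full-shift document with 200 placeholder readings.
db.truck_operations.insert_one({
    "truck_id": "T87V12",
    "date": "27-May-2015",
    "driver_name": "Jane Washington",
    "operational_data": [
        {"time": None, "fuel_consumption_rate": None} for _ in range(200)
    ],
})

# When, say, the 43rd reading arrives, overwrite slot 42 in place
# instead of growing the array (illustrative time and value).
db.truck_operations.update_one(
    {"truck_id": "T87V12", "date": "27-May-2015"},
    {"$set": {"operational_data.42": {"time": "10:16:00",
                                      "fuel_consumption_rate": "14.6 mpg"}}},
)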
