Graeme Robinson for MongoDB

MongoDB Design Reviews: how applying schema design best practices resulted in a 60x performance improvement

The transition from working with legacy relational database systems to NoSQL databases such as MongoDB requires developers to change the way they model and represent data if they are to realize the full benefits of making the switch.

Whilst MongoDB has sometimes been referred to — incorrectly — as “schemaless,” the reality is that schema design is every bit as important in MongoDB as it is in any other database system, and the choices you make as a schema designer and data modeler will make or break your application’s performance in MongoDB just as much as, if not more than, they will in any traditional RDBMS.

As a developer advocate on the MongoDB Strategic Accounts team, I assist customer development teams who are transitioning existing workloads or creating new workloads in MongoDB, by providing tailored data modeling sessions, also known as “design reviews.” During these sessions, we review the customer’s specific workload and provide feedback, advice, and suggestions on how best to model the data in their workload for optimal performance.

At the end of the design review, customers will have a framework schema design tailored to their workload based on best practice schema design patterns developed by MongoDB over the years.

In this article, we discuss a specific design review and show how, through a combination of schema design and query optimizations, the session resulted in the customer seeing a 60x improvement in the performance of one of their aggregation pipelines and allowed their application to meet its SLA targets.

The customer portfolio activity application

A few months ago, we received a request from a customer in the financial services industry to carry out a design review of an application they were building in MongoDB Atlas. Among other things, the application was designed to provide regional account managers with aggregated performance data for each stock in a customer’s portfolio over a requested time period, in a given region.

When the customer contacted us, the aggregation pipeline they had designed to generate the data was taking between 20 and 40 seconds to complete, whereas the application SLA called for a sub-two-second response time. The database design involved in this aggregation was relatively simple, consisting of only two collections.

The documents in the first collection contained customer information, including the region to which the customer belonged, and an array of stock symbols with one entry for each stock in their portfolio.

{
  "_id": "US4246774937",
  "region": "US",
  "firstname": "Jack",
  "lastname": "Bateman",
  "portfolio": [
    "NMAI",
    "CALA",
    "MNA"
  ]
}

The documents in the second collection contained stock data covering a minute of trading activity, with one document being produced for each stock. The information in these documents included the stock symbol, the volume of shares of that stock traded, the opening price, high price, low price, closing price, and start and finish timestamps for the covered minute.

{
  "_id": {
    "$oid": "63e15e9ad0c75e43cd1831db"
  },
  "symbol": "CALA",
  "volume": 10464,
  "opening": 0.14,
  "high": 0.14,
  "low": 0.14,
  "closing": 0.14,
  "start": {
    "$date": "2023-02-06T19:54:00.000Z"
  },
  "end": {
    "$date": "2023-02-06T19:55:00.000Z"
  }
}

Entity relationship diagrams showing a many-to-many relationship between documents in a Customer collection and documents in the Stock Data collection. Documents in the Customer collection contain an array of stock_symbols, linking them to the corresponding documents in the Stock Data collection

The aggregation pipeline was being executed against the customer collection, with its output designed to provide the following:

“For each customer in a selected region, provide the opening price, trading volume, high price, low price, and closing price for a specified time period for each stock in that customer’s portfolio.”

To provide this output, the aggregation pipeline had been defined with five stages:

  1. An initial $match stage that selected only documents for customers in the desired region

  2. An $unwind stage that would duplicate the selected customer documents once for each stock in the customer’s portfolio — i.e. if the customer had 10 stocks in their portfolio, this stage would create 10 documents — one for each stock — replacing the original document that had all 10 stocks listed in an array

  3. A $lookup stage that for each document created by the prior $unwind stage, would perform a second aggregation pipeline against the stock data collection, taking the stock symbol and date range as input and returning the aggregated data for that stock

  4. A second $unwind stage to flatten the array created by the prior $lookup stage (which, in this case, always contained only one entry) down to an embedded object

  5. A final $group stage to recombine the documents for each customer back into a single document

The output document for a customer with two stocks in their portfolio would look like this:

{
  "_id": "US1438229432",
  "region": "WEST",
  "firstname": "Otto",
  "lastname": "Cast",
  "portfolio": [
    "ISR",
    "CTS",
  ],
  "stockActivity": [
    {
      "opening": 0.42,
      "high": 0.42,
      "low": 0.4003,
      "closing": 0.4196,
      "volume": {
        "$numberLong": "40611"
      },
      "symbol": "ISR"
    },
    {
      "opening": 42.7,
      "high": 42.98,
      "low": 41.62,
      "closing": 42.93,
      "volume": {
        "$numberLong": "45294"
      },
      "symbol": "CTS"
    }
  ]
}

An examination of the explain plan for the pipeline showed that the two $match stages — one in the main pipeline and one in the sub-pipeline within the $lookup stage — were both correctly using the indexes set up to support the pipeline. This eliminated missing or incorrectly defined indexes — one of the most common sources of performance issues we see in MongoDB — as the source of the issues in this case.
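For illustration, that check looked something like the following. The index definitions here are assumptions inferred from the queries described in this article (a single-field index on region for the customer collection, and a compound index on symbol and the start timestamp for the stock data collection); the actual index names and definitions in the customer’s cluster may have differed, and the pipeline variable stands in for the aggregation shown later in this article.

// Assumed indexes supporting the two $match stages (inferred from the queries
// described in this article, not copied from the customer's cluster)
db.customers.createIndex({ region: 1 });
db.stockData.createIndex({ symbol: 1, start: 1 });

// Inspect the explain output to confirm the indexes are being used,
// where `pipeline` holds the aggregation stages shown later in this article
db.customers.explain("executionStats").aggregate(pipeline);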

Assessing the workload

Whenever we design a data model or schema for MongoDB, best practice calls for starting by understanding and quantifying the target workload. At the start of the design review, we were able to ascertain the following:

  • The application contained data on approximately 10,000 customers, evenly distributed across six regions within the United States.

  • Each customer had, on average, 10 stocks listed in their portfolio, with the highest number of stocks for any one customer being 20.

  • Activity data for approximately 16,000 stocks was being tracked.

  • Approximately two million stock activity records were being generated daily. Not every stock produced an update every minute, and data was only being collected for the eight-hour US market day, Monday through Friday.

  • During market hours, approximately 4,200 new stock updates were received each minute.

  • Stock activity data for the last four complete quarters, plus the current quarter to-date, was being maintained. This translated to approximately 650 million stock activity documents taking up around 35 GB of storage.

  • Data more than four complete quarters old was being purged from the system, so data volumes were pretty stable.

  • The data produced by the aggregation pipeline was being used by managers to produce end-of-day, end-of-month, and end-of-quarter reports. Data was not being requested for periods of less than a day, and the data for the current day only became available after the US markets closed.

  • Reports were being generated by managers, on average, 150 times per day.

  • The reports were not being used to make real-time trading decisions.

The application was being run on a MongoDB Atlas M40 three-node cluster, and the metrics above indicated that this workload should not be excessive for a cluster of that size. The only question mark was whether, at 16 GB, there was sufficient memory to maintain an adequately sized working set and handle the pipeline requests without data being swapped to disk.

With an understanding of the nature and scale of the workload established, we then turned our attention to the structure of the aggregation pipeline.

Assessing the aggregation pipeline

The aggregation pipeline, as originally designed by the application development team, looked as follows:

[
  {$match: {region: "WEST"}},
  {$unwind:{path: "$portfolio"}},
  {$lookup:{
    from: "stockData",
    let: {
      symbol: "$portfolio",
      start: ISODate("2022-11-07T00:00:00.000+00:00"),
      end: ISODate("2022-11-08T00:00:00.000+00:00")
    },
    pipeline: [
      {$match:{
        $expr:{ $and: [
          {$eq: ["$symbol", "$$symbol"]},
          {$gte: ["$start", "$$start"]},
          {$lt: ["$end", "$$end"]},
        ]}
      }},
      {$group:{
        _id: "$symbol",
        opening: {$first: "$opening"},
        high: {$max: "$high"},
        low: {$min: "$low"},
        closing: {$last: "$closing"},
        volume: {$sum: "$volume"}
      }},
      {$set:{
        "symbol": "$_id",
        "_id": "$$REMOVE"
      }}
    ],
    as: "stockData"
  }},
  {$unwind: {path: "$stockData"}},
  {$group:{ 
      _id: "$_id",
      region:{$first: "$region"},
      firstname:{$first: "$firstname"},
      lastname:{$first: "$lastname"},
      portfolio:{$addToSet: "$portfolio"},
      stockActivity:{$push: "$stockData"}
  }}
]

On running a test query retrieving data for a one-day window of trading activity for all customers in the “WEST” region, we saw a response time of just under 29 seconds.

There were two items in this pipeline that immediately caught our attention.

Firstly, the pipeline used an $unwind stage to allow the subsequent $lookup stage to be run once for each stock in each customer’s portfolio. In fact, this $unwind, and the subsequent reconstruction of the data in the final $group stage, was unnecessary: if an array is passed to a $lookup stage as the localField value, the $lookup will automatically match the foreign documents against each entry in the array. Refactoring the pipeline to take this approach reduced it to two stages: the initial $match stage and the subsequent $lookup stage. The revised pipeline looked like this:

[ 
 {$match:{region: "WEST"}},
 {$lookup:{
    from: "stockData",
    localField: "portfolio",
    foreignField: "symbol",
    let: {
       start: ISODate(
          "2022-11-07T00:00:00.000+00:00"
       ),
       end: ISODate(
          "2022-11-08T00:00:00.000+00:00"
       ),
    },
    pipeline: [
      {$match:{
        $expr:{ $and: [
          {$gte: ["$start", "$$start"]},
          {$lt: ["$end", "$$end"]},
        ]}
      }},
      {$group:{
        _id: "$symbol",
        opening: {$first: "$opening"},
        high: {$max: "$high"},
        low: {$min: "$low"},
        closing: {$last: "$closing"},
        volume: {$sum: "$volume"}
      }},
      {$set:{
        "symbol": "$_id",
        "_id": "$$REMOVE"
      }}
    ],
    as: "stockActivity",
 }}
]

Eliminating the $group stage in particular proved beneficial, and re-running our test query gave a response time of 19 seconds. This was a significant improvement, but still well short of the target sub two-second response time.

The second issue we saw in the pipeline was the use of the $lookup stage. $lookup essentially carries out the equivalent of what would be a left outer join in a relational database. Joins in any database system — relational or NoSQL — are computationally expensive operations. One of the key benefits of MongoDB’s document model is that it allows us to avoid joins through the use of embedding and hierarchical documents. In this case, however, the application development team had correctly identified that embedding the stock activity documents in each customer’s document would lead to excessively sized documents and huge arrays — both MongoDB anti-patterns. Data denormalization and some level of duplication to improve query performance are often encouraged in MongoDB. However, in this workload, with write operations outnumbering read operations by a considerable margin, the extent and ongoing update cost of duplicating the stock activity data into the customer documents made it a poor trade-off.

Although embedding the stock activity documents in the customer documents was ruled out, examining exactly what was happening in the $lookup stage was revealing in terms of why the pipeline was taking as long as it was to execute. For example, running the pipeline to generate data for one calendar quarter for all customers in the WEST region produced the following metrics:

  1. The initial $match stage returned 1,725 customer documents.

  2. With an average of 10 stocks in each customer’s portfolio, the subsequent $unwind stage expanded the number of documents in the pipeline to 18,214.

  3. The $lookup stage was then executed once for each of those 18,214 records.

  4. For each execution of the lookup stage, one calendar quarter of data for a given stock needed to be aggregated. This resulted in approximately 25,000 one-minute stock activity records needing to be aggregated during each of the 18,214 executions of the $lookup sub-pipeline.

  5. As the same stock could appear in multiple customers’ portfolios, in many cases, the $lookup sub-pipeline was being executed for the same stock multiple times.

During actual execution, the MongoDB aggregation engine is able to apply some optimizations, in particular caching the results of prior runs of the $lookup stage so they can be reused by subsequent runs with the same parameters, so the real cost was not quite as high as these numbers at first suggest. Even so, a great deal of work was being executed, some of it duplicative.

With this understanding, the next stage in our design review was to look to see how schema design patterns could be applied to optimize the pipeline performance.

Applying schema design patterns — the computed pattern

The first issue we looked to address was the number of stock activity documents being aggregated by each execution of the $lookup sub-pipeline.

Stock activity documents were being written to the database on a minute-by-minute basis, but during our workload assessment at the start of the design review, we determined that users were never querying at anything less than per-day granularity. With this in mind, we decided to investigate if the computed design pattern could be applied.

The computed design pattern emphasizes pre-calculating and saving commonly requested data so that the same calculation is not repeated each time the data is requested. In our case, the pipeline was repeatedly aggregating the same per-minute data into daily, monthly, quarterly, or yearly totals. So, we decided to see what impact pre-calculating those totals and storing them in a new collection, and having the $lookup pipeline access those pre-calculated values, would have.

To do this, we suggested adding the following processes to the application:

  1. At the end of each US trading session, the per-minute documents for each stock would be aggregated to give a “daily” document with the trading volume and opening, closing, high, and low prices for that stock. These “daily” documents would be stored in a new collection and the per-minute documents deleted from the original collection, meaning it would never contain more than one day’s worth of per-minute documents. (A sketch of this roll-up follows the list below.)

  2. At the start of each month, the “daily” documents for each stock would be aggregated to give a “monthly” document for each stock. The “monthly” documents would be stored in the same new collection as the daily documents.

  3. At the start of each quarter, the “monthly” documents for each stock would be aggregated to give a “quarterly” document for each stock. The “quarterly” documents would also be stored in the same new collection as the daily and monthly documents.

  4. In order to differentiate between the types of documents in the new collection, they would include a “type” field with a value of “D”, “M”, or “Q” for “daily”, “monthly”, or “quarterly” respectively. This, along with the stock symbol and the starting date for the period covered, would form a compound _id value for each document.

  5. Data for the four prior complete quarters plus the current quarter to date would be maintained. At the start of each new quarter, data for the oldest quarter would be deleted, preventing the size of the collection and its associated indexes from growing indefinitely.
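As a rough illustration of the end-of-day roll-up described in step 1, the following sketch aggregates one trading day’s per-minute documents into “daily” documents and writes them to the new collection. The collection names (stockData for the per-minute data and stockDataEnhanced for the pre-computed data) follow the examples used elsewhere in this article; the scheduling, error handling, and the monthly and quarterly roll-ups are omitted, and this is a sketch rather than the team’s actual implementation.

// Sketch only: roll up one trading day's per-minute documents into "daily"
// documents in the pre-computed collection
const dayStart = ISODate("2022-11-07T00:00:00.000Z");
const dayEnd   = ISODate("2022-11-08T00:00:00.000Z");

db.stockData.aggregate([
  // Per-minute documents for the trading day being closed out
  { $match: { start: { $gte: dayStart, $lt: dayEnd } } },
  // $first/$last rely on sorted input to pick the true opening and closing prices
  { $sort: { start: 1 } },
  { $group: {
      _id: { symbol: "$symbol", time: dayStart, type: "D" },
      opening: { $first: "$opening" },
      high:    { $max: "$high" },
      low:     { $min: "$low" },
      closing: { $last: "$closing" },
      volume:  { $sum: "$volume" }
  }},
  // Write the daily documents into the pre-computed collection
  { $merge: { into: "stockDataEnhanced", whenMatched: "replace", whenNotMatched: "insert" } }
]);

// Once the daily documents are written, the per-minute documents can be removed
db.stockData.deleteMany({ start: { $gte: dayStart, $lt: dayEnd } });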

An example of the new document design would look like this:

{
  "_id": {
    "symbol": "MDB",
    "time": {"$date": "2022-11-06T00:00:00.000Z"},
    "type": "D"
  },
  "closing": 218.51,
  "high": 218.9599,
  "low": 216.0501,
  "opening": 218.7,
  "volume": 336998
}

With these changes in place, it would be possible to form a query for any range of days, months, or quarters in the dataset. The metrics for the new collection designs were encouraging too. Tracking data for the same 16,000 stocks as before:

  • The new collection would contain roughly 5.4 million documents at most (i.e., just before a quarter end); a rough arithmetic check of this figure follows the list below. This compared with roughly 640 million per-minute documents in the original stock activity collection.

  • The original stock activity collection, which would still be used to gather the per-minute updates, would only ever hold a maximum of two million documents (for the current day’s updates), rather than the 640 million documents previously.

  • A “worst case” query requesting data for the complete date range on the day prior to a quarter end would require 30 “daily” documents, two “monthly” documents, and four “quarterly” documents — a total of 36 documents — to be aggregated per stock. Compare this with the approximately 154,000 documents that would need to be aggregated per stock to do the same calculation using per-minute documents.
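As a sanity check on that 5.4 million figure, and assuming roughly 63 US trading days per quarter with daily documents retained for the full five-quarter window (both assumptions on our part), the arithmetic works out as follows:

// Approximate document counts per stock over the five-quarter retention window
// (~63 trading days per quarter is an assumption, not a figure from the team)
const dailyDocs     = 5 * 63;  // ~315 daily documents
const monthlyDocs   = 15;      // ~15 monthly documents
const quarterlyDocs = 5;       // 4 complete quarters plus the current quarter
const perStock  = dailyDocs + monthlyDocs + quarterlyDocs;  // ~335 documents per stock
const totalDocs = perStock * 16000;                         // ~5.36 million documents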

Modified to use this new structure, the pipeline now looked as follows:

[
  {$match:{region: "WEST"}},
  {$lookup:{
    from: "stockDataEnhanced",
    localField: "portfolio",
    foreignField: "_id.symbol",
    pipeline: [
      {$match: {
        $and: [
          {"_id.type": "D"},
          {"_id.time": ISODate("2022-11-07T00:00:00.000+00:00")}
        ]
      }},
      {$sort:{"_id.time": 1}},
      {$group:{
        _id: "$symbol",
        opening: {$first: "$opening"},
        high: {$max: "$high"},
        low: {$min: "$low"},
        closing: {$last: "$closing"},
        volume: {$sum: "$volume"}
      }},
      {$set:{
        symbol: "$_id",
        "_id": "$$REMOVE"
      }}
    ],
    as: "stockActivity",
  }}
]

Executing the revised pipeline gave a response time of 1800 ms — below our two-second target SLA! However, the design review team felt there were further improvements that could be made.

Applying schema design patterns — the extended reference and single collection patterns

Having resolved the problem of the large number of documents being aggregated by each execution of the $lookup sub-pipeline, and having simultaneously reduced the overall size of the database by almost 98%, we turned our attention to the other significant issue in the original pipeline: in any given execution, the same aggregation calculations could be carried out multiple times for a given stock.

To address this, we revisited our understanding of the relationships in the data and how we wanted to represent them. This step, together with quantifying the workload and its access patterns and applying best practice schema design patterns, forms the basis of our approach to data modeling in MongoDB.

In this case, we were starting our pipeline with a $match stage on the customer documents to find all the customers in a given region because we needed their portfolio information and that’s where it was stored — a seemingly logical design.
However, looking at the way the data was actually being accessed, we realized that if each pre-computed stock activity document carried a list of the regions for which that stock needed to be reported, we could initiate the pipeline against the pre-computed stock activity collection instead and, importantly, aggregate the data for each stock only once. The set of regions for a given stock is simply the set of regions of all the customers holding that stock in their portfolios.

Embedding the region data in the pre-computed stock activity documents like this is a variant of the extended reference schema design pattern. This pattern emphasizes embedding a subset of fields from related documents in a parent document so that all the related data can be retrieved with a single query, avoiding the need for a $lookup-based join. Rather than embedding the entire child document, the pattern encourages embedding only those fields needed to satisfy a query predicate or included in the query’s return value. This helps keep the overall size of the parent document within reasonable limits.

Using the extended reference pattern comes at the cost of needing to propagate changes to child data to multiple parent documents, so the pattern is particularly useful when the referenced data does not change often. In highly normalized RDBMS designs, it is not uncommon to see joins being repeatedly performed to lookup tables that contain values that haven’t changed, in some cases, in decades.

In our workload, the pattern imposed the cost of possible updates to the regions each stock was associated with whenever a customer’s portfolio changed. But as this happened relatively rarely, the cost was deemed acceptable given the potential query performance improvements.
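As an illustration of what that propagation might look like, the sketch below adds a region to the pre-computed activity documents for a stock when a customer in that region adds the stock to their portfolio. The collection name stockDataEnhanced and the shape of the regions array follow the examples in this article; the application-side trigger and the reverse case (removing a region when the last customer there drops the stock) are not shown.

// Sketch: a customer in the WEST region has just added "MDB" to their portfolio,
// so ensure every pre-computed activity document for MDB lists the WEST region.
// $addToSet keeps the regions array free of duplicate entries.
db.stockDataEnhanced.updateMany(
  { "_id.symbol": "MDB" },
  { $addToSet: { regions: { region: "WEST" } } }
);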

Applying these changes, the pre-computed stock activity documents now looked like this:

{
  "_id": {
    "symbol": "MDB",
    "time": {"$date": "2022-11-06T00:00:00.000Z"},
    "type": "D"
  },
  "closing": 218.51,
  "high": 218.9599,
  "low": 216.0501,
  "opening": 218.7,
  "regions": [
    {"region": "WEST"},
    {"region": "NORTH_EAST"},
    {"region": "CENTRAL"}
  ],
  "volume": 336998
}

The next question was, if we started the pipeline from the stock activity data, how would we then tie this back to the customer data? The application team’s first thought was to have another $lookup stage. But on further review, we suggested they instead utilize the polymorphic nature of MongoDB collections and store the customer documents within the same collection as the pre-computed stock activity data using a single-collection schema design pattern.

Single-collection patterns emphasize storing documents of varying types that are related and accessed together within the same collection. By using a common set of attributes across all of the document types in the collection, and indexing those attributes appropriately, all the related documents can be retrieved with a single database operation, saving on network round trips and marshaling/unmarshaling overheads.

In our case, we elected to add customer documents to the stock activity collection using the following document shape:

{
  "_id": {
    "customerID": "US4246774937",
    "symbol": "NMAI",
    "type": "C"
  },
  "firstname": "Jack",
  "lastname": "Bateman",
  "portfolio": [
    "NMAI",
    "PUBM",
    "MNA"
  ],
  "regions": [
    {
      "region": "US"
    }
  ]
}

The key things to note with these documents are the “symbol” and “type” fields within the compound “_id” object, and the customer’s region being moved to a regions array. This made the field names and data types consistent with the stock activity daily, monthly, and quarterly documents. Note also that we added one customer document for each stock within the customer’s portfolio. This allowed effective indexing of the data in the collection at the expense of some duplication of data. However, as the customer data changed relatively infrequently, this was deemed acceptable.
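The exact index the team created is not shown here, but a compound index along the following lines, covering the regions array together with the type and time fields inside the compound _id, would support the $match stage of the pipeline shown below:

// An illustrative index for the single-collection design; the actual index
// used by the team may have differed
db.stockDataEnhanced.createIndex({
  "regions.region": 1,
  "_id.type": 1,
  "_id.time": 1
});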

With these changes in place, we could now define a pipeline that both avoided repeated stock aggregation calculations, and avoided the use of an expensive $lookup stage. The stages in the revised pipeline were:

  1. A $match stage to find all documents where the regions array included the target region and either the “_id.type” field was “C” (a customer document), or the combination of “_id.type” and “_id.time” indicated a stock activity document for the time period being calculated. (The match stage query could be updated to include whatever combination of quarterly, monthly, and daily activity documents was needed to cover the requested time period.)

  2. A $group stage to aggregate the stock activity data for each stock, and also build an array of customer documents for each stock. As part of the data aggregation, an array of opening and closing prices from each of the individual activity documents being aggregated was built, relying on the index on the collection to ensure these were added to each array in chronological order.

  3. A $set stage to replace the opening price array and closing price array built by the prior $group stage with the first and last entry in each array respectively to give the overall opening and closing price for the requested time period for each stock.

  4. Finally, an $unwind and $group stage combination to reorganize the data by customer rather than stock, and to shape it to our desired output design.

The revised pipeline, run against the pre-calculated stock activity collection, now looked like this:

[
  {$match: {
    $and: [
      {"regions.region": "WEST"},
      {$or:[
        {"_id.type": "C"},
        {
          "_id.type": "D",
          "_id.time": ISODate("2023-02-07T00:00:00.000+00:00")
        }
      ]}
    ]
  }},
  {$group:{
    _id: "$_id.symbol",
    volume: {$sum: "$volume"},
    opening: {$push: "$opening"},
    high: {$max: "$high"},
    low: {$min: "$low"},
    closing: {$push: "$closing"},
    customers: {
      $addToSet: {
        $cond: {
          if: {$eq: ["$_id.type", "C"]},
          then: "$$ROOT",
          else: "$$REMOVE"
        }
      }
    }
  }},
  {$set:{
    closing: {$last: "$closing"},
    opening: {$first: "$opening"}
  }},
  {$unwind: {path: "$customers"}},
  {$group:{
    _id: "$customers._id.customerID",
    region: {
      $first: {
        $getField: {
          field: "region",
          input: {$arrayElemAt: ["$customers.regions",0]}
        }
      }
    },
    firstname: {$first: "$customers.firstname"},
    lastname: {$first: "$customers.lastname"},
    portfolio: {$first: "$customers.portfolio"},
    stockActivity: { $addToSet: {
      symbol: "$_id",
      volume: "$volume",
      opening: "$opening",
      high: "$high",
      low: "$low",
      closing: "$closing",
    }}
  }}
]

A final test execution of this version of the pipeline gave a response time of 377 ms — more than four times faster than the application target response time.

Conclusion

As is our normal practice for design reviews, we had provided the application development team with a list of questions regarding the nature and sizing of their workload ahead of time, and they came to the session well prepared.

With this preparation completed, the design review session itself lasted around an hour, during which we went through our standard data modeling process of:

  • Assessing the workload and its access patterns.
  • Reviewing the relationships in the data.
  • Applying best practice schema design patterns.

By the end of the session, we had collectively improved the performance of the aggregation pipeline by 60x compared with its original design, comfortably exceeding the application’s target SLA, while also significantly reducing the application’s storage requirements. Everyone agreed it had been a very productive session.

Think your team could benefit from a design review session with data modeling experts from MongoDB? Please reach out to your account representative to find out more about booking a session with our data modeling experts, either virtually or at select MongoDB .local events in your city!

If you would like to learn more about MongoDB data modeling and aggregation pipelines, the MongoDB documentation and the MongoDB University data modeling courses are good places to start.
