Originally posted on: Big ball of mud
I recently had to design a solution for MongoDB to store information that has time as the main axis of analysis. This information should be stored in a way that is easy to query and aggregate at many different time granularities (months, weeks, days, ...). It should also be stored in a way that does not consume too much disk space and is optimal for MongoDB to maintain. In a word, I needed to turn MongoDB into a time series database.
Time series
Let's start from the beginning. What is a time series? Citing Wikipedia:
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
The InfluxDB Key Concepts page gives us an extremely easy-to-understand example of what a time series is. Imagine two scientists who have to record the number of two types of insects in two different locations. Using a tabular view, we come across something like this.
time                  butterflies  honeybees  location  scientist
------------------------------------------------------------------
2015-08-18T00:00:00Z  12           23         1         langstroth
2015-08-18T00:00:00Z  1            30         1         perpetua
2015-08-18T00:06:00Z  11           28         1         langstroth
2015-08-18T00:06:00Z  3            28         1         perpetua
2015-08-18T05:54:00Z  2            11         2         langstroth
2015-08-18T06:00:00Z  1            10         2         langstroth
2015-08-18T06:06:00Z  8            23         2         perpetua
2015-08-18T06:12:00Z  7            22         2         perpetua
As you can see, the main dimension of this data mart is time: the events stored in it span through time. The basic time granularity is set to minutes in our example. This means that we cannot produce an analysis of these data with a granularity finer than the basic one (minutes).
Columns like butterflies and honeybees are called fields. Fields are made up of field keys and field values. Field keys (butterflies and honeybees) are strings and they store metadata: the field key butterflies tells us that the field values 12-7 refer to butterflies, and the field key honeybees tells us that the field values 23-22 refer to, well, honeybees.
Field values are the data; a field value, in a time series database, is always associated with a timestamp. The collection of field-key and field-value pairs makes up a field set.
Last, but not least, columns like location and scientist are called tags. Again, a tag is made up of a tag key and a tag value. We can think of tags as indexes that help access the time series. They are not mandatory, but they help a lot.
Now we have time series. But are these databases only a bunch of tables that force the presence of a timestamp? No: the main characteristic of a time series database is that it provides powerful tools to aggregate data (fields) over time. For example, if we need to know how many butterflies were counted by scientist "perpetua" in location 1 during the last year, it should be easy to retrieve this information from the database.
Time series in MongoDB
MongoDB is a general purpose, document-oriented database. This means that information inside the database is stored as documents. MongoDB uses BSON, a binary variant of JSON. A document in MongoDB looks like the following.
{
  _id: "yuebf09e-e0ewrewr-wererwer-324324edd",
  name: "Riccardo",
  surname: "Cardin",
  hobbies: ["computer science", "karate", "bass guitar"]
  // ...
}
Among the types available for field values we find strings, numbers, timestamps, arrays, documents, and so on. How can we design a document schema such that MongoDB can manage time series?
For a number of reasons that will be analyzed later in this article, the best way to treat time series in MongoDB is to use a subdocument for each level of aggregation we want to manage. Let's convert the above example about scientists and insects to MongoDB.
Suppose that you want to maintain minutes as the finest granularity over time. Then, imagine that you also want to give access to partially aggregated information by hours, days and months. The optimal document schema is the following.
{
  _id: "1234567890",
  // The document is indexed by month
  date: ISODate("2015-08-01T00:00:00.000Z"),
  // In the main document we find the tags
  location: 1,
  scientist: "langstroth",
  days: [
    // We have a subdocument in the days array for every day in the month
    {
      date: ISODate("2015-08-18T00:00:00.000Z"),
      // In each subdocument we find field keys and values
      butterflies: 23,
      honeybees: 51,
      hours: [
        // We have a subdocument for every hour in a day
        {
          date: ISODate("2015-08-18T00:00:00.000Z"),
          butterflies: 23,
          honeybees: 51,
          minutes: [
            {
              // And finally, we find the minutes granularity. Also here
              // we have a subdocument for each minute in an hour
              date: ISODate("2015-08-18T00:00:00.000Z"),
              butterflies: 12,
              honeybees: 23
            },
            // Omissis...
            {
              date: ISODate("2015-08-18T00:06:00.000Z"),
              butterflies: 11,
              honeybees: 28
            },
            // Omissis...
          ]
        },
        // Omissis...
      ]
    },
    // Omissis...
  ]
}
Such a big document, isn't it? As you can see, the trick here is to have a subdocument level for each granularity we need in our analysis. Tags are in the main document, let's say at level 0. Fields are partially aggregated at each level (1, 2, ...). The aggregation over time is determined by the value of the date property at each level. Documents are always complete. This means that we will find a subdocument for each minute / hour / day, whether its field values are 0 or not.
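Because the document is pre-allocated and complete, recording a reading boils down to addressing the right array positions. Here is a minimal sketch of a hypothetical helper (the name chunkFor is mine, not part of the original schema) that maps an event timestamp to the month key and the 0-based indexes used by the update statements shown later.
// Hypothetical helper: compute the month key and array indexes for a timestamp
function chunkFor(ts) {
  return {
    // Value of the top-level 'date' field (first day of the month)
    month: new Date(Date.UTC(ts.getUTCFullYear(), ts.getUTCMonth(), 1)),
    day: ts.getUTCDate() - 1,   // index in the 'days' array (0-based)
    hour: ts.getUTCHours(),     // index in the 'hours' array
    minute: ts.getUTCMinutes()  // index in the 'minutes' array
  };
}

// Example: 2015-08-18T00:06:00Z -> month 2015-08-01, day 17, hour 0, minute 6
chunkFor(ISODate("2015-08-18T00:06:00.000Z"))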
Why this? Why that?
So far so good. Now, the question is: why do we use this complex document schema? What are the pros and cons?
First of all, if we model our events using a 1:1 approach with respect to documents, we end up with one document per event.
{
  _id: "1234567890",
  date: ISODate("2015-08-18T00:00:00.000Z"),
  location: 1,
  scientist: "langstroth",
  butterflies: 12,
  honeybees: 23
},
{
  _id: "0987654321",
  date: ISODate("2015-08-18T00:06:00.000Z"),
  location: 1,
  scientist: "langstroth",
  butterflies: 11,
  honeybees: 28
}
While this approach is valid in MongoDB, it doesn't take advantage of the expressive nature of the document model. Moreover, with the subdocument approach MongoDB needs to access very few documents to aggregate results that span an interval, while the one-document-per-event approach forces it to touch one document per reading.
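For contrast, here is a minimal sketch of the same three-day butterflies count against a one-document-per-event model; the collection name events is an assumption, since the snippet above does not name one.
// Hypothetical 'events' collection, one document per reading (as above):
// the same query has to match and group one document per minute sample.
db.events.aggregate([
  {$match: {
    location: 1,
    scientist: "langstroth",
    date: {$gte: ISODate("2015-08-18T00:00:00.000Z"),
           $lte: ISODate("2015-08-20T00:00:00.000Z")}
  }},
  {$group: {_id: {location: "$location", scientist: "$scientist"},
            butterflies: {$sum: "$butterflies"}}}
])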
Another good question is why we use arrays for the days / hours / minutes information, instead of a dedicated JSON property for each element. Arrays work very well with the MongoDB Aggregation Framework. In particular, using the $unwind operator it is possible to flatten the internal structure of each document, which makes it easy to query information stored inside subdocuments too.
For example, using the following aggregation pipeline it is possible to easily retrieve the number of butterflies reported by scientist langstroth, in location 1, during the days between 2015-08-18 and 2015-08-20.
db.test.aggregate([
  {$match: {location: 1, scientist: 'langstroth'}},
  {$unwind: '$days'},
  {$match: {$and: [{'days.date': {$gte: ISODate("2015-08-18T00:00:00.000Z")}},
                   {'days.date': {$lte: ISODate("2015-08-20T00:00:00.000Z")}}]}},
  {$project: {_id: 0, location: 1, scientist: 1, butterflies: '$days.butterflies'}},
  {$group: {_id: {location: '$location', scientist: '$scientist'}, butterflies: {$sum: '$butterflies'}}}
])
In the example we use three levels of subdocuments, i.e. days, hours and minutes. Clearly, not all these levels are mandatory. However, splitting the data this way improves update performance: each array is smaller because it contains only a small piece of the whole information, so MongoDB can walk through it faster during an update.
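The multi-level structure also pays off when we need a different granularity on the query side: we just unwind one more level. A minimal sketch, assuming the same test collection, that returns the butterflies counted by langstroth in location 1 for each hour of 2015-08-18:
db.test.aggregate([
  {$match: {location: 1, scientist: 'langstroth'}},
  {$unwind: '$days'},
  {$match: {'days.date': ISODate("2015-08-18T00:00:00.000Z")}},
  {$unwind: '$days.hours'},
  // Each hour subdocument already holds the partially aggregated counters,
  // so we only need to project them out
  {$project: {_id: 0, hour: '$days.hours.date', butterflies: '$days.hours.butterflies'}}
])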
Another important thing is that the main document must be inserted into the collection in its full form, i.e. with all the levels of granularity already filled in. Clearly, at the beginning, all the field values in each subdocument will be equal to zero. However, this is an important requirement to take into consideration: this way, no update will cause an existing document to grow or be moved on disk, which lets MongoDB perform better on the collection.
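The construction of that "all zero" document is left to the reader. Here is a minimal sketch of a hypothetical helper that pre-allocates a whole month down to minute granularity (the helper name is mine; the top-level monthly counters are an assumption based on the update statement shown later).
// Hypothetical helper: build the pre-allocated "all zero" document
// for one month and one tag set (location, scientist)
function buildEmptyMonthDocument(monthDate, location, scientist) {
  var year = monthDate.getUTCFullYear();
  var month = monthDate.getUTCMonth();
  // Day 0 of the next month is the last day of this month
  var daysInMonth = new Date(Date.UTC(year, month + 1, 0)).getUTCDate();
  var days = [];
  for (var d = 1; d <= daysInMonth; d++) {
    var hours = [];
    for (var h = 0; h < 24; h++) {
      var minutes = [];
      for (var m = 0; m < 60; m++) {
        minutes.push({
          date: new Date(Date.UTC(year, month, d, h, m)),
          butterflies: 0,
          honeybees: 0
        });
      }
      hours.push({
        date: new Date(Date.UTC(year, month, d, h)),
        butterflies: 0,
        honeybees: 0,
        minutes: minutes
      });
    }
    days.push({
      date: new Date(Date.UTC(year, month, d)),
      butterflies: 0,
      honeybees: 0,
      hours: hours
    });
  }
  return {
    date: monthDate,
    location: location,
    scientist: scientist,
    // Monthly counters, incremented together with the nested ones
    butterflies: 0,
    honeybees: 0,
    days: days
  };
}

// Usage:
// db.test.insert(buildEmptyMonthDocument(ISODate("2015-08-01T00:00:00.000Z"), 1, "langstroth"))
Keep in mind that a month pre-allocated at minute granularity is a fairly large document, so this schema only works as long as it stays well below MongoDB's 16 MB document size limit.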
But this requirement opens an important issue in the management of time series using MongoDB: who is responsible for inserting the "all zero" document for each tag set into the collection?
Which came first, the chicken or the egg?
This is the real and central issue using MongoDB to model time series. We need a procedure that inserts the documents before we can use and update them.
First attempt: we can develop a process that periodically inserts those documents for us. Nice try, dude. However, this approach is not feasible when the domain of the tags is not known a priori. Returning to our example, imagine that your system is collecting butterflies and honeybees counts from scientists all over the world. It is impractical to know the names of all these scientists in advance.
Second attempt: try to take advantage of the $setOnInsert clause of an update (upsert) statement. From the MongoDB documentation we have:
If an update operation with upsert: true results in an insert of a document, then $setOnInsert assigns the specified values to the fields in the document. If the update operation does not result in an insert, $setOnInsert does nothing.
Bingo! We found it! We can put the whole document in a $setOnInsert clause the first time we try to update the collection. Nope. Due to a limitation explained in this Stackoverflow question, it is not possible to reference the same property both in the $setOnInsert clause and in the update operators (such as $inc) of the same statement. S**t!
Three-step initialization
So, have we reached a dead end, a cul-de-sac? At first sight it may seem so. Fortunately, my colleagues and I found a workaround. We can call it the three-step initialization. Just to recap: we want to be able to initialize the main document with all the required subdocuments properly set.
Remember that MongoDB guarantees the atomicity of operations on a single document. With this fact in mind we can operate in the following way:
- Try to update the document, properly incrementing the counters for the specified time chunk. Do not do any upsert, just an old-fashioned update operation. Remember that the execution of an update statement returns the number of documents written: if it is greater than zero, you're done.
- If the number of documents written by the update is zero, the document to update is not yet present in the collection. Try to insert the whole document for the specified tags, setting all the counters (field values) to zero. The execution of an insert statement also returns the number of documents written. If it returns zero or throws an exception, never mind: it means that some other process has already inserted the document for the same tags (see the note on the unique index after this list).
- Execute the same update again.
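A side note on step 2 (this is an assumption of mine, not stated in the original procedure): for a concurrent insert of the same pre-allocated document to fail instead of silently creating a duplicate, the collection needs a unique compound index on the month and the tags. It also supports the filters used by the updates and the aggregation pipelines. A minimal sketch:
// Hypothetical unique index: at most one document per (month, location, scientist).
// A concurrent insert of the same pre-allocated document then fails with a
// duplicate key error, which step 2 deliberately ignores.
db.test.createIndex(
  {date: 1, location: 1, scientist: 1},
  {unique: true}
)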
The code should look something like the following snippet. Here we want to add 1 to the butterflies and honeybees field values for the date 2015-08-01T00:06:00.000Z, with tags location 1 and scientist "langstroth".
// First of all, try the update
var result = db.test.update(
  {date: ISODate("2015-08-01T00:00:00.000Z"), location: 1, scientist: "langstroth"},
  {$inc: {
      butterflies: 1,
      honeybees: 1,
      "days.0.butterflies": 1,
      "days.0.honeybees": 1,
      "days.0.hours.0.butterflies": 1,
      "days.0.hours.0.honeybees": 1,
      "days.0.hours.0.minutes.6.butterflies": 1,
      "days.0.hours.0.minutes.6.honeybees": 1
    }
  },
  {upsert: false}
);
// If the update does not succeed, then try to insert the document
if (result.nModified === 0) {
  try {
    db.test.insert(/* Put here the whole document */);
  } catch (err) {
    console.log(err);
  }
  // Here we are sure that the document exists.
  // Retry the update statement
  db.test.update(/* Same update as above */);
}
Clearly, what makes the above procedure work is the guarantee of atomicity on single-document modifications. I know, the procedure is a little bit creepy, but we have not found anything better at the moment. Do you have any better idea? If so, explain it in the comments section. Thanks!
References
- Time series
- InfluxDB Key Concepts
- Schema Design for Time Series Data in MongoDB
- MongoDB for Time Series Data Part 1: Setting the Stage for Sensor Management
- MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Aggregation Framework and Hadoop
- MongoDB Aggregation Framework
- MongoDB Collection update: initialize a document with default values
Top comments (7)
Hello Riccardo
First of all congratulations for your post!
I want to ask you something. I am importing data into my MongoDB collection. Every file contains 105,000 rows and we have 11 files in total, so we have 1.2M rows.
The code I used for that is this:
The problem is that this is very slow. Do you know how I can achieve faster insertion?
Thanks in advance!
There is now native support for time series starting from MongoDB 5.0; did you test that yet? mongodb.com/developer/how-to/new-t...
Thank you very much for sharing this! 🎉 I've tried to build a Time Series of my own and had a drawback that I'd like to know what you think about. I wrote a post about it here
Hi, why don't you save one step by doing the insert right away?
According to the docs, insert won't return 1 if the document already exists, and so you can do the insert, then the update.
In fact, you might even not have to check the returned value of the insert, as it doesn't matter.
What do you think?
-- Constantin
Hello Constantin,
in my scenario the use case in which the document is already present is by far the most frequent. So, in the majority of cases, step one will be the only one the procedure takes.
Under these conditions, the approach described in the article should be the optimal one.
Cheers,
Riccardo
Riccardo, thank you so much for this precious advice.
I cannot tell you how useful it will be for us.
You're welcome!