The cover image is copyright Fabian Oefner from the Disintegrating II series. This is one of my favorite cars, the Ford GT40. In earlier days of CQRS, a web search for "cqrs" would auto-correct to "cars". The old CQRS blog humorously bears the subtitle: Did you mean cars?
Not terribly long ago, I had a great opportunity to implement a system using a few design patterns that I had been researching and playing with in my spare time. Fast forward to today. I've faced a few unforeseen problems and learned a few lessons from these. This post will address a specific piece: request/reply APIs.
I use a design pattern called Command/Query Responsibility Segregation (CQRS). For those not familiar with it, here is a quick summary of the API operations espoused by CQRS. I would also recommend checking out this article.
| | Returns data | Makes changes |
|---|---|---|
| Query | Yes | No |
| Command | No | Yes |
Why use this pattern? I like it for a couple of reasons. As a consumer of the API, I never have to worry that asking a question will have unintended consequences. Conversely, I know exactly which API calls make changes to the system. There is no ambiguity. This makes the API easy to reason about. But historically, this pattern evolved because read and write concerns are often very different. And trying to create a unified interface to do both has the typical problems of serving two masters. The single interface becomes progressively more confusing over time for either purpose. Given enough time it is likely to form a cargo cult. "Why are we updating this field? We don't use it." Response: "I don't know, but keep doing it or something might break."
So at its heart, CQRS is just a specific application of Separation of Concerns, aka good organization practices. Now that we've made introductions for the pattern, I'll go over some of our implementation details and lessons learned.
I consider every Query or Command to be a message. This means that any client system can represent them as ordinary data (classes or structs) with no methods, then transmit them easily in wire formats such as JSON, Cap'n Proto, or whatever. Every message also has a name, usually just the class/struct name, which uniquely identifies it within the API, such as `SearchCustomers` (a query) or `DeactivateCourse` (a command). Names are used to identify which operation was requested, then match it up with a message parser and a handling function. Security authorization can be as simple as keeping a list of which users are allowed to send which message names, then checking that list before processing any user's message. 🤘🤘
If you are familiar with RPC, you can also look at Messaging as a superset of that pattern. The message name being the "procedure" and the message contents being the procedure arguments.
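Here is a minimal Python sketch of that idea, not our actual code. The field names (`name_contains`, `course_id`), the user names, and the JSON envelope shape are all made up for illustration.

```python
import json
from dataclasses import dataclass, asdict

# Messages are plain data with no methods.
@dataclass(frozen=True)
class SearchCustomers:      # a query
    name_contains: str      # hypothetical field

@dataclass(frozen=True)
class DeactivateCourse:     # a command
    course_id: str          # hypothetical field

# The class name doubles as the message name within the API.
MESSAGE_TYPES = {cls.__name__: cls for cls in (SearchCustomers, DeactivateCourse)}

# Hypothetical permission list: which users may send which message names.
PERMISSIONS = {
    "alice": {"SearchCustomers", "DeactivateCourse"},
    "bob": {"SearchCustomers"},
}

def to_wire(message) -> str:
    """Serialize a message as JSON, tagged with its name."""
    return json.dumps({"name": type(message).__name__, "data": asdict(message)})

def from_wire(user: str, payload: str):
    """Check authorization by message name, then parse the message."""
    envelope = json.loads(payload)
    name = envelope["name"]
    if name not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} may not send {name}")
    return MESSAGE_TYPES[name](**envelope["data"])
```

Note that the authorization check only needs the message name. It never has to look inside the message contents.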
It may seem obvious how Commands and Queries should work. But there are some nuances that I discovered.
Well, queries are generally like you would expect. In particular, we handle them like this:
- API listens for
- Verify that user has
- Deserialize the query message
- Pass the query message off to its handler function, which will:
  - Validate the query message
  - Load and transform data from database
- Serialize and return the data
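The steps above can be sketched roughly like this in Python. The transport is left out entirely. The HTTP-like status codes, the in-memory `CUSTOMERS` list, the `ALLOWED` table, and the `name_contains` field are assumptions for illustration only.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchCustomers:
    name_contains: str  # hypothetical field

# Stand-in for a database table.
CUSTOMERS = [{"id": 1, "name": "Ada Lovelace"}, {"id": 2, "name": "Alan Turing"}]

def search_customers(query: SearchCustomers) -> list:
    # Validate the query message.
    if not isinstance(query.name_contains, str) or not query.name_contains.strip():
        raise ValueError("name_contains must be a non-blank string")
    # Load and transform data (here: filter an in-memory list).
    needle = query.name_contains.lower()
    return [c for c in CUSTOMERS if needle in c["name"].lower()]

# Query name -> (message type, handler function).
QUERIES = {"SearchCustomers": (SearchCustomers, search_customers)}
# Which users may send which query names.
ALLOWED = {"alice": {"SearchCustomers"}}

def handle_query(user: str, name: str, body: str) -> tuple:
    """Return (status, serialized body) for one query request."""
    if name not in QUERIES:
        return 404, json.dumps({"error": "unknown query"})
    if name not in ALLOWED.get(user, set()):
        return 403, json.dumps({"error": "not authorized"})
    message_type, handler = QUERIES[name]
    try:
        query = message_type(**json.loads(body))   # deserialize
        rows = handler(query)                      # validate, load, transform
    except (TypeError, ValueError) as err:
        return 400, json.dumps({"error": str(err)})
    return 200, json.dumps(rows)                   # serialize and return
```

Adding a new page-specific query in this shape means adding one message type, one handler, and one entry in `QUERIES`, without touching existing queries.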
We tend to create queries that are tailored for specific pages or to answer common questions. It feels like we follow an inverse DRY rule here. If I need a query for a page, I might use an existing query. But only if I do not have to change the existing query. If changes are needed, then it means the new page has a slightly different responsibility even though it displays most of the same data. So I will make a new query instead.
The purpose of a command is to perform some business operation on the system. In practice, we noticed a distinction based on whether the command needs to change one entity or several📌. How you handle multiple-entity changes is important for architectural reasons.
Entity in this case means something that is a logical unit. In heavily normalized tables, an entity might include a parent and any descendants of 1-to-many relationships. In DDD terminology, you might call this an aggregate. In event sourcing, this is an event stream.
You could execute multiple-entity changes in a single transaction to achieve all-or-nothing semantics. This approach is nice to work with in code, but it limits scalability. To be involved in a transaction, all affected entities have to be located on the same database node. If they are on different nodes, then a distributed transaction occurs (if supported by the database). And as load increases, distributed transactions will get progressively slower. Cross-entity transactions are a valid approach for internal business applications (or any application which is not likely to outgrow a single database node). But for publicly available internet services, perhaps not.
A more scale-friendly approach is to only use single-entity commands to make changes. When a use case requires changes to multiple entities, use a meta-command which makes no changes itself, but instead orchestrates and runs single-entity commands. I call the single-entity commands "basic commands", and the multiple-entity ones "workflows".
⚠️ These are not back-end workflows
Workflow commands are a convenience for the front-end. They typically consist of individual actions the user could take themselves through the UI. But instead of making the user go to multiple pages, we present a single form and roll up all the data needed into a workflow command. These are best effort and time-boxed (due to being request/reply), so failure typically just results in an incomplete workflow. The user can then retry it or fix the remaining items individually. These workflows are not meant to replace back-end processes or to provide robust handling of failure cases.
You could implement a Workflow on the client side -- have the UI orchestrate all the necessary basic commands. However, I choose to make them API-side for one main reason: clarity (of security especially). I'll illustrate with a real example from our system. We have a Trainer role. This role is not allowed to Create Courses. However, they can Record Training they provided to employees. Part of the Record Training use case may include creating a new course with limited options. By executing the Record Training use case as an API workflow, it can be expressed as a single granular permission. "Trainers can Record Training but not Create Courses." As in, one box is checked on the permissions UI, but the other isn't.
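To make that concrete, here is a rough Python sketch of an API-side workflow guarded by a single permission. It is not our production code. `RecordAttendance`, the field names, and the in-memory stores are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CreateCourse:            # basic command
    course_id: str
    title: str

@dataclass(frozen=True)
class RecordAttendance:        # hypothetical basic command
    course_id: str
    employee_id: str

@dataclass(frozen=True)
class RecordTraining:          # workflow command
    course_id: str
    course_title: str
    employee_ids: tuple

COURSES = {}        # stand-in storage
ATTENDANCE = set()

def create_course(cmd: CreateCourse) -> None:
    COURSES.setdefault(cmd.course_id, cmd.title)

def record_attendance(cmd: RecordAttendance) -> None:
    if cmd.course_id not in COURSES:
        raise ValueError("unknown course")
    ATTENDANCE.add((cmd.course_id, cmd.employee_id))

def record_training(cmd: RecordTraining) -> dict:
    """Makes no changes itself; runs basic commands on a best-effort basis."""
    steps = [(create_course, CreateCourse(cmd.course_id, cmd.course_title))]
    steps += [(record_attendance, RecordAttendance(cmd.course_id, e))
              for e in cmd.employee_ids]
    failed = []
    for handler, basic in steps:
        try:
            handler(basic)
        except ValueError as err:
            failed.append((type(basic).__name__, str(err)))
    return {"ok": not failed, "failed": failed}

# Only these message names are exposed by the API in this sketch.
HANDLERS = {"CreateCourse": create_course, "RecordTraining": record_training}
# One checkbox per message name: Trainers get the workflow, not CreateCourse.
ROLE_PERMISSIONS = {
    "Trainer": {"RecordTraining"},
    "Admin": {"CreateCourse", "RecordTraining"},
}

def dispatch(role: str, cmd):
    name = type(cmd).__name__
    if name not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"{role} may not send {name}")
    return HANDLERS[name](cmd)
```

The permission check happens once, on the workflow's name. The basic commands it runs internally are not checked against the Trainer role, which is what lets a Trainer create a course only as part of recording training.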
To do the same thing from the client side, we would need to add a basic command: Create Trainer Course. Then admin users would have to be informed: "To give someone permission to record training, you have to check `Create Trainer Course` and `Permission X` and `Permission Y`." So client-side workflows like this are a documentation/end-user-training burden. We could also create a fake command just for permission purposes, which maps to the required basic commands. This would instead burden devs with extra stuff to keep updated. I don't like either of these outcomes, so I prefer API-side workflows.
Update 11 Sep 2021
For running batches of the same kind of commands, we have used some client-side workflows. The client makes a list of commands. Then sends them, either one-at-a-time or in parallel. Then marks them off as success responses come back. This also makes it easy for the client to retry only commands that fail.
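A one-at-a-time version of that client loop might look like the Python sketch below. The `send` callable, the use of `ConnectionError` for timeouts, and the `max_attempts` limit are assumptions. It also assumes the commands are safe to send more than once.

```python
def run_batch(commands: dict, send, max_attempts: int = 3) -> dict:
    """Send each command, retrying only the ones that have not succeeded.

    `commands` maps a client-chosen key to a command. `send` is a callable
    that returns True on a success response. Returns whatever is still
    pending after the last attempt.
    """
    pending = dict(commands)
    for _ in range(max_attempts):
        if not pending:
            break
        for key, command in list(pending.items()):
            try:
                if send(command):
                    del pending[key]      # mark off on success
            except ConnectionError:
                pass                      # leave it pending for the next pass
    return pending
```

An empty result means every command was marked off. Anything left over can be shown to the user for a manual retry.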
The downside: this approach is "chatty" -- it requires a round-trip communication to the server for every command. This increases server/network load versus the "bursty" server-side workflow. Additionally, users with high network latency will see one-at-a-time client workflows slow to a crawl. Ask me how I know.
The server will be doing the same amount of work per command whether you send them separately or as a burst. However, the server also uses CPU and memory to send/receive network requests. So more communication means fewer resources available for back-end work. How many fewer depends on the kinds of offloading supported by the server hardware.
Client workflows can make sense if you use them judiciously.
There are some very common questions people ask when implementing CQRS APIs. I will list the principles I have come to follow as headings, then detail the common questions behind them.
A popular misunderstanding is that commands should return nothing at all. This goes all the way back to the CQS pattern (which CQRS just extends). That pattern was applied to an object and its methods inside specific languages. Many languages use exceptions as the error propagation strategy. A "command" method was especially noted by the fact that it returns `void`. So the notion was born that commands return nothing. However, it is implied that an error will throw an exception, which is really just a different return path.
So the truth of the matter is that commands do return something. They return meta-information about the operation itself (whether it succeeded or failed and why). This is very different from returning business data, which is the job of Queries.
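As a small Python illustration, a command result can carry only information about the operation. The `CommandResult` shape and the `not_found` error code are assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CommandResult:
    """Meta-information about the operation, not business data."""
    ok: bool
    error_code: Optional[str] = None
    message: Optional[str] = None

COURSES = {"c-1": {"active": True}}   # stand-in storage

def deactivate_course(course_id: str) -> CommandResult:
    course = COURSES.get(course_id)
    if course is None:
        return CommandResult(False, "not_found", f"No course {course_id}")
    course["active"] = False
    return CommandResult(True)
```

The caller learns whether the command worked and why not. To see the course itself, it runs a query.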
Commands can make 0 or more changes. In other words, "making changes" is the purpose of a command, not the required outcome. So it is entirely valid for a command to run successfully but result in nothing changed.
We have cases like this where we compare an entity before and after running a command. If they are exactly the same, then we choose to make 0 changes and return successfully.
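A before/after comparison like that could be sketched in Python as follows. The `Course` fields, the `rename_course` command, and the `WRITES` log are hypothetical. Returning the number of writes is just one way to report the outcome.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Course:
    course_id: str
    title: str
    active: bool

STORE = {"c-1": Course("c-1", "Forklift Safety", True)}
WRITES = []   # records each save, to make no-ops visible

def rename_course(course_id: str, new_title: str) -> int:
    """Succeeds either way; returns how many changes were written."""
    before = STORE[course_id]
    after = replace(before, title=new_title.strip())
    if after == before:
        return 0                 # success, zero changes
    STORE[course_id] = after
    WRITES.append(after)
    return 1
```

Frozen dataclasses compare by value, so `after == before` is true exactly when no field differs.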
A lot of questions come up based on the misconception that CQRS principles should apply to the insides as well as the outsides of an API. Specifically there are a lot of questions about whether or not it is ok for command handling code to run a query. Instinctively, this seems like a violation of CQRS principles. But CQRS only makes recommendations about the external surface area of the API. The insides of a command are implementation details on which it holds no opinion other than "makes changes".
So feel free to run queries to grab some information needed to make decisions within a command. One caution though. It is common in more advanced scenarios that the query data may have come from a cache or otherwise may be lagging behind what is "current". (Often referred to as Eventual Consistency.) In this case, you must consider the effect of stale data from the query on the decisions your command makes. See more on that here. It could be that slightly stale data won't matter, such as is normally the case with configuration data. Example: when a user changes configuration data, they expect that some things happened under the old configuration, but future things will happen under the new configuration. They will probably not notice or care that a user slipped in an operation under the old config during the few hundred milliseconds of eventual consistency after they made the change. They will just assume the operation was executed before their change.
A common objection to commands not returning data is: I need to return the auto-generated ID. Auto-increment IDs are very convenient, but they have significant trade-offs. They don't scale for one thing, and they have security concerns for another. But let's ignore that for a moment and focus on a common usage issue: retries.
A user fills out a form to create a new entity and hits Submit. The request times out.
If an auto-increment field is your only ID, your app has no way of knowing whether the request succeeded. The remedies to this situation typically depend on user awareness and participation.
If the user just hits Submit again (very likely), but the previous request did create the entity despite the timeout, then there are now two of the same entity with different IDs. To properly clean up, the user should now search for duplicates and remove the redundant entity (highly unlikely).
Alternatively, after a timeout, the user could search for their maybe-created entity. And if they fail to find it, come back and fill out the form again. This scenario is not likely in my experience. Maybe it could happen if you add training costs to get users accustomed to thinking this way.
You could add in external systems of duplicate checking, such as keeping a memory of seen operations and their results. But there is a better way...
An ID was generated (or requested from the server) when the form was loaded, before the user even started typing anything.
After the user is informed of the request timeout, she just hits Submit again. The UI sends off the exact same request as before with the same pre-generated ID. In the best case it succeeds as normal. In the worst case the API responds with: "This entity already exists." And if the UI can identify this specific error, it can just pretend it succeeded as normal. This adventure results in a better user experience and no chance of duplication.
We tend to use UUIDs for all identification purposes. They are easy to generate on many platforms. They defy trend analysis. Most of our creation forms have to run a query anyway (for example, to get drop-down list data), so we just include a new UUID in the results too.
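Here is a rough Python sketch of that flow. The function names, the `AlreadyExists` error, and the `course_types` drop-down data are made up for illustration.

```python
import uuid

class AlreadyExists(Exception):
    """The specific error the UI needs to recognize."""

COURSES = {}   # stand-in storage

def load_form() -> dict:
    """Query backing the creation form; includes a fresh ID."""
    return {"new_id": str(uuid.uuid4()), "course_types": ["Online", "Classroom"]}

def create_course(course_id: str, title: str) -> None:
    """Server side: refuse to create the same ID twice."""
    if course_id in COURSES:
        raise AlreadyExists(course_id)
    COURSES[course_id] = title

def submit(course_id: str, title: str) -> bool:
    """Client side: on a retry, 'already exists' counts as success."""
    try:
        create_course(course_id, title)
    except AlreadyExists:
        pass
    return True
```

If the first request actually went through before timing out, the second `submit` with the same ID hits `AlreadyExists` and the user still sees success, with only one entity stored.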
Update 11 Sep 2021
The above works well with internal APIs that we consume ourselves. But as we have gotten into external APIs, we are considering a different strategy. Especially for operations which create new entities (e.g. Create Order). We cannot trust external clients to provide a unique ID matching our constraints. Or even to regurgitate an ID they got in a previous query.
Instead, we are looking at using the client-provided ID as reference data. When an entity is created, we will generate our own ID for it. But the client's ID will be attached to the entity and indexed for lookup. The client can use its own ID to call our API. But the ID that we depend on internally still meets our standards.
Another scenario is where the client does not have its own ID but uses ours. The approach still works if slightly modified. The client provides a request ID. Once an operation completes, the client can ask for information about the created entity (including our ID) using the request ID.
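Since we are still only considering this, the following Python is a sketch of one possible shape, not something we run. Scoping references per client and ignoring a repeated reference are both assumptions. The same index works whether `client_ref` is the client's own entity ID or a request ID.

```python
import uuid

ORDERS = {}          # our ID -> order data
BY_CLIENT_REF = {}   # (client, client's reference) -> our ID

def create_order(client: str, client_ref: str, items: list) -> None:
    """Command: create an order under our own ID, indexed by the client's reference."""
    key = (client, client_ref)
    if key in BY_CLIENT_REF:
        return                        # repeat of an earlier request; nothing to do
    order_id = str(uuid.uuid4())      # the ID we depend on internally
    ORDERS[order_id] = {"client_ref": client_ref, "items": list(items)}
    BY_CLIENT_REF[key] = order_id

def find_order(client: str, client_ref: str):
    """Query: look up the created entity, including our ID, by the client's reference."""
    order_id = BY_CLIENT_REF.get((client, client_ref))
    if order_id is None:
        return None
    return {"id": order_id, **ORDERS[order_id]}
```

The command still returns no business data. The client gets our ID afterwards through the query.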
Commands are the gatekeepers of change. Queries are the library of knowledge. That's CQRS. I have found that this pattern leads me in the right directions. It is also a versatile pattern. It doesn't care if your deployment surface area is monolithic or micro. You can even split commands and queries into their own separate services to scale out read loads separately from write loads.
But bear in mind that this is just one piece in a larger system, not a tool for every job. The CQRS pattern works well at the border of a back-end system, interfacing with client applications. As with any pattern, it will only be useful when applied in the right situation.