tl;dr CQRS is more complex to implement than a traditional single-instance CRUD solution (e.g. one backed by an RDBMS), but the concepts are not as daunting as many articles on the topic might lead us to believe. It is not an architecture; it is a pattern that can be implemented within various architectural approaches, using various technologies to accomplish its objectives. That doesn't mean it's easy to implement and maintain, but it's a very viable solution that needn't be overly difficult, either. This post attempts to simplify the concept and discuss various technologies within an architectural approach that can lead to a cost-effective, secure, and scalable solution.
CQRS stands for Command Query Responsibility Segregation. It grew out of CQS (Command Query Separation), which applies the same idea at the level of individual methods; this post uses the two terms interchangeably. Simply put, it is a pattern that separates reads of data objects from mutations of those data objects. By segregating these responsibilities, you can manage both independently, which lets you "right-size" hardware and computational resources for each, as well as manage security independently. The quote below does a good job of explaining both what CQS is and, more specifically, what it is not:
“CQRS is a very simple pattern that enables many opportunities for architecture that may otherwise not exist. CQRS is not eventual consistency, it is not eventing, it is not messaging, it is not having separated models for reading and writing, nor is it using event sourcing.” (Greg Young)
The aforementioned link further provides a good practical example of the simplicity of the CQS/CQRS pattern, where a simple customer service is shown:
void ChangeCustomerLocale(CustomerId, NewLocale)
Applying CQRS on this would result in two services:
void ChangeCustomerLocale(CustomerId, NewLocale)
If we look at this through the lens of a traditional RDBMS based CRUD approach, all that has occurred here is to separate the R from the C, U and D. One service is responsible for reading the data, whereas another is responsible for creating, updating, and deleting (which I'll refer to collectively as mutating). The model of the data object (in this case, Customer) has not changed; instead the actions applied to the object have been segregated.
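As a minimal sketch of that separation, the snippet below splits one customer service into a write side and a read side over the same, unchanged Customer model. Only ChangeCustomerLocale comes from the example above; the other method names, the fields on Customer, and the in-memory dict used as storage are assumptions made purely for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Customer:
    # Hypothetical fields; the model is shared by both services.
    customer_id: int
    name: str
    locale: str


class CustomerWriteService:
    """Commands only: each method mutates state and returns nothing."""

    def __init__(self, store):
        self._store = store

    def create_customer(self, customer):
        self._store[customer.customer_id] = customer

    def change_customer_locale(self, customer_id, new_locale):
        current = self._store[customer_id]
        self._store[customer_id] = replace(current, locale=new_locale)


class CustomerReadService:
    """Queries only: each method returns data and never mutates state."""

    def __init__(self, store):
        self._store = store

    def get_customer(self, customer_id):
        return self._store[customer_id]


store = {}  # in-memory stand-in for the shared data store
writes = CustomerWriteService(store)
reads = CustomerReadService(store)

writes.create_customer(Customer(1, "Ada", "en-US"))
writes.change_customer_locale(1, "en-GB")
print(reads.get_customer(1).locale)  # en-GB
```

Because both services sit on the same store here, a read immediately reflects the preceding write; nothing about the split itself forces separate models or delayed synchronization.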
Normally, the biggest concern we hear with an approach like this is around data staleness and synchronizing the data between multiple services, since an update to one likely isn't immediately accessible in the other. This is a legitimate concern, and part of why this pattern introduces some complexity, which we will break down in more detail within this post.
First, let's debunk an often-misunderstood data reality regarding staleness: as soon as data is exposed to a user, it is potentially stale, because a separate process could update that data at any point after it has been read. A good example of this is provided here:
Let’s say we have a customer service representative who is on the phone with a customer. This user is looking at the customer’s details on the screen and wants to make them a ‘preferred’ customer, as well as modifying their address, changing their title from Ms to Mrs, changing their last name, and indicating that they’re now married. What the user doesn’t know is that after opening the screen, an event arrived from the billing department indicating that this same customer doesn’t pay their bills – they’re delinquent. At this point, our user submits their changes. (Udi Dahan)
Should we accept their changes?
Well, we should accept some of them, but not the change to ‘preferred’, since the customer is delinquent. But writing those kinds of checks is a pain – we need to do a diff on the data, infer what the changes mean, which ones are related to each other (name change, title change) and which are separate, identify which data to check against – not just compared to the data the user retrieved, but compared to the current state in the database, and then reject or accept.
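To make the scenario in the quote concrete, here is a small sketch in which each change is submitted as its own operation and checked against the current state rather than the state the user saw on screen. Splitting the changes up this way is an assumption made for illustration, as are the function and field names; the quote itself only describes the problem.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical fields covering the changes mentioned in the quote.
    last_name: str
    title: str
    preferred: bool = False
    delinquent: bool = False


def change_name(customer, title, last_name):
    # A name change does not depend on billing status, so it is accepted.
    customer.title = title
    customer.last_name = last_name
    return True


def make_preferred(customer):
    # Checked against the current state, not the state shown on the screen.
    if customer.delinquent:
        return False
    customer.preferred = True
    return True


customer = Customer(last_name="Smith", title="Ms")
customer.delinquent = True  # billing event arrives after the screen was opened

results = {
    "change_name": change_name(customer, "Mrs", "Jones"),
    "make_preferred": make_preferred(customer),
}
print(results)  # {'change_name': True, 'make_preferred': False}
```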
Data staleness is a reality whether we're using a traditional CRUD or CQS approach (or any other approach; it's simply a characteristic of the passage of time and stateful data). If we can accept that reality, we can more easily accept the decoupled service approach of CQS. How often the data is synced (in the above scenario, how quickly a refresh of the customer support screen would reflect the accepted changes) is the challenge of eventual consistency. The frequency and timing of that syncing can be handled in various ways, using various technologies (a couple of which we'll be exploring below). Again, though, data synchronization between disparate data objects is a technology and architectural decision made when implementing the CQS pattern; it is not part of the pattern itself.
Sidenote: here, I'm talking about eventual consistency between two completely separate data stores. I'd be remiss if I didn't also speak briefly to eventual consistency within a single distributed system, which these types of systems usually are, and which is often the topic of the "C" in the CAP theorem.
Within a distributed system, consistency is often a matter of a data value propagating through the nodes of a cluster. In those scenarios, it is possible to run a query and retrieve the current value from one node (say, node0), then run that same query again and get an outdated value from another node (perhaps node26) that hasn't received the update yet. In the scenario I've outlined here, the effect of eventual consistency is amplified, because data has to propagate from one system (e.g. an RDBMS) to another (e.g. a document data store), with the latter being a distributed system that has eventual consistency handling of its own.
I certainly don't mean to discount this as a characteristic of a distributed system; rather, I'm simply pointing out that eventual consistency can be a characteristic of the overall platform/system in addition to the characteristics of a distributed system.
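The node0/node26 situation from the sidenote can be imitated with a toy, in-memory cluster. This is only a sketch under simplifying assumptions: writes always land on node 0, replication happens when replicate() is called explicitly, and the Cluster class and its methods are made up for this example rather than modeled on any particular database.

```python
class Cluster:
    """Toy cluster: writes land on node 0 and reach other nodes later."""

    def __init__(self, size):
        self.nodes = [{} for _ in range(size)]
        self.pending = []  # (node_index, key, value) updates not yet applied

    def write(self, key, value):
        self.nodes[0][key] = value
        for index in range(1, len(self.nodes)):
            self.pending.append((index, key, value))

    def read(self, node_index, key):
        return self.nodes[node_index].get(key)

    def replicate(self):
        # Apply every outstanding update, in the order it was written.
        for index, key, value in self.pending:
            self.nodes[index][key] = value
        self.pending.clear()


cluster = Cluster(size=27)
cluster.write("email", "old@example.com")
cluster.replicate()
cluster.write("email", "new@example.com")

print(cluster.read(0, "email"))   # new@example.com
print(cluster.read(26, "email"))  # old@example.com (not yet replicated)
cluster.replicate()
print(cluster.read(26, "email"))  # new@example.com
```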
The most common RDBMS-based approach to implementing CQS is to separate reads from data mutations using database views: reads go through the views, while mutations happen on the underlying tables themselves.
This seems straightforward enough; to be sure, it's a pattern we've witnessed in the RDBMS space since practically the beginning of RDBMS. The problem with this approach is that you cannot easily scale the two independently: out of the box, accessing a view consumes from the same pool of computational resources as data mutations. Security of the separate objects must also be applied at a database resource level, which is certainly possible, but unnecessarily complex to keep updated as additional tables and views are inevitably added over time.
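The view-based approach described above can be sketched with Python's built-in sqlite3 module, used here only as a convenient stand-in for a full RDBMS. The table, view, and function names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        locale      TEXT NOT NULL
    );
    -- Reads go through the view; mutations go to the table.
    CREATE VIEW customer_read AS
        SELECT customer_id, name, locale FROM customer;
    """
)


def change_customer_locale(customer_id, new_locale):
    """Command: mutates the underlying table and returns nothing."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE customer SET locale = ? WHERE customer_id = ?",
            (new_locale, customer_id),
        )


def get_customer(customer_id):
    """Query: reads only from the view."""
    return conn.execute(
        "SELECT name, locale FROM customer_read WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()


with conn:
    conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'en-US')")

change_customer_locale(1, "en-GB")
print(get_customer(1))  # ('Ada', 'en-GB')
```

Both functions share one connection and one database engine, which is exactly the limitation noted above: the read path and the write path draw on the same resources.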
So the complexities around the CQS pattern are not all that daunting. We're just separating responsibilities. In the above example, we didn't alter the model in any way, we didn't introduce eventing or event sourcing, and we don't have to worry about eventual consistency, since any data mutation would immediately be reflected in the next read of the data.
However, the complexities are introduced in the practical applications of the pattern; that is where we start to see things like varying models, eventing, and eventual consistency (amongst others). As we walk through one (of many) practical deployments here, we will examine how all those things can come into play.
Perhaps we want to accomplish data access (the R in CRUD) using a document datastore, like DynamoDB, because it gives us the ability to scale (for all practical purposes) infinitely and to manage security independently. This is a common technology for this type of usage, partly because of its performance capabilities, but also because the access patterns for reads are often well understood. Obviously, data exploration for analytic purposes is an exception here; that type of exploration does not have a well-defined access pattern, and a traditional RDBMS is better suited for it. But for the customer support example above, or for displaying a "hello, your name here" greeting upon successful sign-on to a website (for instance), the access pattern is well understood. The image below shows this access pattern:
While a separate actor could perform the data mutations, and often does, for the sake of this post we'll imagine the reader and the writer to be the same. To be clear, in our example, the user (depicted as a laptop in the above image) can update information about themselves, such as their physical and/or email address. Here, we'll introduce some event processing using a messaging approach. The user signs in, at which point certain information is retrieved (e.g. "hello, your name here"), and then they update their name. That update passes a message to a queue, or in this case an SNS topic, via a Lambda call, which in turn initiates a separate Lambda call (an event based on the arrival of a message) that updates an RDBMS table, depicted here as an Aurora RDS instance:
The responsibilities of query (reading data) and data mutation have now been separated, using a set of technologies and applications (DynamoDB and Aurora) and an event-driven architectural approach (Lambda) via messaging (SNS). This also allows some classic ETL-type functionality, such as data cleansing, validation, conforming, and transformation, to occur between the read from SNS and the write to the RDBMS. Additionally, we can now scale the various components independently: DynamoDB scales based solely on the access requirements of the reads, and the RDBMS can be scaled based on the mutation requirements of the writes. It is worth noting that this approach has in fact introduced separate models for the data; DynamoDB is a document datastore, so its items more closely resemble JSON objects, composed of (potentially nested) key/value pairs, than the more rigid, flattened table structures of an RDBMS. Again, this is not a characteristic of the CQS pattern but of this implementation of it; we see separate models in this instance specifically because of the technologies and applications chosen.
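The write path just described (an update request publishes a message, and the arrival of that message triggers a handler that cleanses the data and writes it to the RDBMS) can be imitated locally. Everything below is a stand-in and an assumption for illustration: the Topic class plays the role of SNS, the two plain functions play the roles of the Lambdas, and sqlite3 plays the role of Aurora. No AWS API is used or implied.

```python
import json
import sqlite3


class Topic:
    """In-memory stand-in for a topic: fans each message out to subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)


db = sqlite3.connect(":memory:")  # stand-in for the RDBMS
with db:
    db.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, email TEXT)")
    db.execute("INSERT INTO customer VALUES (1, 'old@example.com')")


def write_handler(message):
    """Stand-in for the function triggered by a message arriving on the topic."""
    event = json.loads(message)
    email = event["email"].strip().lower()  # cleansing step
    if "@" not in email:                    # minimal validation step
        return
    with db:
        db.execute(
            "UPDATE customer SET email = ? WHERE customer_id = ?",
            (email, event["customer_id"]),
        )


topic = Topic()
topic.subscribe(write_handler)


def update_email(customer_id, email):
    """Stand-in for the function invoked by the user's update request."""
    topic.publish(json.dumps({"customer_id": customer_id, "email": email}))


update_email(1, "  New@Example.com ")
print(db.execute("SELECT email FROM customer WHERE customer_id = 1").fetchone())
# ('new@example.com',)
```

The publisher knows nothing about the database, and the handler knows nothing about the user interface, which is what allows each side to be secured and scaled on its own.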
With regard to security, the approach is also simplified versus the table/view approach discussed previously. We can set read access on DynamoDB differently than we do the security of the Lambda that performs the ETL and subsequent writes to the RDBMS.
However, the problem of eventual consistency has not been addressed, because no synching is shown to keep the data between Aurora and DynamoDB consistent.
Because the Aurora application enables us to trigger events based on database changes, we can accomplish this remaining requirement by extending our event-driven approach, continuing to leverage Lambda and SNS accordingly:
One of the strengths of the approach illustrated here is in the timing of events. Because it is event-based, we've minimized the latency between a user initiating a mutation (e.g. an update of their email address) and the ability to expose that information back to the user. Depending on the extent of the ETL, a change like that can make the round trip in under a second.
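The synchronization step can be sketched as a small projection: each time the RDBMS reports a changed row, a handler reshapes the flat row into a nested document and overwrites the corresponding item in the read store. The column names, the document layout, and the dict standing in for the DynamoDB table are all assumptions for illustration; this is not the AWS API.

```python
read_store = {}  # stand-in for the document table, keyed by customer_id


def to_document(row):
    """Reshape a flat relational row into a nested, JSON-like document."""
    return {
        "customer_id": row["customer_id"],
        "name": row["name"],
        "contact": {
            "email": row["email"],
            "address": {"city": row["city"], "postal_code": row["postal_code"]},
        },
    }


def on_row_changed(row):
    """Stand-in for the function fired when the database reports a change."""
    document = to_document(row)
    read_store[document["customer_id"]] = document  # overwrite the whole item


on_row_changed(
    {
        "customer_id": 1,
        "name": "Ada",
        "email": "ada@example.com",
        "city": "London",
        "postal_code": "N1",
    }
)
print(read_store[1]["contact"]["email"])  # ada@example.com
```

Overwriting the whole item on every change keeps the handler idempotent: replaying the same change event leaves the read store in the same state.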
An alternative solution here might be to use something like SQS as the queueing service between Aurora and DynamoDB, which would allow you to minimize write connections to DynamoDB; you could, for instance, choose to perform updates in mini-batches (perhaps every 5 minutes), applying any mutations that have accumulated since the last execution. I personally prefer the event-driven SNS approach, which minimizes latency at the expense of cost (it allows concurrent DynamoDB writes as frequently as Aurora mutations occur, and thus could increase the required provisioned write capacity units). Still, the SQS approach is shown below, with the mutations managed via a CloudWatch timed event:
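In code terms, the mini-batch variant might look like the following sketch. Changes accumulate in a queue, and a periodically invoked function drains it, collapsing multiple changes to the same customer into a single write. The deque standing in for SQS, the dict standing in for DynamoDB, and the function names are assumptions for illustration; the timer itself is left out, so drain_batch() is simply called by hand.

```python
from collections import deque

queue = deque()   # stand-in for the queue between the RDBMS and the read store
read_store = {}   # stand-in for the document table, keyed by customer_id


def enqueue_change(customer_id, attributes):
    """Called once per mutation; cheap, and does not touch the read store."""
    queue.append((customer_id, attributes))


def drain_batch():
    """Stand-in for the function run by the timed event.

    Returns the number of writes made to the read store.
    """
    latest = {}
    while queue:
        customer_id, attributes = queue.popleft()
        # Later changes to the same customer overwrite earlier ones.
        latest.setdefault(customer_id, {}).update(attributes)
    for customer_id, attributes in latest.items():
        read_store.setdefault(customer_id, {}).update(attributes)
    return len(latest)


enqueue_change(1, {"email": "first@example.com"})
enqueue_change(1, {"email": "second@example.com"})
enqueue_change(2, {"city": "Austin"})

print(drain_batch())           # 2
print(read_store[1]["email"])  # second@example.com
```

Three mutations resulted in two writes, which is the cost saving; the trade-off is that the read store can lag the RDBMS by up to one batch interval.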
Finally, we'll extend the diagram with some additional access points for data visualization and data exploration, both of which are better served from the RDBMS, in this case leveraging the native database technology of SQL:
In this post, we've (hopefully) simplified the CQRS pattern and reduced confusion around what the pattern is and isn't, as well as shown a viable technology and architectural approach in which this pattern can be applied. To be sure, this approach introduces a number of complexities over a traditional single-server approach, but the reality is that many real-world problems have requirements that are not best met with traditional approaches. CQS gives us the ability to address those requirements while continuing to exploit the cloud pay-for-what-you-use pricing model and managing security at an individual service level.