It's been a long while since I've written anything here. Admittedly, too long, for my own sanity. As I've mentioned before, this is somewhat therapeutic for me. In my defense, a lot has happened over the past few months. My daughter was diagnosed with CRMO, I moved my family to a new country where we're pursuing permanent residency, I took a new full-time (and then some!) job at a still-stealth FinTech startup, I've been busy building a team and driving a pivot in our [pre-initial-release] architecture, and we're all currently in the midst of a global pandemic! No excuses; just lots of real life going on.
Anyway, I've had so many learnings come from the aforementioned pivot in architecture, and it's occurred to me more than once that some of the amazing lessons I'm learning would make for good topics here. At last, I'm finally getting around to driveling on about one: finding the right balance between the "right way" to model DynamoDB and doing it within an event-driven architecture.
If you want in-depth instruction on modeling DynamoDB, Rick Houlihan is the right person for the job. I've watched his AWS re:Invent 2019 talk, Amazon DynamoDB deep dive: Advanced design patterns, multiple times, and I have recommended it hundreds more times, it seems. Alex DeBrie (author of The DynamoDB Book) gives another great talk, and there are a number of other resources out there that will do it better than I'll even try to. You may even notice some of DeBrie's key style in my examples here.
Similarly, if you're looking for an in-depth walkthrough of event-driven architectures, check out Martin Fowler, and for serverless, I'd point you at folks like Jeremy Daly and Yan Cui. Deep dives on any of those topics are beyond the scope of this post.
What I'm going to do here is talk about an intersection between these things; a balance between the "right way" to model DynamoDB and leveraging DynamoDB effectively in a serverless, event-driven architecture. And to do that, I'm going to talk a little bit about what that "right way" to model DynamoDB is, and why I choose not to follow that to a "T" when I build event-driven architectures.
But before any of that, what is DynamoDB? From fifty thousand feet, it's a NoSQL, key/value and document datastore built for hyperscale (Netflix, Nike, Lyft, Airbnb, and Amazon.com, to name a few) with crazy performance capabilities. You should use it when you need those things, and WHEN YOU KNOW THE ACCESS PATTERNS. If you don't know the access patterns, it's a horrible choice. Think OLTP, and definitely not OLAP. But for things like the back-end of a website or a mobile app, it's amazing. And even if scale isn't your driving factor, but you know your access patterns and have a serverless architecture, it avoids the pitfalls of things like connection management with traditional relational databases.
That's a pitiful explanation of DynamoDB, but hey, the internet is installed on most computers now, so try that new Google fad for more information on that front. This writing is about DynamoDB modeling, and my personal opinion of when to do it not-quite-right, if you will.
Okay, with that out of the way, let's jump into the right way to do DynamoDB data modeling, (close to) by the book. Just to switch it up a little bit, let's not use the same old tired example of a customer buying products at a store. Instead, since I'm writing this with the 2020 draft playing in the background, let's use the NFL. I'm going to loosely start with a relational ERD:
It's worth noting that a relational model such as this should be the first step in the process of modeling DynamoDB, followed by identifying the access patterns of your application. For the sake of this post, I'm going to pretend my access patterns are as follows:
- By player, retrieve single most recent statistics
- By team, retrieve season statistics
- By team, retrieve game statistics
- By college, retrieve active players
- By game, retrieve score
- By week, retrieve schedule
There are inevitably a number of additional ways we might want to retrieve data, but I'll keep it to this list to retain some semblance of brevity.
And assuming we are going to follow DynamoDB single-table design, which is the recommended approach, we might end up with a DynamoDB table that looks like this (noting that I have a partition key called primaryKey and a sort key called sortKey):
To be sure, this is a first cut at one way we might do this, and I'm sure there are a number of things we might want to do to clean this up. Honestly, that's about a half hour of effort to get from the ERD to this, so it's certainly far from perfect. My point, though, is simply to illustrate the concept of denormalization based on access patterns, the use of overloaded, generic keys, and differing attributes within a "schema-less" table.
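A few hypothetical items (keys and values invented for illustration) show the shape of that first cut: generic, overloaded keys and differing attributes living side by side in one "schema-less" table.

```python
# Hypothetical single-table items -- keys and values invented for illustration.
# Every item shares the same generic primaryKey/sortKey pair, overloaded per
# entity type, so one table can serve all six access patterns.
items = [
    {"primaryKey": "PLAYER#123", "sortKey": "STATS#2020-W01",
     "passYards": 300, "passTDs": 3},                # player's most recent stats
    {"primaryKey": "TEAM#KC", "sortKey": "SEASON#2020",
     "wins": 14, "losses": 2},                       # team season stats
    {"primaryKey": "GAME#2020-W01-001", "sortKey": "SCORE",
     "homeScore": 34, "awayScore": 20},              # score by game
    {"primaryKey": "COLLEGE#TexasTech", "sortKey": "PLAYER#123",
     "status": "active"},                            # active players by college
]
```

Note how the same generic key pair answers very different questions depending on the entity-type prefix; that overloading is the whole trick.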
So, now we have a single table and we can access the data for all of the aforementioned access pattern requirements. We can avoid doing any joins (a NoSQL must), and we're avoiding doing any scans (a DynamoDB desire). All great things.
To be clear: real life relational models are significantly more complex than this simple example! Real life RDBMS systems often have a hundred tables (or more), not just the seven I've drawn out here. And real life applications have significantly more access patterns than the six I described! So, this is just about as simple an example as we can have, but it's not hard to imagine how much more complicated the overall solution might be with a more complex set of data and the relationships between them (I'm going to come back to this later).
Now, just for the sake of argument (note: you wouldn't want to do this!!!), let's imagine we've instead simply translated our relational diagram to a set of DynamoDB tables (again, you do not want to do this!!!). The reason we'll do so here is not about access patterns or single tables. Instead, we're going to do it because it makes our event-driven architecture easier. Here's how:
Here, we're not focused on access patterns, per se. Rather, I'm illustrating the back-end of things a little bit. In this scenario, we're using the DynamoDB streaming events capability, triggering a Lambda function with each change made to the source DynamoDB tables. That Lambda function is very simple: read the incoming image, create message attributes based on the source table, and publish it to SNS (or EventBridge).
Then, subsequent Lambda functions can fire based on topic subscriptions to the SNS topic, using those message attributes. This is really simple and really straightforward. If a consuming Lambda function (on the right side of the SNS topic) needs to do something based on (e.g.) a change to the Player table, it is invoked only when the associated DynamoDB table changes. And simplicity is almost always my driving factor in architectural decision making.
Alternatively, one of two approaches is reasonable (one more than the other), using the single table model:
The first approach is the worse of the two: you could take every change to the single table, publish it to SNS, fire every downstream/consuming (right side of SNS) Lambda function, and perform the required filtering within those consuming functions. From a cost perspective, maybe this isn't prohibitive (though it may well be), but it certainly isn't the most scalable approach, and it's unnecessarily complex. We're forcing unnecessary filtering code into every consuming function.
I like the analogy of an event-driven architecture as an order coming in: the person receiving that order announces "I have an order," and the consumers who care about orders (fulfillment, shipping, etc.) do something with that information. In this scenario, the analogy changes. Now, someone stands up and announces "I have something," and everyone has to take that something, look at what it is, and decide whether they care about it. It's clearly a worse approach.
One alternative to that approach is to add the logic into the Lambda function that fires on change to DynamoDB (the Lambda on the left side of SNS). This is markedly better! Now, the downstream/consuming Lambdas aren't asked to do too much. Everything comes into one place, but downstream consumers aren't asked to filter things to determine if they care. They can focus on just writing business logic (and, of course, the SNS topic subscription), and we're taking advantage of the recommended single table pattern. But...
That's a lot of pressure on that left-side (SNS publishing) Lambda function, for one thing. Lots of complexity there (especially if we have a real-life complicated ERD and many complex access patterns), and it's going to be frequently under construction and reflecting the frequent changes of a dynamic application, so there's some increased risk, too. We're taking advantage of the many benefits of a single table, but we're introducing complexity elsewhere. Remember, this Lambda function previously was as simple as possible (take input/create attributes from table name/publish to SNS). And I'll take that one step further: by taking this approach, not only is this full-of-complexity Lambda function going to be under change often, but so is the single table.
Is it prohibitively complex? Nah, but I do think it's easier to do this with one or two people than with a two-pizza sized team. With a team, we have a tendency (and need) to split work amongst different people, and when we have a single piece of work that sits in the middle of our critical data flow path, any mistakes can cause significant issues for everyone.
Additionally, even in scenarios where mistakes aren't being made, the single table pattern can have a tendency to cause problems. Recall the group intercommunication formula made famous in the classic software book The Mythical Man-Month (side note: I've weirdly mentioned things from this 1975 book more than once lately...):
n(n − 1) / 2
And recognizing that there are a number of ways to implement a NoSQL model to accommodate various access patterns, team members will inevitably have debates, discussions, and required communications to determine how best to access the data they need from the single-table design.
Finally, getting your head around data modeling for NoSQL is not easy. It's not just throwing a JSON object into a table. And speaking from experience, even after doing implementations of it and thinking I had it, I didn't. And I'm still learning, and still continuing to develop those skills. So there is some inherent complexity, and the learning curve is a steep one.
Now, all that said, I do like the pattern of the single table. I just don't necessarily like it as a single table.
What I am trying to say is that we should land somewhere between the extremes when we're doing this for an event-driven architecture. We shouldn't have a relational model (far from it!), but it's not a single table, either. Perhaps our earlier example becomes a small handful of tables, and maybe we'll push some of the complexities to the consuming Lambda functions (or even the publishing one, though I prefer keeping that as simple as possible). Maybe we create a teamPlayer table:
In this example, I purposely didn't change any of the rows that remain. And we can squint and imagine we've created another table for schedules and results of those games within that schedule. Now, we've done significantly better than the original seven tables, but we're also not using just one.
Now, consuming Lambda functions can filter on stats (team or player) versus schedule/results, and our publishing Lambda function can remain static and simple.
And as we imagine our ERD growing into the hundred-tables model that many real-life applications have, perhaps we end up with a handful of DynamoDB tables. The chart below isn't science; instead, it just attempts to generally illustrate that the number of NoSQL tables doesn't increase at the same rate that relational tables do.
What was the lesson I learned from the initiative I'm currently working on? That finding the right balance is tough. I wish I'd pushed for a little more of the single-table pattern up front, but not too much. More importantly, I wish I had used more generic field names and insisted on a sort key on every table, even if we were just replicating the partition key into it. We're far, far from a relational model, and I think generally we've done a pretty good job of finding that balance, especially recognizing that we not only have a two-pizza team, but we're also distributed and working in (physical) isolation due to COVID-19, and all of these things make this stuff all the harder to do perfectly right.
I've been racking my brain to figure out the right way to move forward, and the more I think about it, the more comfortable I am with where we are, but only if we're all aware of the patterns, recognize the benefits and drawbacks, and continue to build the right foundation (and build on that foundation). So, it's a matter of communication and a matter of training and learning. Then making the best decisions we can, with the information and time available to us. And from here on out, generically starting with primaryKey and sortKey on all tables.
I'm going to mention one more thing. I have personally not built an application using GraphQL, but it's obviously made some impressive headway in the tech space since Facebook originally released it back in 2015. It's a query API approach, and with DynamoDB, a naive resolver design can make multiple calls per request, undermining the benefits the single-table approach was intended to provide. Again, this is outside of my expertise, so I'd recommend doing some independent research to back this up, but I know enough to at least mention it here. So that's that.