I introduced our company to OKRs this year as a way for everyone to be better aligned and to increase transparency across our work, and as part of that, created this key result for myself:
Introduce a framework for building out [our company]-sized applications within a 6 month timeframe (design to public launch) by September 30, 2020
Basically, the initiative I've been working on has taken longer than we'd like (it always does). I understand that we're paying a pioneer tax as we work through things for the first time, that as a new team we're learning to work together, and that we have (like everyone) been dealing with a global pandemic, but it's still too long. Getting to market quickly is the difference between success and failure. "Done is better than perfect", nobody can use a product that doesn't exist, and all that. For us, we've got a couple of other P-Type Loonshots we'd like to try out, and we'll have to move faster to be able to do that.
I've been around long enough to have experienced, many times, the phenomenon of the second implementation of a technology architecture being the worst one. The first time through, which is where we currently are, you make certain assumptions and cut certain things due to timelines, while making mental notes to do those things next time. Some of those assumptions will be good ones, others won't, and that starts to shape decisions for the next time. Inevitably (and most importantly), a bunch of lessons get learned once we're in production and supporting this thing for real, and we add those to the next implementation, too.
And then the next time comes, and you add all those bells and whistles and lessons learned and missed opportunities, and you end up with an overly bloated software project that, while far more functional, took longer to build and carries layers of shit you don't really need. Half the time, it seems like that's the stuff that breaks most often and needs the most maintenance. Ah, the irony: all the stuff we wanted to do the first time makes the second time take longer and the result harder to use. This reminds me of another rant of mine, about job orchestration applications that pile on a ton of options I don't care about while failing to deliver a good experience on the most basic functionality. But I digress.
Anyway, by the time the third and fourth implementations come along, you've got a couple of iterations under your belt, you know what works and what doesn't, and you can cut away the fat and move faster with the right bells and whistles. This was my experience with serverless frameworks, and especially with DynamoDB. The first time through a serverless project, I was hooked, but (in hindsight) I made a bunch of mistakes because I didn't know any better. The second time, I wanted to fix everything I did wrong the first time, and it ended up bloated, with lots of unnecessary bells and whistles. By the third one, while you're still learning and adding things, you start to get the foundational stuff right, and you start to really know which bells are the right ones to make things work as expected and to ensure they stay maintainable.
When it comes to the architecture of the initiative I'm working on now, there are a number of things I can't/won't share, but those things are not important to the discussion here. What is important is the baseline architecture that we can use to build out those P-Type Loonshots, and that's the rest of this post.
I've written previously about Function-as-a-Service (FaaS) and event driven architectures, so I won't dive back down that rabbit hole again. But what I will say here is that as I think about the P-Type initiatives that I mentioned previously, each of them fit this architectural pattern, which is partially why this baseline architecture works for each of them. Now, stay with me for a second as I go sideways here.
When I was a boy (cue my kid's eye rolling), The Mythical Man-Month was practically required reading for computer nerds. Then I go out to Wikipedia to grab that link and see that it has publication dates of 1975 and 1995, which makes me want to start drinking earlier in the day. Anyway, there is a section in that book that talks about the similarities between the great European churches and successful software projects, and I reference that analogy often. I'll even bastardize it here to reinforce the point I'm trying to make. Many of the great churches, like the Sistine Chapel, were built over decades, and during that time, they inevitably went through multiple architects and designers. Yet as you stand on the floor of these churches and look up, you can't see where one designer ended and another began. Successful software projects are the same. We should strive for a consistent theme across the initiative, not a Frankenstein's monster with parts bolted on that don't seem to fit with the rest.
What does this have to do with anything? Well, I think when you are faced with a greenfield initiative, like our P-Types, it's important as a Solutions Architect to really think about the theme, and (this is going to sound cheesy) let the right architecture present itself. Serverless, FaaS, and event driven architectures don't fit every problem perfectly. Case in point: I don't use those architectural approaches when I design and implement data environments, where horizontally scalable distributed compute via Hadoop clusters remains king (albeit running Apache Spark, Flink, Presto, etc., rather than the now-outdated traditional Map/Reduce paradigm). And, depending on the staff and organization, I might not even recommend it when I do think it's the best fit. If an organization doesn't have the right staff or the appetite to take on a given architectural approach, the worst thing you can do is force it. That never ends well. Luckily, that's not the situation I find myself in presently, and the architecture that has presented itself as the best fit for our current product and for the ones we're planning is, in fact, serverless, FaaS, and event-driven. Whoop whoop!
Our application is mobile-first, with two mobile code bases (iOS and Android) that are beyond the scope of this post. Everything else is AWS-specific, so I'll refer to AWS services, but all of the big three cloud providers offer basically the same services under different names and with some nuances; this could all just as well be done in GCP, for instance.
We use Cognito for OAuth, which lends itself out-of-the-box to Lambda triggers that can fire at various points of the sign-up, authentication, and forgot-password flows:
- Pre sign-up
- Pre authentication
- Custom message
- Post authentication
- Post confirmation
- Define Auth Challenge
- Verify Auth Challenge Response
- Create Auth Challenge
- Pre Token Generation
- User Migration
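To make the trigger list above concrete, here is a minimal sketch of what a "Pre sign-up" trigger might look like. The domain check is entirely hypothetical (Cognito just hands the function an event and expects it back); `example.com` is a placeholder, not our real configuration.

```python
# Hypothetical Pre sign-up Cognito trigger: auto-confirm users who sign up
# with a trusted email domain. The domain is a made-up example.
ALLOWED_DOMAIN = "example.com"

def pre_sign_up_handler(event, context):
    """Cognito invokes this with its standard trigger event shape and
    expects the (possibly mutated) event to be returned."""
    email = event["request"]["userAttributes"].get("email", "")
    if email.endswith("@" + ALLOWED_DOMAIN):
        # Skip the confirmation-code step for trusted addresses.
        event["response"]["autoConfirmUser"] = True
        event["response"]["autoVerifyEmail"] = True
    return event
```

Each of the other triggers follows the same contract: receive the event, mutate the `response` section, return it.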
For interactions between the mobile code bases and our backend, instead of maintaining tens to hundreds of endpoints, as a traditional RESTful API design might, we opted for AppSync so we could use GraphQL and minimize the number of "pipes" between the front end and the back end.
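One way that "single pipe" can play out on the backend is an AppSync Lambda resolver: AppSync passes the GraphQL field name and arguments, and one function routes every field. The field and entity names below are illustrative only, not our real schema.

```python
# A hedged sketch of an AppSync Lambda resolver. AppSync invokes the
# function with the resolved field name in event["info"]["fieldName"]
# and the GraphQL arguments in event["arguments"].
def resolver_handler(event, context):
    field = event["info"]["fieldName"]   # e.g. "getOrder" (hypothetical)
    args = event.get("arguments", {})
    if field == "getOrder":
        return fetch_order(args["id"])
    if field == "listOrders":
        return list_orders(args.get("limit", 20))
    raise ValueError(f"Unhandled field: {field}")

def fetch_order(order_id):
    # Placeholder; the real implementation would query the data store.
    return {"id": order_id}

def list_orders(limit):
    # Placeholder; the real implementation would run a paginated query.
    return []
```

The point is that one deployable unit sits behind the GraphQL pipe, which pairs naturally with the BFF pattern discussed next.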
All right, now we're getting down to business.
We've got our two mobile code bases and a single pipe (if you will) that allows those code bases to reach into the backend, where we have consciously gone with the backend-for-frontend (BFF) pattern. This allows us to have not only that single pipe via GraphQL, but for it to connect to only a single backend.
I'm a huge proponent of polyglot persistence and of DynamoDB, as I've written before, and DynamoDB is our BFF backend. Now, I've also written about how DynamoDB's preferred single table design can sometimes conflict with an event driven architecture (or at least add to its challenges), specifically when it comes to filtering events by table name. I'll get to that in a second; hold your horses. What's important to note here is that we are using DynamoDB: a glorious wide key/value, schema-less, NoSQL managed service with single-digit millisecond response times and massive scalability.
Getting back to my OKR, we've got to build this thing out within a six month window, and here's a spot where that time is going to start to get eaten into. Spend the time to plan your single table design properly. Build the spreadsheet, take your time, really, really think through your access patterns, use generic key names, include a sort key for all PK/SK and GSI purposes, check your work thrice, take a break and revisit it with fresh eyes the next week, and be diligent.
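As a small illustration of the generic-key convention described above, here is one made-up access pattern (orders by customer). `PK`/`SK` are the generic names; the entity prefixes and IDs are hypothetical, not our real model.

```python
# Sketch of generic single-table keys for a hypothetical access pattern.
def customer_item_key(customer_id):
    # The customer item is its own partition.
    return {"PK": f"CUST#{customer_id}", "SK": f"CUST#{customer_id}"}

def order_item_key(customer_id, order_id):
    # Orders share the customer's partition, so one Query can return
    # the customer and all of their orders together.
    return {"PK": f"CUST#{customer_id}", "SK": f"ORDER#{order_id}"}

# With boto3, "all orders for a customer" would then be roughly:
#   table.query(KeyConditionExpression=
#       Key("PK").eq("CUST#123") & Key("SK").begins_with("ORDER#"))
```

This is exactly why the spreadsheet exercise matters: every access pattern has to be expressible as a key condition like the one in the comment before you commit to the key shapes.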
So, let's generously budget a month for that, to be safe, happening in parallel with the mobile development. I recommend doing the single table data modeling first because it makes the GraphQL work significantly easier; at that point, you've got some screens on the front end and a backend that lets the app retrieve the data it needs to perform its actual functionality. Since so much time and effort was dedicated to the access patterns, the GraphQL piece is relatively straightforward, and we can measure the time necessary for this work in weeks.
Circling back to the challenges with single table and event driven architectures, what is traditionally problematic is building subscriptions. When you isolate tables (say a CUSTOMER table, a PRODUCT table, and an ORDERS table), you can simply trigger a Lambda function from each of them that publishes the NEW_IMAGE payload to SNS, using the table name as the metadata for the message. Any downstream function can then add a simple filter on the table name and know when to fire. Care about orders? Just execute when the metadata reflects ORDERS. Care about product changes? Just execute when the metadata reflects PRODUCT. With a single table, that convenient table-name filter is gone, because everything lives in one table.
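The multi-table pattern just described can be sketched as a small, pure helper: a DynamoDB Streams-triggered Lambda shapes each record into an SNS publish call, stamping the table name as a message attribute. The table names here are the illustrative ones from above.

```python
import json

def stream_record_to_sns_message(record, table_name):
    """Build the kwargs for publishing one DynamoDB Streams record to SNS.
    (A real handler would pass this dict to boto3's sns.publish; that call
    is omitted so the sketch stays self-contained.)"""
    new_image = record["dynamodb"]["NewImage"]
    return {
        "Message": json.dumps(new_image),
        "MessageAttributes": {
            # Downstream subscribers filter on this attribute.
            "table": {"DataType": "String", "StringValue": table_name},
        },
    }
```

An SNS subscription filter policy of `{"table": ["ORDERS"]}` is then all a downstream consumer needs, which is precisely what single table design takes away.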
However, and this is some pretty cool stuff, EventBridge allows us to trigger on attributes within the payload. Game changer! I knew this was coming, but around the new year, seven months ago, my initial trials with EventBridge were filled with tribulations. It felt clunky, various frameworks we used didn't yet support it, and it generally felt like one of those AWS moves where they throw something into the market when it's about 75% of the way there and then wait for their massive captive audience to give them the feedback to make it better. Or, as with things like Glue, they just let it remain a turd. But I digress.
Anyway...EventBridge has since become more widely supported in various frameworks, so it fits in with our CI/CD now, and we've generally done the needful to feel comfortable moving away from SNS and over to EventBridge. And this keeps our architecture very straightforward.
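Here is a hedged sketch of what that EventBridge approach looks like: instead of filtering on a table-name message attribute, rules match on fields inside the event detail itself. The source name, detail type, and `status` field below are placeholders, not our real event schema.

```python
import json

def order_event_entry(order):
    """Build one PutEvents entry for a hypothetical order-change event.
    (A real producer would pass a list of these to boto3's
    events.put_events; the call is omitted to keep the sketch pure.)"""
    return {
        "Source": "app.orders",          # placeholder source name
        "DetailType": "OrderChanged",    # placeholder detail type
        "Detail": json.dumps(order),
        "EventBusName": "default",
    }

# A rule pattern that fires only for shipped orders, regardless of which
# producer emitted the event -- this is the content-based filtering that
# replaces the table-name metadata trick:
SHIPPED_ORDERS_PATTERN = {
    "source": ["app.orders"],
    "detail-type": ["OrderChanged"],
    "detail": {"status": ["SHIPPED"]},
}
```

The pattern dict would be attached to an EventBridge rule (via IaC, in our case), and each downstream Lambda gets exactly the events it cares about.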
One more thing: we have a number of Lambda functions that connect to EventBridge, but I'm going to call out one specifically, and then share a picture of this architecture. Here's more of that polyglot persistence. One consumer of EventBridge is a Lambda function that publishes data to Kinesis Firehose, which in turn writes that data to S3; then, with object lifecycle rules, we make sure our data is also archived to Glacier. Here's the baseline of that complete architecture, so far:
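That archival consumer can be sketched as a one-liner of shaping logic. The helper below is an assumption about how one might format the records, not our exact implementation; the actual Firehose call is noted in the comment.

```python
import json

def event_to_firehose_record(event):
    """Shape one EventBridge event into a Kinesis Firehose Record dict.
    (A real handler would call firehose.put_record(
        DeliveryStreamName="...", Record=<this dict>) via boto3.)"""
    # Newline-delimited JSON keeps the resulting S3 objects friendly to
    # Athena/Presto later in the pipeline.
    return {"Data": (json.dumps(event["detail"]) + "\n").encode("utf-8")}
```

Firehose then handles the buffering, batching, and S3 delivery, and the lifecycle rule takes it from S3 to Glacier without any further code.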
Now, I'm only going to loosely cover this next section, in part because I've covered it in the past, but I think RDBMS technologies are largely unnecessary in this day and age. I like relational models, because they allow me to explore my data through the relationships within it, but I prefer decoupled compute and disk, with the former (bi-directionally) elastic and ephemeral (scaling in, to zero, is just as important as scaling out!), and technologies like Presto give me everything I want, including the declarative language, without the limitations of outdated technologies. On the downside, because I'm using EMR, I do have to have a VPC with a private subnet and all that shit, but hey, you can't have everything, and at least going the EMR route, I have more control over my Spark jobs than Glue allows. So, we can extend the architecture to include this base level data extension, and then we'll hook up some consumption services. At scale, I'd replace Athena with Presto on EMR, I might replace SageMaker with Jupyter on EMR, and I'd probably go the Looker or Tableau route over QuickSight, but all in due time. For now, at our size, the managed services make more sense.
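For the small-scale option mentioned above, querying the S3 data via Athena amounts to building a query request. The database name and results bucket below are placeholders; the real call would hand this dict to boto3's `athena.start_query_execution`.

```python
def athena_query_kwargs(sql, database="app_lake",
                        output="s3://example-athena-results/"):
    """Build the kwargs for an Athena query over the S3 data lake.
    Database and output location are hypothetical names."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output},
    }
```

Swapping Athena for Presto on EMR later means only the execution target changes; the SQL written against the lake stays the same.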
One could argue that the data environment, despite being near and dear to my heart, is not a necessity for our P-Type rollouts, though you could certainly make an argument that it is (an argument I do in fact make). Regardless, provisioning EMR clusters to run ETL jobs against the data being captured is incredibly valuable, and gives us insights that help us discover and push beyond our adjacent possible. Most importantly for this post, it's out-of-the-box infrastructure that is a natural extension of the rest of this architecture.
And this post would be woefully incomplete without mention of the other pieces implemented as part of our initial deployment: all the back-office stuff that makes it stick together. That includes, but is not limited to, our CI/CD process built on GitHub Actions; logging and alerting via Datadog; additional alerting through everyone's favorite chat application, Slack; PagerDuty; infrastructure-as-code (IaC) using Terraform and Serverless; linters; repositories; and on and on. All of this requires patience and dedication to get right, and our Cloud Engineering (traditionally called DevOps or Infrastructure) teams deserve a ton of credit for absolutely nailing it, because we get to reuse it over and over again. And make no mistake, it is all of this work that really makes the OKR possible.
So that brings us to the custom bits: the parts that make the new stuff around the corner P-Type and not S-Type. But that's just the other side of EventBridge. It's tying Lambda functions together and extending the architecture for downstream processing, firing on specific event types, and building all the secret sauce components that we hope make all the difference. And we've got plenty of time left in the six-month estimate from my original OKR to do just that, and to fit it all together with the baseline I've described here. Now, we just have to make sure we don't make the dreaded second implementation mistakes...