I love speaking publicly. I know that isn't all that common, but there's something kind of exhilarating about it, which is why, when I was lucky enough to be offered an adjunct professor position in 2019, I jumped at it. Before completing my first semester, though, I took a position in a country halfway around the world and had to quit, which was bittersweet. Best move I've ever made, but it sucked not to be able to finish that work, which I really enjoyed. I'll probably never get another opportunity to teach at the university level, and I still feel bad about having to leave that behind.
That said, there are other people foolish enough to occasionally invite me to bore them, too, and this past July I was asked to speak to some Master's students from a university in Auckland.
It's not the same thing of course, and it's really not the same thing during a global pandemic(!), but I still enjoy doing it, even if it's via Zoom. This talk was about considerations when building cloud native architectures, which is what I want to talk about here, in part because I keep encountering situations in which organizations I'm working with in one capacity or another simply don't seem to understand it.
Case in point: we've got a particular situation in which we need a vendor to push some data to S3, write it into some date partitions (so we can record the data in an Apache Hive metastore), and split the data into multiple objects, so that it's consumable in parallel, using a horizontally scalable processing language (Apache Spark).
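To make that concrete, here's a rough sketch of what a date-partitioned, split layout on S3 looks like. (The `vendor-feed` prefix, the Parquet format, and the part count are all made up for illustration; the point is the Hive-style `year=/month=/day=` key structure and multiple objects per partition so Spark can read them in parallel.)

```python
from datetime import date

def partitioned_keys(prefix, d, n_parts):
    """Build Hive-style date-partitioned S3 keys, split into n_parts
    objects so a Spark job can consume the partition in parallel."""
    base = f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}"
    return [f"{base}/part-{i:05d}.parquet" for i in range(n_parts)]

keys = partitioned_keys("vendor-feed", date(2021, 7, 15), 4)
for k in keys:
    print(k)
# e.g. vendor-feed/year=2021/month=07/day=15/part-00000.parquet
```

With keys shaped like this, the Hive metastore partition registration is a one-liner (`MSCK REPAIR TABLE` or `ALTER TABLE ... ADD PARTITION`), and services like Kinesis Firehose will generate this layout for you.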
Seems pretty straightforward, right? This has been a common pattern for more than a decade. I mean, Apache Hadoop was open sourced in 2006, and these approaches have been around at least since then! And, on top of it, this vendor and ourselves are both using AWS as the cloud provider. If you're an engineer on this one, it seems like a no-brainer, right? We create an S3 bucket, give them IAM cross-account access, they publish data to Kinesis Firehose and head off to the bar at lunchtime.
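The cross-account piece really is that small. A minimal sketch of the bucket policy is below; the account ID and bucket name are invented for illustration, and a real setup would tighten the principal to a specific role and add conditions (encryption, object ownership, and so on).

```python
import json

VENDOR_ACCOUNT = "111122223333"   # hypothetical vendor AWS account id
BUCKET = "my-ingest-bucket"       # hypothetical destination bucket

# S3 bucket policy granting the vendor's account cross-account write access.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VendorWrite",
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{VENDOR_ACCOUNT}:root"},
        "Action": ["s3:PutObject"],
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
}

print(json.dumps(policy, indent=2))
```

Attach that to the bucket, and the vendor's side is just writing objects; no bespoke infrastructure on either end.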
Except nope. Instead, they're going to somehow manage the splitting of data manually to create the various distributed objects, and they're going to manually manage the partitions, too.
Because why the hell would you want to use the out-of-the-box solution that would take hours to implement and leverage services built to do exactly the thing we need, when you could take four weeks and a six-figure estimate to do it worse? I mean, sure, their solution will be harder to maintain and scale, but at least it will be more expensive and brittle for everyone involved.
Now, this particular organization is a .Net shop because things like .Net give me pause in my atheist views. I can't help but wonder: if there are tech companies that are all-in with AWS that refuse to give up .Net, then there really must be a Hell, and if there is a Hell, then there must be an afterlife, and perhaps I should start packing my sunscreen, if you know what I mean. But I digress.
Anyway, what I want to do here is just talk a little about building native cloud architectures, and (regardless of whether you're using .Net or Ruby or Python or Node.js or whatever) what types of considerations you need to make.
Let's start with a quick look at why organizations often move to the cloud:
- Perceived Cost Savings
- Scalability...(note I said “scalability”, not “elasticity”)
- Frees up IT Staff
- Buzzwords galore
Done correctly, all of these benefits can be yours; done incorrectly, and you're going to find yourself with bigger bills and bigger headaches. That's one of the reasons I always, always, always tell organizations that they need to supplement their existing staff with serious cloud experience before embarking on a migration from on-premise to the cloud. Your staff's on-premise experience is great for knowing your business, and a ton of their technical skills will absolutely, positively translate, but architecturally, things are just too different, and you can't learn it from a book or a class. You have to bring in someone who has done it, over and over and over again, who has earned their scars and emerged on the other side enlightened by those learnings. You don't want to pay for your folks to gain those scars the natural way, because it will cost you millions, and it will cost you years. And in the meantime, your solutions are going to suck, you're going to accumulate a whole bunch of horrible technical debt, and when the light finally does start to come on, years later, you're going to have zero appetite to start over, and you might not have any money left to try anyway.
Therein lies our first moment of realization. Just getting your car onto a race track does not make it a race car.
Just putting your code into the cloud doesn't make it a cloud architecture. This is little more than the classic rehosting (lift-n-shift) paradigm, or, similarly, the replatforming approach.
When it comes to rehosting, you are going to move quickly. Because, well, you're really not doing much. You might think it's going to be easier to maintain, but it's the same code base. Maybe it's more scalable, but probably you just ditched your old server and put the code onto an EC2 instance. Maybe that EC2 instance is bigger than the server you had, so it's kinda more scalable, and certainly AWS is going to keep offering you bigger servers when you need them, but at a significant cost. You're not writing off that server purchase over time; instead, you just have a constant monthly payment.
Let's talk about this for a second. When you have a server running on-premise, it's always plugged in, even if it isn't running at 100% capacity all the time. And it isn't. You need a server that can accommodate your peaks, but peaks are just that: they're not constant. The fictitious chart below indicates the computational resources that a server might use over a 24-hour period:
But let's take that a step further. Our server might need lots of compute overnight when some regularly scheduled loads are running, then drop down early in the morning and ebb and flow throughout the day before dropping down after business hours and spiking back up for the next night's scheduled loads. But we can extrapolate things out even further. Maybe this is the backend RDBMS for our online store, and while we can pretty well guesstimate our needs on a normal daily basis, what about Black Friday and the lead-up to Christmas, when things shoot up like at no other time during the year? Now we need even more compute that is going to sit unused for 90% of the year:
Having to forecast the size of the server you need is called provisioning for peak, and if you're having those conversations and you're in the cloud, you're doing it wrong. I'm going to circle back to this, but before we do, let's just call out another point or two from those expected benefits of the lift-and-shift approach. You do get improved security, because you get AWS security out of the box, but you still have to layer on your own security, too, so you're not avoiding security work by moving to the cloud. All right, moving on.
I'm going to largely skip over repurchasing. Basically, that's when you decide to throw away your home-grown solution (e.g. CRM) for something out of the box (e.g. Salesforce). Probably a topic for another time, but not my point here.
What I'm talking about when it comes to architecting native cloud solutions is really the box in the upper right: refactoring. Building cloud native solutions that take advantage of the pricing model of the cloud, and of the development speed, breadth of technology options, and architectural capabilities that the cloud enables. Until you start to build native cloud solutions, you're driving a station wagon. And let's be honest here: nobody dreams of driving the family truckster.
All right, let's throw a cloud-native rule of thumb out there: never pay for idle compute.
One of the many beauties that the cloud affords us is the ability to scale bi-directionally (that's elasticity). Using our earlier images, we're saying that we pay for the line, not above it!
How do we do that? Another cloud-native rule: by decoupling our compute and our storage, and managing them independently. If we're in a data environment, maybe that means we store our data on S3 and use EMR for our compute. Or perhaps we're building a data-intensive application, and we persist our data to DynamoDB and leverage Lambda for our computational layer, amongst a myriad of other options.
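As a tiny sketch of that DynamoDB-plus-Lambda flavor: storage and compute live in separate services and scale on their own. The table shape and field names below are hypothetical, and the actual `put_item` call is left as a comment so the shaping logic stays a pure, unit-testable function.

```python
import json

def to_item(event):
    """Shape an incoming request body into a DynamoDB item.
    Pure function, so it's trivially testable without AWS."""
    body = json.loads(event["body"])
    return {
        "pk": body["user_id"],      # partition key (hypothetical schema)
        "sk": body["order_id"],     # sort key
        "total": str(body["total"]),
    }

def handler(event, context):
    # In the real Lambda we'd persist with boto3, e.g.:
    #   boto3.resource("dynamodb").Table("orders").put_item(Item=to_item(event))
    # DynamoDB (storage) and Lambda (compute) then scale independently,
    # and while no requests arrive, the compute bill is zero.
    return {"statusCode": 200, "body": json.dumps(to_item(event))}
```

Neither side knows or cares how big the other is, which is exactly the decoupling the co-located RDBMS below can't give you.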
Here's a non-cloud native approach to show the difference, using a traditional RDBMS as an example. In that traditional setup, we have our compute and our disk co-located together. In that case, if we need more of one (say, disk), our option is to buy a bigger server, with which we get more compute, even though we didn't need it (and the price of disk is quickly approaching zero, but the cost of compute most certainly is not):
Instead, by decoupling compute and storage, we can manage them independently, recognizing that the amount of disk we're going to need over time is almost always going to increase, while the amount of compute we need is significantly more up and down, with spikes that are short-lived.
And another point about compute (and another cloud-native rule of thumb): scale it independently. It is as important to scale in to zero as it is to scale out to meet increased compute needs. This is exactly how we exploit the pricing model of the cloud: by scaling in to zero when we don't need compute, and by using computational services that let us pay only for usage, like Lambda, for instance. If I'm not firing any functions (compute), then I have no costs.
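You can see the scale-to-zero property right in the pricing math. Lambda bills per request and per GB-second of execution; the rates below are the published us-east-1 rates as I remember them at the time of writing, so check current pricing before relying on them.

```python
# Approximate Lambda pricing (us-east-1, subject to change):
PER_REQUEST = 0.20 / 1_000_000    # $ per invocation
PER_GB_SECOND = 0.0000166667      # $ per GB-second of execution

def monthly_cost(invocations, mem_gb, avg_seconds):
    """Rough monthly Lambda bill, ignoring the free tier."""
    return invocations * (PER_REQUEST + mem_gb * avg_seconds * PER_GB_SECOND)

# Zero invocations means a zero bill -- that's scaling in to zero.
print(monthly_cost(0, 0.5, 0.2))
# A busy month still only charges for actual execution time:
print(round(monthly_cost(5_000_000, 0.5, 0.2), 2))
```

Contrast that with an EC2 instance, which bills every hour it's running whether a single request arrives or not.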
When we've decoupled our compute and storage, we also get to take advantage of yet another cloud-native rule of thumb: polyglot persistence (and polyglot computation). Basically, using the right tool for the job at hand, rather than forcing everything through whatever your on-premise solution happens to be.
When you have an on-premise solution, you don't get to try very many things, because failures have really massive costs. Not only are you paying to get some software (and, probably, hardware for it to run on) into the organization, plugged in, secured, administered, installed, configured, etc., but then you do a quick-and-dirty PoC, decide it's not going to work, and you're screwed! Imagine you want to leverage a graph database, and you get yourself sign-off for a server and go through that rigamarole only to find out that the idea doesn't have merit. How many times will the boss sign off on that little game? In the cloud, you have a wealth of things at your fingertips, from graph databases to cache layers to FaaS to RDBMS to distributed DBMS to natural language processing to Hadoop to GraphQL to text-to-speech to facial recognition, and on and on, and you can try them out with a couple of clicks of your mouse!
The bottom line there is to use the right tool for the job. Use an RDBMS when it makes sense, and a Document or K/V datastore when it makes sense and a Graph database when it makes sense and a Cache layer when it makes sense. Use a streaming technology when it makes sense and Function-as-a-Service when it makes sense, and microservices where they make sense and horizontally scalable distributed compute when it makes sense, and so on.
Here's a simplified version of this. There's a mobile app, backed by GraphQL, using a wide key/value store, backed by an on-demand FaaS compute layer that writes to a messaging bus, with more FaaS compute behind it, some of which is streaming to an immutable object datastore, where the data is processed using open source software on an ephemeral distributed compute platform, then consumed by a series of services (pay for what you use). Another stream on the backend of the messaging bus is writing to a natural language distributed compute environment, where native visualization applications are providing visibility into the data, and finally, at the bottom, a graph database is yet another consumer from the messaging bus. And this is just a sliver of what a full architecture might look like, for illustration purposes! Imagine the cost of trying to do all of this on-premise. Sure, it's doable, but who the hell wants to pay that price when it's all out of the box in your hosting platform, at a fraction of the price, if you're doing it right?
Now, I'd be remiss if I pretended that every organization is ready to do this. I recommend doing an honest assessment of your company to make that determination:
- Know Thy Organization
- Do they have the right staff?
- Do they have enough money?
- Do they have the tolerance?
- Do the ones that make the decision and the ones that write the checks truly get it?
- Are they willing to bring someone in with experience to help guide things?
And further, determine what it is they really want out of the cloud:
- What Do They Really Want Out of the Cloud?
- Minor cost savings, easier maintenance and some basic improvements or
- A shift towards technology as a differentiating factor; the true opportunities that the cloud offers (are they looking at this as a project with an end or a never-ending journey?)
And if they are ready for a real mindset shift, towards a never-ending journey, do yourself a favor and get yourself someone with some real experience doing this stuff, and get rid of your .Net bullshit so I can go back to being a perfectly content atheist.