tl;dr Frighteningly often, something as simple as a scheduling (job orchestration) application can be problematic enough to destroy an entire architectural initiative. To avoid this, we have to understand the reason, positioning, and correct usage of these potential architecture-destroyers.
This post may end up being more rant than informational blog, but because I've seen scheduling applications literally destroy entire analytic platform initiatives, I'll dedicate the afternoon to writing it in hopes it helps someone to avoid the pain I've experienced more than once.
Let's set the stage by putting together a fictitious (albeit common) scenario. Imagine your organization (or client) has decided to move away from its monolithic MPP solution, such as Redshift, in favor of a more modern architectural approach for its data science and analytics platform. This isn't to say that Redshift (et al.) never lives up to the hype, and this post doesn't intend to sway organizations away from Redshift; rather, we're using this example as one organization that made that decision for a number of reasons, some of which I'll briefly cover here.
Many organizations turn to technology solutions like Redshift hoping for a turnkey path from their (e.g.) Oracle RDBMS data warehouse to the cloud, only to be let down by the results. This can be due to a number of factors: poor data distribution techniques, an attempted lift-and-shift of an existing physical data model, or challenges common to the technologies that stemmed from the 2005 MIT C-Store paper (including ParAccel, whose technology AWS licensed to build Redshift), such as concurrency limits and the costs associated with tightly coupling compute and storage, which is exacerbated when we're talking about clusters.
As part of this more modernized platform solution, let's further imagine that your organization has decided to leverage distributed compute technologies ephemerally, like EMR clusters with an open source distributed SQL engine (such as Presto), and decoupled persistent data storage using AWS S3. Further, your Engineers have decided on a more modern data integration application, such as the open source option Apache Spark. In order to achieve all of this, the Hive metastore lives on a (small) MySQL RDS instance which can be leveraged by these ephemeral EMR clusters. By taking this approach, you can terminate your high-cost Hadoop clusters without losing data (S3) or the metadata about it (RDS), which includes the format, layout and location on disk, and you can provision new clusters and make the decoupled data immediately accessible. A (very) high-level look at this architectural approach is shown below:
This architectural approach, while seemingly basic at this level, is quite involved, and well beyond the scope of this discussion. However, it's important to call out a few highlights of this approach:
- Exploits the pay-for-what-you-use/leasing model of the cloud, by leveraging compute ephemerally (EMR)
- Decoupled compute and storage allows for fully scalable, inexpensive storage (S3)
- Open source (free) data integration application, using Apache Spark
- BYO Programming Languages: resources can leverage their favorite programming languages and use their existing skills. Data Scientists may choose to use R, Python, etc., Analysts can leverage SQL, Engineers can leverage Scala, etc.
- Open source (free) distributed SQL engine application, using Presto
- Effectively unlimited scalability for both compute and storage
- Because we're leaning on open source technologies (Spark, Presto), we get to take advantage of the larger tech community to continue to improve the offerings
- Engineers want to work on emerging technologies. While rarely a driving factor behind moving technologies, we all know how hard talent acquisition is in this day and age, and adding technology appeal to an organization has a lot of merit and a myriad of benefits both for finding and retaining talent
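Tying back to the shared-metastore piece of this design: on EMR, pointing every ephemeral cluster at the same external metastore is typically done with a `hive-site` configuration classification. A hedged sketch, where the endpoint, credentials, and driver class are placeholder assumptions, not real values:

```python
# Sketch (assumed endpoint/credentials): an EMR "hive-site" classification
# that points each ephemeral cluster at the shared metastore on RDS, so
# every new cluster immediately sees the same tables over the S3 data.
hive_site = {
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://metastore.example.us-east-1.rds.amazonaws.com:3306/hive",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "<from-secrets-manager>",
    },
}
# This dict would be passed in the Configurations list when provisioning
# the cluster (e.g. via the EMR API), rather than baked into each AMI.
```

Because the metastore and the data both outlive any one cluster, cluster termination becomes a non-event.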
As the Engineers are busy porting and re-writing ETL, adjusting data models, and implementing all the bells and whistles that they've wanted for years but didn't want to add to the legacy system with all its previous sins and buried bodies, the only thing left to add is the scheduling tool, which seems like it should be the easiest component in this architecture...
First, let's quickly call out what we're looking for here, and some of the current open source players in this space. I'll limit it to a small handful of open source options to continue with our previous trend:
- Airflow (originally developed by AirBnB)
- Luigi (originally developed by Spotify)
- Azkaban (originally developed by LinkedIn)
- Pinball (originally developed by Pinterest)
There are many, many others, and certain related offerings, like Jenkins, offer workflow management functionality, but I'll keep this list limited to these few as this post doesn't intend to offer comparisons or technology recommendations (although you'll certainly get some opinions). The bigger point is simply to provide some examples for the reader to explore at their leisure.
In my opinion, these applications go astray in what they offer (which might be why so many companies feel the need to create their own). To be sure, when you build an application like this, it's natural to add any number of bells and whistles, which can make it far more appealing, as well as address site-specific requirements (assuming, I suppose, that others in the larger tech community share those previously unmet requirements). If we can add additional functionality, why not? Airflow, for instance, can be leveraged as a poor man's ETL tool, if desired. I'll get into why I recommend against this approach shortly, but I've seen it fail or massively under-deliver in the wild multiple times, and I've spoken with even more organizations that attempted the same, with the same results. This isn't about pointing out my perceived Airflow flaws, especially when the primary use should be job orchestration, so let's quickly talk about what it is we want from our job orchestration application in this architecture.
It seems simple enough, at face value. What I want from a job orchestration application is to initiate and monitor our (Spark and Presto) jobs on our ephemeral EMR clusters based on points in time and/or the successful completion of upstream processes via real or logical Directed Acyclic Graphs (DAGs); to retain some history (so we can report on run times over time, for instance); to be able to restart DAGs (in the event of a failure, for instance); a GUI to make life easier; and alerting when something goes wrong, or at least an API layer that enables us to easily write the alerting ourselves.
That's it. I don't need to run any transformations in our job orchestration application. I don't need the job orchestration application to connect to my RDBMS instances, access data, or any other bells and whistles. I have other technologies that do those things, and those technologies are built to do those things.
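To make the "that's it" concrete, the core of those requirements (a DAG of jobs and dependency-ordered execution) fits in a few dozen lines. A minimal sketch, with every name illustrative rather than taken from any real tool:

```python
# Minimal sketch of the orchestration core described above: a DAG of
# named jobs, executed only after their upstreams complete. Illustrative
# only; no real scheduler's API is being modeled here.
from collections import defaultdict

class Dag:
    def __init__(self):
        self.deps = defaultdict(set)   # job name -> set of upstream job names
        self.jobs = {}                 # job name -> callable

    def add(self, name, fn, upstream=()):
        self.jobs[name] = fn
        self.deps[name] |= set(upstream)

    def run_order(self):
        # Kahn-style topological ordering: a job becomes "ready" once
        # every one of its upstream dependencies has been emitted.
        done, order = set(), []
        while len(order) < len(self.jobs):
            ready = [j for j in self.jobs
                     if j not in done and self.deps[j] <= done]
            if not ready:
                raise ValueError("cycle detected in DAG")
            for j in sorted(ready):
                order.append(j)
                done.add(j)
        return order

dag = Dag()
dag.add("extract", lambda: "raw")
dag.add("transform", lambda: "clean", upstream=["extract"])
dag.add("load", lambda: "done", upstream=["transform"])
```

Everything else on the wish list (history, restarts, alerting, a GUI) layers on top of this spine; none of it requires the tool to touch the data itself.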
This brings us to the first point of contention I often encounter in organizations. There is an argument (which I understand) that says "if a technology can do A, B, and C, why wouldn't we take advantage of it?" Makes sense, on the surface. If our job orchestration application can do a simple ETL task, why would I introduce a different technology (e.g. Spark) and have two processes (one to perform the task, one to schedule its execution) when one tool can do both? Quite simply, the further in bed we get with an application, the harder it is to pivot off of it. If every "easy" task is implemented using our scheduling application and "more involved" integration tasks are implemented using Spark (in our example), you're making it harder and harder to ever pivot off of the scheduling application if and when a better mousetrap emerges, and in today's fast-paced age of technology, new mousetraps are emerging all the time! Simply put, by doing too much with our job orchestration/scheduling application, we make it harder to ever move off of it.
We'll imagine that because we're aware of the risk, we're immune to this trap (we're not), and that we've positioned our chosen job orchestration application correctly. In this case we'll choose Airflow, though theoretically it shouldn't matter, because we're not doing anything beyond the very basic requirements we discussed previously. Of course, in practice the selection of an application, each with its own strengths and weaknesses, does matter, but I'll hand-wave over that (important) detail. We'll persist a small EC2 instance and install the application on it, as shown below:
Full disclosure, I've worked at a number of organizations (both as a consultant and as a FTE) in which Airflow was chosen as the job orchestration application, and I want to quickly cover the challenges that led to undesired outcomes at three of them:
- [delivery company] Scalability (and thus, cost). Deployed as described above, on a single, small EC2 instance. Before long, RabbitMQ was needed, and the single EC2 instance was insufficient to manage the job orchestration of a medium-sized ETL workload, which led to a persisted cluster and significantly increased costs. There should be no reason for a scheduling application to require a cluster simply to initiate the execution of scripts.
- [retail company] Technology mis-use and corporate inertia. Attempted to leverage the application both for job scheduling and for some data integration capabilities, despite already having other technologies in that space. The code base quickly became unwieldy and unsupportable. Further, this organization was struggling to grasp the differences between a traditional on-premises, single-server RDBMS data warehouse and a hosted, distributed compute architecture, and the complexities of leveraging this application proved extremely challenging. Implementation took nearly a year, to put it in perspective.
- [tech company] Scalability (and thus, cost) and technology mis-use. See #1 above, with regard to scalability. Additionally, the application was incorrectly used. The desire was to make the application self-serve (ETL, Analysts, Data Scientists), which led to a number of "templates" being developed, which in turn led to the entire design becoming unmaintainably brittle.
To be clear, it isn't the application, per se, it's the usage of it. The software itself is very impressive, and those who contributed to it should be proud of the work they did. I'm sure the Engineers that developed the offering are far better than I, but how the application is positioned, and some of what it offers, is misguided, in my opinion. As such, I can't reinforce enough the importance of using the application for the tasks at hand and nothing more. In the cases in which I had the misfortune of being at organizations that chose it, it was supposed to be just the job orchestration application. If, instead, your organization decided up front that it wanted to use it as both the ETL application and the job orchestration application, and scale isn't important, and you have existing Python skills and you're not doing much in the way of data integration, maybe Airflow will be fantastic. But I really doubt it. Use a data integration application for data integration. Period.
The job orchestration application that I curse the least (that's as good as I can get with this; I'm a bitter old person by tech measures) is Azkaban, though I'm not sure it's even being supported anymore. The reason I liked it least-worst was that it did a decent job of doing just the things I wanted it to do (see above) and not much more, and it didn't require a cluster of servers simply to initiate and monitor job executions. It didn't have all the bells and whistles of some of the other applications, and thus it was far less tempting to misuse. That said, all scheduling applications are over-cooked in my opinion. I wish someone would build a Python-based job orchestration application that did just the basics and did them insanely well. Quit spending time trying to make it an ETL application and devote that time to polishing the simplest requirements: build DAGs, retain some history and reporting, alert on error, be able to initiate an execution remotely based on successful completions or points in time, and give users a GUI to do it, so that it's accessible by everyone and checks the "self-serve" box.
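For what it's worth, two of those basics, retained history and restart-after-failure, are a small amount of code. A hedged sketch; the file name, record fields, and function names are my own assumptions, not any real tool's API:

```python
# Sketch of "just the basics": run jobs in dependency order, persist a
# history of outcomes and run times, and on restart skip anything that
# already succeeded. Illustrative only; names are invented.
import json
import pathlib
import time

HISTORY = pathlib.Path("run_history.json")  # assumed history location

def run_dag(jobs, order):
    """jobs: name -> callable; order: job names in dependency order."""
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
    for name in order:
        if history.get(name, {}).get("status") == "success":
            continue  # restartability: don't redo completed work
        start = time.time()
        try:
            jobs[name]()
            history[name] = {"status": "success",
                             "seconds": round(time.time() - start, 3)}
        except Exception as exc:
            history[name] = {"status": "failed", "error": str(exc)}
            break  # halt the DAG; downstream jobs wait for a fix and a restart
    HISTORY.write_text(json.dumps(history, indent=2))
    return history
```

The retained history doubles as the reporting source (run times over time) and as the restart checkpoint, which is most of what I actually want from these tools.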
As a bit of a sidenote, if you've seen any of my previous posts, you know I'm a big proponent of serverless, microservices and Function as a Service architectures (FaaS), which don't necessarily have a job orchestration application in them:
“[In traditional architectures], all flow, control, and security was managed by the central server application. In the Serverless version there is no central arbiter of these concerns. Instead we see a preference for choreography over orchestration, with each component playing a more architecturally aware role—an idea also common in a microservices approach.” (Mike Roberts)
But we're not discussing FaaS architectures here. We're talking about our batch processing and our ETL stack, and we (arguably) do need a job orchestration application here, so I mention choreography, but only for completeness. Back to our discussion and the point of this post...
So you might be saying "okay, fine, I get it, don't force a square peg [the job orchestration or scheduling application] into a round hole [using it for things that aren't basic job orchestration, like ETL], so we're fine. We can use Airflow or Luigi or Jenkins or any other application and the title of this post is moot and I'll never get this time back. Thanks for nothing." Not so fast. Lots of folks "got it" and it still ended up being a disaster, so there's more to it than that.
The truth is, even the basics are more involved than they seem to be. Case in point, our architecture in this post leverages ephemeral distributed compute (EMR) clusters, which is completely bi-directionally scalable, exploits the cloud pricing model, and leverages decoupled persistent storage (S3), but it comes with some additional complexities, such as transient IP addresses.
This isn't especially difficult in isolation, but it warrants some discussion, nonetheless. Imagine we provision a 100 node EMR cluster for an overnight ETL cycle that begins at midnight and ends at 3 AM, at which point we terminate that cluster and provision a separate 10 node EMR cluster for some mini-batch ETL that runs from 3 to 8 AM and provides compute for any early-morning Presto queries, followed by terminating that in favor of provisioning two separate EMR clusters that run from 8 AM to 6 PM. One of those clusters is 5 nodes and takes care of inter-day mini-batch ETL processes, and the other is a 20 node cluster for our Analytics, Data Science and data visualization needs. At 6 PM, we terminate those existing clusters and again provision a 10 node EMR cluster for mini-batch and to catch any after-hours Presto queries, which we terminate at midnight, when the cycle begins again.
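The daily rotation described above is easier to reason about as data than as prose. A sketch, with the cluster names and the helper function entirely illustrative:

```python
# The daily cluster rotation from the scenario above, expressed as a
# simple schedule table (hours in 24h local time; names are invented).
DAILY_CLUSTERS = [
    {"name": "overnight-etl", "nodes": 100, "start": 0,  "stop": 3},
    {"name": "morning-mini",  "nodes": 10,  "start": 3,  "stop": 8},
    {"name": "interday-etl",  "nodes": 5,   "start": 8,  "stop": 18},
    {"name": "analytics",     "nodes": 20,  "start": 8,  "stop": 18},
    {"name": "evening-mini",  "nodes": 10,  "start": 18, "stop": 24},
]

def clusters_at(hour):
    """Which clusters should be running at a given hour of the day?"""
    return [c["name"] for c in DAILY_CLUSTERS
            if c["start"] <= hour < c["stop"]]
```

Five distinct clusters per day, each right-sized for its workload; the catch, as we're about to see, is that each one comes up with a brand-new IP address.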
We're exploiting the cloud pricing model by leveraging our compute ephemerally, we're right-sizing the compute for the workload, and we're further segregating our workloads while also leveraging the clusters for multi-tenancy. Our ETL processing happens on right-sized clusters (perhaps even using node types optimized for the applications running on them), the same goes for our analytics and data visualization processing, and those clusters are scaled in and out as needed. Great. But, for a given day, we have provisioned five separate clusters, each of which has a different IP address (which changes each day). If our job orchestration application is going to initiate and monitor processes on these different clusters with transient IP addresses, it's going to take more than out-of-the-box functionality.
One viable option might be to leverage AWS Lambda functions and store relevant information (such as the IP addresses) in a document datastore, like DynamoDB, so when it is time to initiate an Apache Spark ETL job (for instance), we "reach in" to DynamoDB, retrieve the IP address for an available ETL cluster, and then remotely initiate the job on that cluster. Our evolving diagram is updated using that approach, including separate Lambda functions that fire based on CloudWatch events to provision and terminate/decommission EMR clusters at specific times of day, while publishing that information to an SNS topic, which in turn triggers a separate Lambda function to write the data to DynamoDB (a classic FaaS architectural approach):
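The "reach in" step might look like the sketch below. In the real Lambda the records would come from DynamoDB via boto3 (e.g. `boto3.resource("dynamodb").Table("emr_clusters").scan()["Items"]`); here they're taken as a plain list so the selection logic is visible, and the table and attribute names are assumptions:

```python
# Sketch (assumed field names): given the cluster registry that the
# provisioning Lambdas keep up to date, pick an available ETL cluster
# and return its master node's IP for remote job submission.
def pick_etl_cluster(records):
    """Return the master IP of an available ETL cluster, or None."""
    for rec in records:
        if rec.get("role") == "etl" and rec.get("state") == "WAITING":
            return rec["master_ip"]
    return None

# Example registry contents, as the DynamoDB-writing Lambda might store them:
registry = [
    {"cluster_id": "j-AAA", "role": "analytics", "state": "WAITING",
     "master_ip": "10.0.1.15"},
    {"cluster_id": "j-BBB", "role": "etl", "state": "WAITING",
     "master_ip": "10.0.2.40"},
]
```

The orchestration application never needs to know the IPs at all; it only needs to invoke the Lambda, which resolves "an ETL cluster" to a concrete address at run time.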
Yikes. That got a lot more involved really quickly. Now we've got Lambda to worry about, as well as DynamoDB. It is worth considering leveraging some open source solutions to do some of this work, such as Netflix's Genie (which I highly recommend) to do your big data orchestration and Spinnaker to manage provisioning hardware, but again, with each new application we introduce, we have new stuff to monitor, new points of failure, and new challenges. For now, we'll pretend we're rolling our own, because our 5 cluster requirements are pretty straightforward. That said, we do have to manage things, and we have to worry about observability and monitoring, as well as alerting, so I'll add to our architectural diagram: logging to Cloudwatch, persistence of those logs to S3, and we'll use Slack for alerting, with Lambda functions performing the computations for all of it:
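To make the alerting leg concrete: the Slack-alerting Lambda could be as small as formatting the event and POSTing it to an incoming webhook. A standard-library-only sketch; the webhook URL and event field names are placeholders, not real values:

```python
# Sketch of the alerting Lambda: turn a failure event into a Slack
# message and POST it to an incoming webhook. Field names are assumed.
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def format_alert(event):
    """Build the Slack payload from a (hypothetical) failure event."""
    return {"text": f":rotating_light: job `{event.get('job', '?')}` "
                    f"failed on cluster {event.get('cluster', '?')}: "
                    f"{event.get('error', 'unknown error')}"}

def handler(event, context=None):
    payload = json.dumps(format_alert(event)).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:  # network call; needs a real URL
        return resp.status
```

Small as it is, it's one more deployable, one more set of credentials, and one more thing to monitor, which is exactly the creeping complexity this section is warning about.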
To be sure, this is still only a partial, or high-level, look at this overall architecture. Minutes ago, this architecture seemed pretty straightforward: use EMR clusters that are "right-sized" for your computational needs, leverage storage that is inexpensive and decoupled, open source applications for the ETL/data integration, and an open source distributed SQL engine, while positioning a job orchestration application that does only the most basic of tasks. Our architecture is still incomplete, but it's clearly far more involved now, and that's kind of the point here. Inevitably, addressing a simple requirement (a job orchestration application), even while being cognizant of positioning it correctly by limiting it to only basic functionality, is going to lead to challenges, and can lead to disaster.
Don't underestimate the impact of even the seemingly smallest of things. There are dozens of options for something as simple as a job orchestration application. Whichever one you choose, recognize they all suck because they've spent way too much time on functionality you shouldn't use, and minimize its usage to only the basic functionality you require so that you can pivot to new and emerging applications and options. Consider leveraging a choreography approach instead of orchestration where possible. Understand that every decision has ramifications, and that every technology and/or application we introduce adds new complexities. Don't be clever: keep things as simple as possible, and abstract complexities away. Recognize that technologies change over time and design your architectures to take advantage of that by component-izing functionalities.
Most importantly, never forget that the platform and the architecture are but means to an end. The real value to an organization comes from the analytics and data science themselves, not the applications that are designed to support them. Tech should be like a good umpire in a baseball game: if you didn't notice them, they did a good job. Enable and empower your organization by minimizing visibility into the complexities that inevitably exist, and for goodness' sake, don't let something as simple as a job orchestration application be the death of your architecture.