tl;dr The price of disk is quickly approaching zero, but shifting from an RDBMS that performs updates natively to an immutable file (or object) system is more involved, and requires that we approach modeling and managing our data differently. Applying updates is an age-old capability; how do I accomplish it in this brave new world?
If you've read anything of mine before, you know I'm a huge proponent of AWS S3, and I'm also a big advocate of using RDBMS systems more sparingly than in the past, because technology has afforded us the ability to do so. That doesn't mean that I am anti-RDBMS. Rather, it means that we should consider where they make sense and use them in those settings; not necessarily everywhere we have data that is relational.
Case in point: when we're dealing with "big data" (damn you, marketing folks, for making that term mainstream), like we would in a Data Science and Analytics platform. Here, we absolutely have relational data, but to automatically assume that means we have an RDBMS in the picture is incorrect. Some of this has to do with costs. If we're dealing with petabytes of data, we need an RDBMS that can handle both that workload (compute) and the disk it takes up. Both compute and storage are finite within a traditional RDBMS, whether that be MySQL, Oracle, Snowflake, Redshift, Teradata, or what-have-you, because the compute and storage are tightly coupled (on the same machines). Obviously, as we accumulate more data, we need more disk, and those servers also contain additional compute resources, and all of that adds up to more costs. Further, because we recognize the value of data over time, we rarely see disk usage decrease, and when it does, someone had to make some tough decisions that may end up biting them.
Take, for instance, a recommendation engine. Perhaps we're recommending you a book to read, and to do that, we are accumulating books that you and others have read, as well as crowdsourcing opinions/reviews of those books. We could use something like a K-nearest neighbors machine learning algorithm that will recommend a book to you based on other things you've read, trained against a history of accumulated data that includes other people who liked that book and what other books they liked. Over time, we continue to collect more and more information, and before we know it, we've filled up the disk we have available. Sure, we could delete the oldest information, but is old book review data irrelevant after a point in time? Of course not. More realistically, we could aggregate or summarize historic data and continue to make it available for model training. To be fair, if we have summarized the historic data, we may be able to delete the detailed data from those book reviews, but in any event, the total amount of data that we're persisting inevitably continues to increase.
This makes file-based systems extremely attractive. I'll use S3 in this post as an example, because I am most familiar with it, but other cloud providers have similar offerings. Technically, S3 is an immutable, flat, object storage service, as opposed to a file-based system:
“Object storage essentially bundles the data itself along with metadata tags and a unique identifier.” — Cloudian.com
As opposed to a traditional file-based system:
“You name your files/data, place them in folders, and can nest them under more folders to form a set path. In this way, files are organized into a hierarchy, with directories and sub-directories. Each file also has a limited set of metadata associated with it, such as the file name, the date it was created, and the date it was last modified.” — Cloudian.com
These are important distinctions, but I'll be lazy and, for the sake of this post, use the terms (albeit slightly incorrectly) synonymously. The bigger challenge to me, and the point of this post, is that we cannot simply extract the data from our RDBMS, place it into an immutable file system, and pretend that we're done. One reason for this is that key word: immutable.
“In object-oriented and functional programming, an immutable object (unchangeable object) is an object whose state cannot be modified after it is created. This is in contrast to a mutable object (changeable object), which can be modified after it is created.” — Wikipedia
An RDBMS allows us to update data on disk. In real life, those updates are an abstraction: the database manipulates pointers to determine which version of the data on disk to return (the newest). Let's use some visuals to show this approach.
I'm going to start by talking a little about Apache Kafka. Kafka is a distributed streaming platform, and at the time of this writing, is arguably the preeminent offering in the market for what it does (AWS is now offering a managed Kafka service in addition to their traditional offering in the same space, Kinesis). Like S3, Kafka's log (the data on disk) is immutable. When you perform a logical update in Kafka, a new record is appended at a later offset in the log, and reads surface the most recent version of the object. Technically, no update to data on disk has occurred, but the "pre-update" data is superseded and the "post-update" data is made available via this approach:
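The idea can be sketched in a few lines of Python. To be clear, this is a toy model of an append-only log with "latest offset wins" reads, not Kafka's actual storage engine; the key and values are invented for illustration.

```python
# Toy model of an append-only log: records are never modified in place,
# and a logical "update" is just a newer record at a higher offset.

log = []  # the immutable log: (key, value) pairs; list index = offset

def produce(key, value):
    """Append a new record; existing entries are never touched."""
    log.append((key, value))
    return len(log) - 1  # the new record's offset

def latest(key):
    """A logical read: find the record at the highest offset for this key."""
    for k, v in reversed(log):
        if k == key:
            return v
    return None

produce("customer-42", {"address": "123 Main Street"})
produce("customer-42", {"address": "456 Oak Avenue"})  # logical "update"

# Both records still exist in the log; reads surface only the newest one.
print(latest("customer-42"))  # {'address': '456 Oak Avenue'}
```

The "pre-update" record is still physically present at offset 0; it is simply no longer what readers see.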
This is surprisingly similar to what is happening under the covers within an RDBMS. ACID is a set of properties of database transactions, which are applied in a serial order. When an update is applied, a new version of that data is created, and when that data is accessed, the newer version is returned:
Let's look at this with a practical example. First, imagine we have a customer table that looks like the one below. As a side note, this is a bad design: physical addresses are finite and are not attributes of the customer. The correct way to model this would be to snowflake it and separate the physical address (often referred to as a "door front") into its own table. Multiple people can live at a door front, and people move between door fronts. Further, stores have physical addresses, customers have multiple shipping addresses, and so on. But for now, we'll pretend that this is our customer table for the purposes of discussion.
When we query the address for a specific customer, it will return this address (123 Main Street), which is stored in some physical location (I called it 1001 in the following example). The data on disk contains a single version of the address, as depicted below:
Now, imagine we need to update the physical address of a given customer, which results in a new version of the data on disk, and adjusts the pointer, if you will, so that anyone accessing that customer's address will return the newest transaction, which is located at a separate location on disk, as shown below:
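A minimal sketch of that version-pointer mechanic, assuming the physical location 1001 from the example (the customer key and addresses are invented for illustration):

```python
# Sketch of versioned storage: an "update" writes a new version at a new
# physical location and moves a pointer; old bytes are never overwritten.

disk = {}             # physical location -> row data
current_version = {}  # logical key -> physical location (the "pointer")
next_location = 1001  # starting disk location, as in the example above

def write(key, row):
    global next_location
    disk[next_location] = row             # new version at a new location
    current_version[key] = next_location  # repoint the logical key
    next_location += 1

def read(key):
    return disk[current_version[key]]  # always resolves to the newest version

write("cust-7", {"address": "123 Main Street"})  # stored at location 1001
write("cust-7", {"address": "456 Oak Avenue"})   # new version at 1002

print(read("cust-7"))  # {'address': '456 Oak Avenue'}
print(len(disk))       # 2 -- both versions remain physically on "disk"
```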
Why is this important? Because in an immutable file system, we still need to functionally perform updates, and we're going to accomplish that by essentially performing this same action. The RDBMS abstracts all of this away as out-of-the-box functionality, but we can do the same thing ourselves with just a bit more consideration.
Before continuing, the most important thing to keep in mind is that an update is simply organizing the data in such a way so that when it is accessed, we return the most recent version of the data. Other writings about this topic always seem to include the phrase "accountants don't use erasers", which is a great analogy. When they enter a transaction that later needs to be removed or updated, they make a new entry that reverses the transaction. They don't go back and change the original one.
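The accountant's approach can itself be sketched as an append-only ledger (the entries and amounts here are invented for illustration):

```python
# Sketch of "accountants don't use erasers": a mistake is corrected by
# appending a reversing entry, never by changing the original one.

ledger = []  # append-only list of (description, amount) entries

ledger.append(("office chairs", 500))
ledger.append(("reverse: office chairs", -500))    # reversal, not an erasure
ledger.append(("office chairs (corrected)", 450))

balance = sum(amount for _, amount in ledger)
print(balance)      # 450 -- the current state
print(len(ledger))  # 3   -- the full history is preserved
```

The current state is always derivable from the full history, and the history is never rewritten.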
We're going to be taking this same ACID/accountant approach with our data on S3, except that instead of applying this transactionally, we're going to do it in batch.
Let's start by envisioning our customer data in a logical partition on S3. This logical partition is little more than a grouping of objects in an S3 location. Side note: S3 is a flat object store, but even AWS has decided to make it more approachable by introducing the concept of "folders":
So, I feel okay about using the term, too. Basically, imagine we have an S3 bucket (myBucket), which contains a folder named customer, and within that folder (partition), we have a series of objects that contain all of our customer information, spread across those objects (which we can, for all intents and purposes, think of as a set of "files"):
Our Hive metastore contains information about our customer table, including the S3 location where the associated data is stored. Whenever anyone runs a query like select * from customer, the Hive metastore knows that the data is located at s3://myBucket/customer, retrieves the data stored in the objects/files within that folder, and everything works as expected.
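Logically, the metastore's role here is just a table-to-location mapping. Here's a toy stand-in (not the real Hive metastore; the bucket and object names are the hypothetical ones from this post):

```python
# Toy stand-in for a Hive metastore: a mapping from table name to the
# S3 location that currently holds that table's data.

metastore = {"customer": "s3://myBucket/customer"}

# Toy "S3": location prefix -> the object keys stored under it
s3 = {"s3://myBucket/customer": ["part-0000.parquet", "part-0001.parquet"]}

def scan_table(table):
    """What happens logically on SELECT * FROM <table>: resolve the
    table's current location, then read every object under it."""
    location = metastore[table]
    return s3[location]

print(scan_table("customer"))  # ['part-0000.parquet', 'part-0001.parquet']
```

The important property: queries never hard-code a path; they always go through the metastore's pointer.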
Now, imagine we want to perform an update. Because S3 is immutable, and because we don't know which files contain which rows, we have to take a different approach, although logically, it works essentially the same as we've been discussing. We're going to introduce a new concept here, which we'll call the batch_identifier. It is nothing more than a new folder in S3, nested within the existing customer folder, thus creating a hierarchical structure, like we'd have in a true file system:
Our new folder is named batch_identifier=123, and the Hive metastore now contains an entry that says "whenever someone queries the customer table, pull all the objects within s3://myBucket/customer/batch_identifier=123", and things continue to work as they did earlier (it doesn't matter how many "folders" exist within the hierarchy).
Here's where things get interesting. We cannot perform individual row updates, as we discussed earlier, but we can perform all updates in batch, which is a normal approach to data integration/ETL anyway. Often, you will see an ETL process that runs nightly, which inserts all new data into the RDBMS and performs all required updates (and deletes), too. In normal analytics and data warehousing platforms, those data mutations (inserts/updates/deletes) are done via batch processing, not transactionally as updated data arrives. So, let's imagine that we have a handful of updates, and a handful of new data (traditional inserts). The way we handle it here is simply to create a new batch_identifier folder (e.g. batch_identifier=456) and we'll restate the data in its entirety into this new folder. The output, upon completion, will look like this:
Once you write the new data (batch_identifier=456), you tell the Hive metastore "the customer data is now located in s3://myBucket/customer/batch_identifier=456", which is accomplished via an ALTER TABLE...SET LOCATION HiveQL command. Now, whenever the data for the CUSTOMER table is accessed, it will retrieve the information from the objects (files) in the new partition/location on S3.
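Putting the pieces together, the batch "update" is: write the full restatement to a new folder, then swap the pointer. A minimal sketch, with the metastore again modeled as a dict and the paths and batch numbers taken from the example above:

```python
# Sketch of a batch update against immutable storage: the restated data
# lands in a new batch_identifier folder, then the metastore is repointed.

metastore = {"customer": "s3://myBucket/customer/batch_identifier=123"}

def publish_new_batch(table, base_path, batch_id):
    """Write the restated data to a new folder, then swap the pointer."""
    new_location = f"{base_path}/batch_identifier={batch_id}"
    # ... here the ETL job would write the full, restated dataset ...
    ddl = f"ALTER TABLE {table} SET LOCATION '{new_location}';"
    metastore[table] = new_location  # what the real metastore does via the DDL
    return ddl

print(publish_new_batch("customer", "s3://myBucket/customer", 456))
# ALTER TABLE customer SET LOCATION 's3://myBucket/customer/batch_identifier=456';
```

Because queries always resolve the location through the metastore, the swap is atomic from the reader's point of view: they see the old batch right up until the DDL runs, and the new batch immediately after.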
You'll no doubt notice that the original batch_identifier folder (=123) continues to exist. At this point, perhaps once you've validated the data, you can reasonably delete that data, or perhaps apply an object lifecycle management policy to it, so that it expires after some number of days or is migrated to even less expensive long-term storage, like Glacier (which would give you a failsafe of being able to recover to a point in history). This general approach is more problematic if you are persisting your data in HDFS (an architectural approach I recommend against, as it tightly couples compute and storage, increases costs substantially, and limits scalability). In S3, though, where storage is seemingly infinite and very inexpensive, costs are minimized, you get recoverability, and you retain the ability to perform bulk updates in essentially the same manner as traditional batch ETL.
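For the lifecycle option, a rule might look roughly like the sketch below: transition the retired batch folder to Glacier after 30 days and expire it after a year. The bucket, prefix, and day counts are hypothetical, and note one caveat: putting a lifecycle configuration replaces the bucket's existing rules, so in practice you'd merge this into your full rule set rather than target individual batch folders one at a time.

```python
# Hypothetical S3 lifecycle rule for a retired batch folder: move it to
# Glacier after 30 days, delete it after 365. Tune to your retention needs.

lifecycle_config = {
    "Rules": [
        {
            "ID": "retire-old-customer-batches",
            "Filter": {"Prefix": "customer/batch_identifier=123/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it would look roughly like this (requires AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="myBucket", LifecycleConfiguration=lifecycle_config
# )
```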
Further, you can apply this concept to data in mini-batch, which is to say, perhaps every four hours. With each execution, a new batch_identifier folder is created, and the data is again restated in full. To reinforce an earlier point, this is key: whenever you write a new batch_identifier folder, you must restate the data in its entirety, because the Hive metastore is only going to know about a single partition/location for this particular table.
To be clear, it is absolutely possible (and is normally the case) that your Hive metastore would know about multiple partitions. The customer table being discussed here is an exception, since it represents the entirety of the customer base as we know it at any point in time. But for something like purchases, for instance, we'd likely add another level to our S3 object hierarchy, representing the date, as shown below:
And with each new batch we write, we inform the Hive metastore that the purchase data for a given date is located at a particular S3 location. Users can then use the partitions as part of their WHERE clauses (e.g. SELECT * FROM purchases WHERE transaction_date = 20190927). Failure to include the transaction_date would result in pulling all the purchase data, which is going to be less efficient. Logical updates to the data work exactly the same. We simply add a new batch_identifier to the data, and update the Hive metastore accordingly. In the scenario below, perhaps we updated the data (restated it in full) for 20190927. We write the new data, and update the Hive metastore to reflect the updated S3 location for that particular date, so that whenever a user extracts the data for that date, it now returns the information from the batch_identifier=989 folder. Any access of the data from 20190928 continues to pull information from the same batch_identifier folder as it did previously (batch_identifier=772):
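The same pointer-swap sketch extends naturally to a date-partitioned table: each partition carries its own current batch_identifier folder, and a restatement repoints only the affected date. The batch numbers 989 and 772 come from the scenario above; the pre-update batch number for 20190927 (881) is invented for illustration.

```python
# Sketch of per-partition repointing for a date-partitioned table.

partitions = {  # transaction_date -> current S3 location for that date
    20190927: "s3://myBucket/purchases/transaction_date=20190927/batch_identifier=881",
    20190928: "s3://myBucket/purchases/transaction_date=20190928/batch_identifier=772",
}

def restate_partition(table, date, batch_id):
    """Write a full restatement for one date, then repoint just that partition."""
    new_location = (
        f"s3://myBucket/{table}/transaction_date={date}/batch_identifier={batch_id}"
    )
    # ... ETL writes the restated data for this date here ...
    partitions[date] = new_location
    return (
        f"ALTER TABLE {table} PARTITION (transaction_date={date}) "
        f"SET LOCATION '{new_location}';"
    )

print(restate_partition("purchases", 20190927, 989))
# Only 20190927 moves; 20190928 still points at batch_identifier=772.
```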
As they say, immutability changes everything, but the reality is that the steps that take place to perform an update in an immutable file system are essentially exactly the same as they are in log based technologies like Kafka, and they're essentially the same as ACID transactions in a traditional RDBMS. The full restatement of data is different, and likely feels like overkill at first glance, but that's one beauty of distributed systems: we have the computational power to be able to process large amounts of data quickly. I've worked at organizations with petabytes of data and tens of millions of customers or more, and the approach herein scales effectively.