Companies like to complicate things.
There’s no better way to say it.
The law of entropy clearly applies as much to tech stacks and data management as it does to closets and desk drawers.
In the world of databases, data management, and data platforms, this entropy usually takes the form of a simple database or data platform, ideal for its early use cases, evolving (or rather, devolving) into an expensive and unmanageable nightmare under the operational strain of use-case gluttony.
The most obvious and common way this happens is when companies try to evolve their caches into data platforms that can, for example, be used as highly available enterprise key-value stores for volatile data.
This post explains how this typically happens and why it’s a bad idea.
Table Of Contents
- The Cache-Turned-Data Platform Issue Explained
- Step 1: A simple cache
- Step 2: A cache with local storage for when you need to restart
- Step 3: A “write through” cache
- Step 4: A highly available cache
- Step 5: A scaled cache
- Step 6: Cache as streaming data platform
- Step 7: Trying to cope with write contention
- Step 8: Hitting a wall
- Step 9: Volt
The Cache-Turned-Data Platform Issue Explained
Caching is easy with large, nearly static data sets but very difficult with volatile (i.e., rapidly changing) and/or complex data.
Companies typically start out with a simple, harmless cache, then slowly add more functionality until the cache ends up looking a lot like a data platform—because it is a data platform. Software always evolves, but not always in a good way. Sometimes it fails to meet the new need, and sometimes it meets it only by creating technical debt. Once a product or API is in widespread use, radical, non-additive change is really hard to do.
Let’s look at a typical scenario involving the javax.cache API, also known as JSR107. A number of products implement or shadow JSR107, such as:
- Hazelcast
- GridGain
- Oracle Coherence
- Terracotta Ehcache
- Infinispan
JSR107 is influential, and references to javax.cache are common in code. But if you read the list above, you will see that not all of these are caches. In fact, many of these products appear to be solving a market need for something that looks like a cache but is really something else. And while you don’t need JSR107 to build a cache, any cache you do build will hit the same issues.
Step 1: A simple cache
Imagine you start with a basic cache with support for ‘Get’ and ‘Put’ operations: you need to store data in local RAM so you can access it quickly, but you don’t want to mess with low-level code. Why can’t you just use an abstract cache that loads data lazily? How hard can it be? It’s a HashMap with a few extra features, after all… so you implement one.
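To make this concrete, here’s a minimal sketch of that first cache: a ConcurrentHashMap plus a lazy loader. The class name and the loadFromBackend function are purely illustrative, not taken from any particular product.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A minimal lazy-loading cache: a ConcurrentHashMap plus a loader function.
// 'loadFromBackend' stands in for whatever expensive lookup you're avoiding.
public class SimpleCache<K, V> {

    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loadFromBackend;

    public SimpleCache(Function<K, V> loadFromBackend) {
        this.loadFromBackend = loadFromBackend;
    }

    // 'Get': return the cached value, loading it lazily on first access.
    public V get(K key) {
        return entries.computeIfAbsent(key, loadFromBackend);
    }

    // 'Put': overwrite whatever is cached for this key.
    public void put(K key, V value) {
        entries.put(key, value);
    }
}
```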
Step 2: A cache with local storage for when you need to restart
Your basic cache worked really well and became an essential part of the system, as the backend couldn’t support the read workload on its own. But this created a problem: if you need to restart the cache while the system is running, it will take ages to refill, and you’ll hit performance problems while it does. Remember: the whole reason you adopted a cache was that the backend system couldn’t handle the load. So you move to a cache that stores its contents on a local disk, which means it can recover gracefully from a restart. This is why JSR107 has methods like ‘close’ and ‘clear’.
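One way to picture this stage is a small snapshot-to-disk helper like the hypothetical sketch below. It assumes serializable keys and values, and the class and method names are illustrative; real products do this with far more sophistication (append-only logs, compaction, and so on).

```java
import java.io.*;
import java.util.HashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: snapshot the cache contents to a local file on shutdown
// and reload them on startup, so a restart doesn't begin with an empty cache.
// Assumes keys and values are Serializable; the file location is illustrative.
public class PersistentCacheStore<K, V> {

    private final File snapshotFile;

    public PersistentCacheStore(File snapshotFile) {
        this.snapshotFile = snapshotFile;
    }

    // Called from something like JSR107's close(): write the entries to disk.
    public void save(ConcurrentMap<K, V> entries) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(snapshotFile))) {
            out.writeObject(new HashMap<>(entries)); // copy to a plain, serializable map
        }
    }

    // Called on startup: reload the snapshot, or start empty if there isn't one.
    @SuppressWarnings("unchecked")
    public ConcurrentMap<K, V> load() throws IOException, ClassNotFoundException {
        if (!snapshotFile.exists()) {
            return new ConcurrentHashMap<>();
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(snapshotFile))) {
            return new ConcurrentHashMap<>((HashMap<K, V>) in.readObject());
        }
    }
}
```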
Step 3: A “write through” cache
The next ‘ask’ is a ‘write-through’ capability: you need to be able to change the cache and then flush the changes to a backend system, because if you update the backend system directly, anyone reading the cache will get stale information until you reload it. At this stage, your cache is starting to look a bit like a database, especially if your ‘write-through’ functionality works even when the backend system is down. You’ve now made your cache the ‘system of record’ without ever planning to do so.
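A hand-rolled write-through cache might look something like the hypothetical sketch below, where writeToBackend stands in for a JDBC update, REST call, or similar. The interesting part is the catch block: the moment you decide to keep the cached value when the backend write fails, the cache has quietly become the system of record.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Hypothetical write-through sketch: every put updates the cache and then
// pushes the change to the backend. 'writeToBackend' is a stand-in for a
// JDBC update, REST call, etc.
public class WriteThroughCache<K, V> {

    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final BiConsumer<K, V> writeToBackend;

    public WriteThroughCache(BiConsumer<K, V> writeToBackend) {
        this.writeToBackend = writeToBackend;
    }

    public void put(K key, V value) {
        entries.put(key, value);
        try {
            writeToBackend.accept(key, value);
        } catch (RuntimeException backendDown) {
            // Keep the cached value and hope to flush it later?
            // Congratulations: your cache just became the system of record.
        }
    }

    public V get(K key) {
        return entries.get(key);
    }
}
```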
Step 4: A highly available cache
You’ve now evolved from a simple cache into something rather more elaborate. The next ‘ask’ is for high availability, because what you do is now too important to rely on a single server. At this point, about 80% of the time people hack the codebase to create an Active/Passive pair of servers, generally with mixed results. You can read from the passive node, but you can’t write to it, and the passive node won’t be 100% up to date either. This brings you to your next big issue: scaling writes.
Step 5: A scaled cache
At some point, you need to provide a cluster of ‘cache’ servers to support your ever-increasing write workload. This unleashes a tide of complexity under the covers, as who is allowed to modify what and where becomes a major issue. You are unquestionably and unarguably in cache-as-database territory at this point.
Step 6: Cache as streaming data platform
You now have many, many users, so a requirement for a change listener appears. Change listeners allow you to listen in to changes made by other people. They are the kind of thing that looks simple on a whiteboard but can be an implementation nightmare. You are now dealing with a lot of complexity, especially in edge cases where one node in your cluster goes down. While up to now you’ve slowly (albeit unintentionally) evolved your ‘cache’ into a ‘database’, you’re now taking a big leap and turning it into a ‘streaming data platform’, where you find out about events in the real world and then tell other people about them. There may not even be a 1:1 relationship between input and output records anymore. Remember how this all started out as a simple cache?
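In JSR107 terms, this usually means implementing the listener interfaces in javax.cache.event, roughly as in the sketch below. How the listener gets registered and where the events go downstream (a message queue, another service) varies by product; publishDownstream here is just a placeholder.

```java
import javax.cache.event.CacheEntryCreatedListener;
import javax.cache.event.CacheEntryEvent;
import javax.cache.event.CacheEntryListenerException;
import javax.cache.event.CacheEntryUpdatedListener;

// A JSR107 change listener that forwards every created/updated entry to some
// downstream consumer -- effectively the first draft of a "streaming" pipeline.
public class ChangePublishingListener<K, V>
        implements CacheEntryCreatedListener<K, V>, CacheEntryUpdatedListener<K, V> {

    @Override
    public void onCreated(Iterable<CacheEntryEvent<? extends K, ? extends V>> events)
            throws CacheEntryListenerException {
        events.forEach(e -> publishDownstream(e.getKey(), e.getValue()));
    }

    @Override
    public void onUpdated(Iterable<CacheEntryEvent<? extends K, ? extends V>> events)
            throws CacheEntryListenerException {
        events.forEach(e -> publishDownstream(e.getKey(), e.getValue()));
    }

    private void publishDownstream(K key, V value) {
        // Placeholder: hand the change to whatever wants to hear about it.
        System.out.println("changed: " + key + " -> " + value);
    }
}
```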
Step 7: Trying to cope with write contention
And here’s where things start to get ugly. Since you’re now doing a lot of writes to your cache, you start to hit issues where multiple people try to update the same thing at the same time and overwrite each other’s work. The JSR107 spec addresses this with a ‘replace’ method where you pass in the old record as well as the new one, and the write only proceeds if your old one is still in place. This is a poor man’s row-level lock: it prevents people from corrupting the data at the expense of forcing them to retry, possibly multiple times. This shows up as long-tail latency in performance-sensitive applications, and there’s also the cost of sending multiple copies of a potentially large object across the wire and doing an expensive byte-by-byte comparison at the other end.
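In practice, that ‘replace’ dance tends to look like the retry loop sketched below (it assumes the entry already exists); every failed attempt means another round trip carrying both the old and new values.

```java
import javax.cache.Cache;
import java.util.function.UnaryOperator;

// The optimistic 'replace' loop the spec pushes you toward: read the old value,
// compute a new one, and only write if nobody else got there first. Under
// contention this loop is where the retries -- and the long-tail latency -- live.
public final class OptimisticUpdate {

    private OptimisticUpdate() {}

    // Assumes the entry for 'key' already exists in the cache.
    public static <K, V> void update(Cache<K, V> cache, K key, UnaryOperator<V> change) {
        while (true) {
            V oldValue = cache.get(key);
            V newValue = change.apply(oldValue);
            // Both old and new values cross the wire; the server compares the
            // old one before deciding whether the write succeeds.
            if (cache.replace(key, oldValue, newValue)) {
                return;
            }
            // Someone else changed the entry first -- go around again.
        }
    }
}
```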
The last step of this evolution attempts to solve this problem by introducing the concept of ‘invoking’ chunks of code on the server itself, instead of reading the data, changing it, and sending it back. This is done using an implementation of EntryProcessor. This avoids the problems you hit in the previous step but at the expense of significant complexity. At the end of the day, if you need to be right next to the data to safely change it, a cache is a bad architectural construct to start from.
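A server-side change via an EntryProcessor might look roughly like this sketch (the balance-adjustment use case and class name are illustrative):

```java
import javax.cache.processor.EntryProcessor;
import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;

// Server-side change via a JSR107 EntryProcessor: the logic runs next to the
// data, so there is no read-modify-write round trip to lose a race over.
public class AddToBalanceProcessor implements EntryProcessor<String, Long, Long> {

    @Override
    public Long process(MutableEntry<String, Long> entry, Object... arguments)
            throws EntryProcessorException {
        long delta = (Long) arguments[0];
        long current = entry.exists() ? entry.getValue() : 0L;
        long updated = current + delta;
        entry.setValue(updated);   // the mutation happens where the data lives
        return updated;
    }
}
```

The call site would look something like cache.invoke("account-42", new AddToBalanceProcessor(), 5L): the logic ships to the data rather than the data shipping to the logic.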
Step 8: Hitting a wall
Now you’ve got something with the API of a cache, but it is very clearly not a cache. In fact, your requirements closely match what you’d need for an enterprise data storage platform.
This is a ‘good news’/’bad news’ situation.
The ‘bad news’ is that your cache has reached its natural architectural end state. The law of diminishing returns has kicked in, and it’s questionable whether each dollar of investment in the platform gets you a dollar of value.
The ‘good news’ is that you’re still using a standard and well-understood API, and switching platforms is possible.
This is where Volt may help.
Step 9: Volt
Volt is far more than a Java cache, but it can stand in for one, and we’ve got a running implementation of a JSR107-compliant cache you can try out.
While this isn’t necessarily the best way to deploy and use Volt, since it has some limitations, it does give people a quick and easy way to evaluate Volt Active Data against workloads currently running on JSR107-related variations on the theme of a cache.
Contact us for further information.