The Use of Sagas by the Kinetic Engineering Team
One of our goals for our latest release is to make our forms and workflow more intuitive to work with. Previously, forms and workflow were designed to work independent of each other. They were defined in two distinct steps with a third step required to relate them. At the time of development, this was the best solution that leveraged available technologies that would give us the best outcome. Currently data that makes up the forms in our system is stored in the Cassandra database while data that defines and tracks the running of workflows is stored in a relational database, like Postgres.
While at times it is necessary, by adding a second database to a solution introduces complexities. These can be a challenge for the application developer, and if not handled properly, the end-user may start to notice the result of these complexities (latency being a top example). From the application developers perspective, the simplest solution would be to have both pieces of data (the forms and their workflow) stored in the same database that supports ACID transactions. ACID stands for Atomicity, Consistency, Isolation, and Durability. All four of these components are easier to manage on a single database. Our main focus is on atomicity, which is the guarantee that when a user adds or modifies workflows, we can attempt to make the changes to both tables and be rest assured that either the whole transaction will succeed or none of it will succeed.
Our problem is we have two databases, one for workflows and one for forms. Therefore, there is no atomicity. We perform each operation independently and account for the possibility that the first operation succeeds while the second one fails. This accounting is done by adding logic that looks for inconsistencies when retrieving workflows. Data from both databases is retrieved and compared. Any discrepancy is then reported with the resulting list of workflows. Additionally, an API function exists that allows a user to repair the inconsistencies. These accommodations add quite a bit of overhead to the implementation of the feature (lines of code, test cases, etc.) and there’s still the possibility the user needs to be involved in cleaning up the data.
TL;DR summary of this is that data consistency between multiple databases is hard to guarantee. We try, but we can do better.
In an attempt to improve our data consistency within multiple database environments, we found a pattern to help us solve these problems more effectively. The pattern, known as Sagas, was formally introduced in the following paper at Cornell. It has gained popularity recently despite the concept being dated back to 1987. This is due to the rise of microservice architecture. Microservice architecture introduces complexity to different applications that are responsible for a number of different pieces of business logic. Due to the division between different applications, the developer can no longer rely on ACID transactions.
To implement a solution using the Saga pattern, you first break your problem into the sequence of sub-transactions. The solution may have as few as two sub-transactions but there is no theoretical limit on the maximum number and they are referred to as T1, T2, …, Tn. On the happy path, before each step, we record an event in a log that says we are attempting sub-transaction Tn and only after that event is logged, do we attempt the sub-transaction. If the sub-transaction succeeds, then we record another event in the log, acknowledging the success.
The existence of this log in the Saga system removes a lot of state from the applications that need to execute the transactions. For instance, one application server can execute sub-transactions T1 through T3 and abruptly stop after acknowledgement of T3. Then another application starts by reading the contents of the log and can determine if it should resume at T4.
In the event that a sub-transaction fails, a negative acknowledgement is recorded. Once that event is recorded, we are presented with two options. The best case is that the action is idempotent, meaning that it can be retried with the same parameters without consequence. If it is, we would record an event marking the attempt to execute the sub-transaction and execute it again. If the sub-transaction is not idempotent, we need something called a compensating transaction. The paper above defines a compensating transaction as one that semantically undoes the actions performed by a sub-transaction. The topic of compensating transactions is robust enough of a topic to warrant its own discussion entirely, but for now, just imagine our sub-transaction Tn simply creates a record and the compensation Cn deletes that record.
While this summary skips some of the details included in the Sagas paper above, what we end up with is a system that guarantees that we either execute all sub-transactions successfully. Or for any sub-transactions that were executed until a failure occurred, we will have executed the compensating transactions of the previous steps. In a way, we have regained the atomicity we lost when moving towards a distributed system, either all of the sub-transactions will be applied or none of them will.
For additional context and breakdown on Sagas, I recommend this talk by Caitie McCaffrey.
At Kinetic Data, we have been experimenting with using the Saga pattern to solve problems like the one with our distributed forms and workflow. The current goal is to create a generic Saga execution engine that can be applied to any problem. In the end, each problem that deals with distributed data becomes an exercise of breaking up the problem into a series of sub-transactions and corresponding compensating transactions.
Something interesting to consider is how closely our challenge of storing data in two databases mirrors our customers’ challenge of trying to tie together the variety of systems that make up their business solution. The long term hope is that customers can benefit directly from the same Sagas system we use to stitch our application together by using it to stitch their own business applications together.
A lot of implementation details remain to be decided. However, we are really excited about the prospect of being able to improve how we keep data consistent in our distributed system while also making the application easier to work on and maintain.