Reliable event dispatching using a transactional outbox

Using events in a system is great, but how do you know for sure if you've reliably dispatched your events? The transportation of events needs to be done reliably while maintaining overall system consistency, be it eventual or immediately. In a typical setup, a database is used to store information and queues are used to send messages between processes and systems. Often, the events are dispatched directly to the queue in the same action that stores information in the data. This may not seem to be obviously wrong, but there is a potential problem with this approach.

Non-transactional dispatching

When events are dispatched directly to the queue, there are two network requests happening in a single request or operation, and that creates a problem. Two network requests can not be made atomic (to fail or succeed together).

To illustrate, here is a quick rundown of what might happen:

Store data first, then dispatch the message
- If storing the data fails the operation is failed, unfortunate but not causing inconsistency.
- If storing the data is OK but publishing events fails the outside world is not kept up to date.
Dispatch the message, then store the data (never do this)
- If dispatching the message fails the operation is failed, unfortunate but not causing inconsistency. But ask yourself, would you be OK with rolling back just because of this?
- If dispatching the event is OK but storing the data fails the outside world is AHEAD of the internal consistency (this is madness).

The computer science laws literally limit our options and we need to rethink our approach. Even though the likelihood of this happening can be reduced by using retries, you can never fully prevent it. At a certain scale, these kinds of inconsistencies can and will hurt you. Let's see how transactional outboxes can help.

Transactional dispatching

Our goal is to make the dispatching of events fail or succeed together with storing the application state. To do this, we need to make sure events are dispatched in the same operation used to store state in the database. This means we need to store the events in the database, in a buffer table. By doing so, we're effectively using the database as a queue. Now I can already hear you say "but using a database as a queue is bad", and generally I would agree with you. Using a database as a queue is generally not a good idea, so why is it OK in this case?

As with many software things, it’s all about the trade-off. While using a database for a queue is a bad default, the consistency it provides can be more important for your use-case. If we want to make sure we are not losing any messages, consistency is the most important aspect, justifying the trade-off.

It's still important that we limit the use of a database, which we can do by only using it to publish the messages. A regular queue is still the piece of infrastructure responsible for the actual delivery of the messages to the consumers. Queues are still far better for consumers, they provide more elaborate features, such as routing, retries, dead-lettering, the list goes on.

From a bird's-eye view, the entire process looks like this:

This setup, although it's more complex, is very reliable. We can guarantee that the messages or events are dispatched when, and only when, the state of the application is successfully stored.

Let's dig a little deeper to see what is required to make this work well for us.

Required setup

A good transactional outbox setup requires a couple of elements, let's go over each item.

A database that supports transactions
Most typical relational databases work fine; PostgreSQL, MySQL, MariaDB, pick your poison.
A buffer table for dispatched events
Inside your database, create a table that can store events. This table and the table that contains your application state need to be usable in the same transaction.
A relay mechanism
When events are published to the buffer table, they need to be consumed and forwarded to the queue. The component that does this is often referred to as a relay. You can either write a daemon script yourself, or use an off-the-shelf solution like Debezium.
An actual queue
The messages are consumed from the buffer table by the relay and finally published into the queue, for all interested parties to consume.

Tying it all together

Now we know what the different parts are, how do we get this all to work? Well, I'm glad you asked. In pseudo code, a simplified view of the workflow looks like this:

database.beginTransaction();

try {
    domainModelRepository.persist(domainModel);
    databaseOutbox.dispatch(events);
} catch (e) {
    database.rollback():
    throw e;
}

database.commitTransaction();

When performing regular database operations we use transactions to make multiple queries succeed or fail together. Each transaction wraps multiple statements and if any one of them fails, all operations are rolled back. For the outbox, this means the insertion of the events happens in the same database transaction as the queries to update the application state. By using a transaction, storing the state and dispatching the events will either succeed for both, or fail for both.

Great! Now we’ve got transactional dispatching!

Getting messages to the consumer

Of course, the messages need to end up with the consumer. A background process named a relay picks up the messages and relays (forwards) them to a queue. Consumers consume the messages from the queue as if nothing ever happened, but we know better 😜 !

Important implementation aspects

While implementing outboxes, it's important to keep a couple things in mind.

Outbox tables should have exclusive reads
An outbox table can be read from by multiple processes, this can cause duplication in messages for consumers as the same messages are relayed multiple times. While consumers should ideally be idempotent, this duplication causes the signal-to-noise ratio to go way up. To remedy this outbox relays should use a concurrency locking mechanism to prevent competing reads. Multiple consumers run in every availability zone to ensure reliability, while the locking mechanism prevents double relays.
Outbox tables are consumed in order
Relays consume outbox tables in order of input. If the relay is unable to relay a message the relay gets blocked. It is therefore important to make sure the relay is kept dumb, meaning it is only responsible for getting messages from point A to point B. Direct execution of application code in the relay process is an absolute no-go.

In the case of consumer failure, the queue is blocked until a remediation is deployed. This is fundamentally different when compared to a queue like RabbitMQ. RabbitMQ allows out-of-order consumption, allowing failed messages to be processed at a later point in time.
Outboxes have limited throughput
Compared to queues like RabbitMQ or streaming messaging platforms like Kafka, an outbox table has degraded throughput. You can increase throughput by adding a sharding mechanism to your outbox table. This way you can increase the number of relays to match the number of shards, this will increase throughput. Regardless of which approach, it's advised to use specific outboxes per business process, ensuring the throughput of each business process remains unaffected by that of another.
Without cleanup, outboxes grow infinitely
Outbox tables are buffer tables. Buffers can grow but should eventually shrink back down. Make sure to think about buffer cleanup, removing messages from it you no longer need. This can be done using a cronjob, or your relay can delete messages as they are consumed.
Outbox tables can be used for disaster recovery
If you configure your outbox to store messages for a longer time, you can leverage the investment of introducing one by re-using it for disaster recovery cases. Re-dispatching of events can bring consumers back up to date without the need for complicated cross-system synchronisation mechanisms.

Conclusion

The transactional outbox pattern is a hugely under-exposed and possibly under-valued pattern that solves a real-world problem. If you're working on a system as scale, use events, and care about consistency, you should probably be using an outbox! If you are an EventSauce user, checkout the message outbox documentation to see how outboxes can be used.

Hope you liked this post, and until next time!