How to handle publishing event when message broker is out? - domain-driven-design

I'm wondering how to handle sending events when the message broker suddenly goes down. Please take a look at this code:
using (var uow = uowProvider.Create())
{
    ...
    ...
    var policy = offer.Buy(customer);
    uow.Policies.Add(policy);
    // DB changes are saved here! But what would happen if...
    await uow.CommitChanges();
    // ...eventPublisher throws an exception?
    await eventPublisher.PublishMessage(PolicyCreated(policy));
    return true;
}
IMHO, if eventPublisher throws an exception, the PolicyCreated event won't be published. I don't know how to deal with this situation. The event must be published in the system. I suppose the only good solution is to create some kind of retry mechanism, but I'm not sure...

I would like to elaborate a bit on the answers provided by both @Imran Arshad and @VoiceOfUnreason which are, of course, correct.
There are basically 3 patterns when it comes to publishing messages:
exactly once delivery (requires distributed transactions)
at most once delivery (no distributed transaction but may miss messages - like the actor model)
at least once delivery (no distributed transaction but may have duplicate messages)
The following is all in terms of your example.
For exactly once delivery both the database and the queue would need to provide the ability to enlist in distributed transactions. Some queues do not provide this functionality out-of-the-box (like RabbitMQ), and even though it may be possible to roll your own, it may not be the best option. Distributed transactions are typically quite slow.
For at most once delivery we have to accept that we may miss messages and I'm guessing that in most use-cases this is quite troublesome. You would get around this by tracking the progress and picking up the missed messages and resending them if required.
For at least once delivery we would need to ensure that the messages are idempotent. When we get a duplicate message (usually quite an edge case), it should be ignored, or its outcome should be the same as that of the initial message processed.
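To make the idempotence requirement concrete, here is a minimal Python sketch of a subscriber that ignores redelivered messages. The `message_id` field, the `PolicyCreatedHandler` class and the "welcome letter" side effect are all assumptions made up for illustration; a real handler would normally keep the processed IDs in the same database transaction as its side effect rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str   # hypothetical unique ID assigned by the publisher
    policy_id: int


@dataclass
class PolicyCreatedHandler:
    processed_ids: set = field(default_factory=set)
    welcome_letters: list = field(default_factory=list)

    def handle(self, message: Message) -> bool:
        """Process a message once; return False for a duplicate delivery."""
        if message.message_id in self.processed_ids:
            return False  # duplicate: ignore it, the outcome stays the same
        self.welcome_letters.append(message.policy_id)  # the side effect
        self.processed_ids.add(message.message_id)
        return True


handler = PolicyCreatedHandler()
first = handler.handle(Message("msg-1", policy_id=42))
second = handler.handle(Message("msg-1", policy_id=42))  # redelivery
```

The second call changes nothing, which is what lets the publisher retry freely.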
Now, there are a couple of ways around your issue. You could start a database transaction and make your database changes. Before you commit, you perform the message sending. Should that fail, your transaction would be rolled back. That works fine for sending a single message, but in your case some subscribers may already have received a message. This complicates matters, as either all your subscribers need to receive the message or none of them should.
You could have your subscriber check whether the state is indeed true and whether it should continue processing. This places a burden on the subscriber and introduces some coupling. It could either postpone the action should the state not allow processing, or ignore it.
Another option is that instead of publishing the event you send yourself a command that indicates completion of the step. The command handler would perform the publishing and retry until all subscriber queues receive the message. This would require the relevant subscribers to ignore those messages that they had already processed (idempotence).
The outbox is a store-and-forward approach and will eventually send the message to all subscribers. You could perhaps have your outbox be included in the database transaction. In my Shuttle.Esb service bus, one of the folks who used it came across a side effect that I had not planned. He used a SQL-based queue as an outbox, and the queue connection was to the same database. It was therefore included in the database transaction and would roll back with all the other changes if not committed. Apologies for promoting my own product, but I'm sure other service bus offerings may have the same functionality.
There are therefore quite a few things to consider and various techniques to mitigate the risk of a queue outage. I would, however, move the queue interaction to before the database commit.

For a reliable system you need to save events locally. If your broker is down, you have to retry publishing the event.
There are many ways to achieve this, but the most common is the outbox pattern. Just like in your mailbox, your event/message stays stored locally and you keep retrying until it's sent, and then you mark the message as published in your local DB.
You can read more about it here: Publish Events
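The "keep retrying until it's sent" part could look roughly like the Python sketch below. `BrokerUnavailable`, `FlakyBroker` and the backoff parameters are hypothetical names and values chosen for the example, not part of any particular broker client.

```python
import time


class BrokerUnavailable(Exception):
    """Hypothetical error raised while the broker is down."""


def publish_with_retry(publish, event, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call publish(event); on failure wait and retry with exponential backoff.

    Returns the number of attempts used, or re-raises after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            publish(event)
            return attempt
        except BrokerUnavailable:
            if attempt == attempts:
                raise  # give up for now; the event stays stored locally
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyBroker:
    """Test double that fails a fixed number of times before accepting."""

    def __init__(self, failures):
        self.failures = failures
        self.sent = []

    def publish(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise BrokerUnavailable()
        self.sent.append(event)


broker = FlakyBroker(failures=2)
attempts_used = publish_with_retry(
    broker.publish, "PolicyCreated:42", sleep=lambda seconds: None
)
```

Because the event is already saved locally, giving up after the last attempt loses nothing: a later run can pick the same event up again.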

You'll want to review Udi Dahan's discussion of Reliable Messaging without Distributed Transactions.
But very roughly, the PolicyCreated event becomes part of the unit of work; either because it is saved in the Policy representation itself, or because it is saved in an EventRepository that participates in the same transaction as the Policies repository.
Once you've captured the information in your database, retrying the publish is relatively straightforward: read the events from the database, publish them, and optionally mark the events in the database as successfully published so that they can be cleaned up.
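As a rough illustration of that flow, the following Python sketch uses an in-memory SQLite database. The table layout, `buy_policy` and `publish_pending` are assumptions made up for this example; the point is only that the policy row and the event row are written in one transaction, and that publishing happens afterwards from the stored rows.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE policies (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def buy_policy(conn, policy_id, customer, fail=False):
    """Save the policy and its PolicyCreated event in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO policies (id, customer) VALUES (?, ?)",
                     (policy_id, customer))
        conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                     ("PolicyCreated", json.dumps({"policy_id": policy_id})))
        if fail:
            raise RuntimeError("simulated failure before commit")


def publish_pending(conn, publish):
    """Publish unpublished events in order and mark each one as sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # may raise; row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)


buy_policy(conn, 1, "alice")
try:
    buy_policy(conn, 2, "bob", fail=True)
except RuntimeError:
    pass  # neither the policy nor its event was stored

sent = []
published_count = publish_pending(conn, lambda t, p: sent.append((t, p)))
```

If the process dies between publishing and the UPDATE, the event is published again on the next run, so this gives at-least-once delivery and subscribers still need to tolerate duplicates.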

Related

Selecting one producer for multiple consumers

In a Producer-Consumer case with multiple app instances, I know I am supposed to have some type of queue for the distribution of events to the consumers. But how do I deal with the producer?
I must query a database for objects with an expired deadline every minute. That will push work to a message queue, so distribution is not a problem. My concern is that if I have multiple instances of the app, I have to make sure that only one is producing work.
Am I supposed to solve this by electing a cluster leader? Is there a common algorithm or library in NodeJS for this? My guess is that I will have to reach for some magic Redis command and make my instances aware of each other.
There are always many different ways to achieve things, but my suggestion is to create an idempotent outbox table in your database, where multiple producers throw the records to be published to the message queue.
Then, you can deploy a tool like Debezium that does transaction log tailing (reads the database transaction log) and pushes the message to whatever message queue technology you're using.
Please note that it's also a good practice to implement the idempotency check on your consumers to make sure they don't process the same message twice.
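One way to read "idempotent outbox table" is a table with a unique key per unit of work, so that several app instances can run the same query without producing duplicates. The Python/SQLite sketch below shows that idea only; the `job_key` format and the `enqueue_expired` function are assumptions for illustration, and forwarding the rows to the queue (for example with Debezium) is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE outbox (
        job_key TEXT PRIMARY KEY,   -- hypothetical key: object ID plus deadline
        payload TEXT NOT NULL
    )
""")


def enqueue_expired(conn, instance_name, expired):
    """Record work for each expired object; return how many rows were new."""
    added = 0
    with conn:
        for object_id, deadline in expired:
            cursor = conn.execute(
                "INSERT OR IGNORE INTO outbox (job_key, payload) VALUES (?, ?)",
                (f"{object_id}:{deadline}", f"expired, found by {instance_name}"),
            )
            added += cursor.rowcount  # 0 when another instance got there first
    return added


expired = [(101, "2024-01-01T00:00"), (102, "2024-01-01T00:00")]
added_by_a = enqueue_expired(conn, "instance-a", expired)
added_by_b = enqueue_expired(conn, "instance-b", expired)
total_rows = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

With this shape no leader election is needed for correctness: every instance may produce, and the primary key decides which insert wins.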
Wix - How We Implemented Idempotency in a Billing System at Scale

Architecture issue - Azure servicebus and message order guarantee

OK, so I'm relatively new to Service Bus. I'm working on a project where we use Azure Service Bus for queueing messages. Our architecture roughly looks like the following:
So the idea is that in our SourceSystem all kinds of stuff happens, which leads to messages being put on the Service Bus topics. Our responsibility is syncing these events to the external client so they are aware of what we are doing.
The issue is that we currently don't use Service Bus sessions, so message order isn't guaranteed. Consider the following scenario:
OrderCreated
OrderUpdate 1
OrderUpdate 2
OrderClosed
What happens now is that if the external client's API is down for, say, OrderUpdate 1 and OrderUpdate 2, we could potentially send the messages in this order: OrderCreated, OrderClosed, OrderUpdate 1, OrderUpdate 2.
Currently we just retry a message a few times, and then it moves into the dead-letter queue for manual reprocessing.
What steps should we take to better guarantee message order? I feel that within the scope of an order, message order needs to be guaranteed.
Should we force the SourceSystem to put all messages for an order in a Service Bus session? But how can we handle this with multiple topics? And what do we do if message 1 from a session ends up in the dead-letter queue?
There are a lot of considerations here. Should we use a single topic so it's easier to manage the sessions? But doesn't that open up other problems, with different message structures being in a single topic?
I'd love to hear your opinions on this.
Have a look at Durable Functions in Azure. You can use the 'Async Http API' or one of the other patterns to achieve the orchestration you need to do.
NServiceBus Sagas might also be a good option; here is an article that does a very good comparison between NServiceBus and Durable Functions.
If the external client has to receive all those events and order matters, sending those messages to multiple topics where a topic is per message type will make your mission extremely hard to accomplish. For ordered messaging first you need to use a single entity (queue or topic) with Sessions enabled. That way you can guarantee ordered message processing. In case you have multiple external clients, you'd need to have a session-enabled entity (topic) per external client.
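The effect of sessions can be illustrated without the Azure SDK. The Python sketch below is a plain in-memory simulation, not Service Bus code: `deliver_by_session` and `send` are hypothetical, and the session ID is assumed to be the order ID. It shows the property that matters here: when one message of a session cannot be delivered, the later messages of that session are held back instead of overtaking it, while other sessions continue.

```python
from collections import deque


def deliver_by_session(messages, send):
    """Deliver (session_id, body) pairs in order within each session.

    When send() fails, the rest of that session is held back so later
    messages never overtake the failed one. Returns the held-back messages.
    """
    sessions = {}
    for session_id, body in messages:
        sessions.setdefault(session_id, deque()).append(body)

    held_back = {}
    for session_id, queue in sessions.items():
        while queue:
            if not send(session_id, queue[0]):
                held_back[session_id] = list(queue)  # retry later, in order
                break
            queue.popleft()
    return held_back


delivered = []


def send(session_id, body):
    # Hypothetical client API that is down for updates to order-1.
    if session_id == "order-1" and body.startswith("OrderUpdate"):
        return False
    delivered.append((session_id, body))
    return True


messages = [
    ("order-1", "OrderCreated"), ("order-2", "OrderCreated"),
    ("order-1", "OrderUpdate 1"), ("order-1", "OrderUpdate 2"),
    ("order-2", "OrderClosed"), ("order-1", "OrderClosed"),
]
held_back = deliver_by_session(messages, send)
```

Here OrderClosed for order-1 waits behind the two failed updates, while order-2 is processed completely.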
Another option is to implement a pattern known as Process Manager. The process manager would be responsible to make the decisions about the incoming messages and conclude when the work for a given order is completed or not.
There are also libraries (MassTransit, NServiceBus, etc) that can help you. NServiceBus implements Process Manager via a feature called Saga (tutorial) and MassTransit has it as well (documentation).

DomainEventPublisher consistency

Having just read Vaughn Vernon's effective aggregate design, I'm wondering about failures related to event publishing.
In the given example at page 9 (page 3 of the PDF), we call DomainEventPublisher.publish(). The event being published allows other aggregates to execute their behaviours.
What I'm wondering is: what happens if DomainEventPublisher.publish() fails? What happens if DomainEventPublisher.publish() succeeds, but the transaction fails?
How do implementations handle these two cases?
DomainEventPublisher.publish() is synchronous. You'd set up a generic handler (one that handles all events) which stores the events in the same database transaction as the business process. This means your event storage must be able to take part in a transaction with whatever other storage mechanism you rely on to store the state of your aggregates.
Once the events have been written to disk transactionally, you can then put them on a message queue for asynchronous delivery.
Are there other known ways to do it?
Well, rather than using a static DomainEventPublisher you could record events in a collection on the AR, just like in event sourcing and then implement a centralised mechanism to store them (e.g. transaction hooks, using aspects, etc.).
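A minimal Python sketch of recording events on the aggregate root instead of calling a static publisher might look like this. The `Order` aggregate, the `OrderClosed` event and the `close_order` application service are illustrative names, and `save` and `store_events` are assumed to run inside one transaction, which the sketch does not implement.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OrderClosed:
    order_id: str


@dataclass
class Order:
    order_id: str
    closed: bool = False
    pending_events: list = field(default_factory=list)

    def close(self):
        if self.closed:
            raise ValueError("order is already closed")
        self.closed = True
        # Record the event on the aggregate; nothing is published here.
        self.pending_events.append(OrderClosed(self.order_id))

    def pull_events(self):
        """Hand the recorded events to the caller and clear the collection."""
        events, self.pending_events = self.pending_events, []
        return events


def close_order(order, save, store_events):
    """Application service: persist state and events together."""
    order.close()
    events = order.pull_events()
    save(order)           # both calls are assumed to share one transaction
    store_events(events)
    return events


saved, stored = [], []
order = Order("o-1")
events = close_order(order, saved.append, stored.extend)
```

The aggregate stays free of infrastructure, and the centralised mechanism (here just the application service) decides when and how the events are stored.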
What happens if DomainEventPublisher.publish() succeeds, but the transaction fails?
In this case I disagree with Vernon's approach. I prefer to return the events to the application service. This way I can persist the changes performed by the aggregate using a transaction (if needed) and, if everything is OK, publish the events. This also helps to keep the business layer entirely clean and pure.
In a few words: if the transaction fails, then no event is raised.
What happens if DomainEventPublisher.publish() fails?
A domain event never fails because of business rules, since it is a notification of something that has already happened. If an aggregate said yes to the operation and returned an event expressing the business changes, then nothing in the world should say that this operation cannot be done or has to be undone.
If the event fails because of infrastructure, then you need to have the tools to re-raise it (automatically or manually) when the outage is fixed and eventually achieve consistency in your system. Take a look at NServiceBus. It provides retries, error queues, logs and so on, so that you never lose events.
If the message system is down, you at least have event logs that you can use to re-raise the events into the message system.

Can domain events be deleted?

In order to make the domain event handling consistent, I want to persist the domain events to the database while saving the AggregateRoot, and react to them later using an event processor. For example, let's say I want to send them to an event bus as integration events. Is the event allowed to be deleted from the database after it has been passed through the bus?
In that case the events would never be loaded with the AggregateRoot again.
In other words, is the reactor allowed to remove the event from the DB after the reaction?
You'll probably want to review Reliable Messaging Without Distributed Transactions, by Udi Dahan; also Pat Helland's paper Life Beyond Distributed Transactions.
In event sourced systems, meaning that the history of domain events is the persisted history of the aggregate, you will almost never delete events.
In a system where the log of domain events is simply a journal of messages to be communicated to other "partners": fundamentally, the domain events are messages that describe information to be copied from one part of the system to another. So when we get an acknowledgement that the message has been copied successfully, we can remove the copy stored "here".
In a system where you can't be sure that all of the consumers have received the domain event (because, perhaps, the list of consumers is not explicit), then you probably can't delete the domain events.
You may be able to move them -- which is, instead of having implicit subscriptions to the aggregate, you could have an explicit subscription from an event history to the aggregate, and then implicit subscriptions to the history.
You might be able to treat the record of domain events as a cache -- if the partners aren't done consuming the message within 7 days of it being available, then maybe the delivery of the message isn't the biggest problem in the system.
How many nines of delivery guarantee do you need?
Domain events are things that have happened in the past. You can't delete the past, assuming you're not Martin McFly :)
Domain events shouldn't be deleted from event store. If you want to know whether you already processed it before, you can add a flag to know it.
UPDATE ==> DESCRIPTION OF EVENT MANAGEMENT PROCESS
I follow the approach of IDDD (Red Book by Vaughn Vernon, see picture on page 287) this way:
1) The aggregate publishes the event locally to the BC (lightweight publisher).
2) In the BC, a lightweight subscriber stores all the events published by the BC in an "event store" (which is a table in the same database as the BC).
3) A batch process (worker) reads the event store and publishes the events to a message queue (or an event bus, as you say).
4) Other BCs interested in the event (or even the same BC) subscribe to the message queue (or event bus) to listen for the event and react to it.
Anyway, even if the worker has successfully sent the event to the message queue, you shouldn't delete the domain event from the event store. Instead, simply don't send it again. Events are things that have happened, and you cannot (should not) delete a thing that occurred in the past.
Message queue or event bus are just a mechanism to send/receive events, but the events should remain stored in the BC they were created and published.
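Steps 1 to 3 above can be sketched in a few lines of in-memory Python. All class and function names are made up for the example, and the `last_forwarded` marker is just one possible way (an assumption, alongside the flag mentioned earlier) for the worker to avoid resending events without deleting them.

```python
class LightweightPublisher:
    """In-process publisher local to the bounded context (step 1)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:
            handler(event)


class EventStore:
    """Append-only store; in practice a table in the BC's database (step 2)."""

    def __init__(self):
        self.events = []          # rows are never deleted
        self.last_forwarded = 0   # how many events the worker has sent so far

    def append(self, event):
        self.events.append(event)


def forward_new_events(store, send):
    """Worker (step 3): send events not yet forwarded, then advance the marker."""
    new_events = store.events[store.last_forwarded:]
    for event in new_events:
        send(event)
        store.last_forwarded += 1
    return len(new_events)


publisher = LightweightPublisher()
store = EventStore()
publisher.subscribe(store.append)   # the subscriber that stores every event

publisher.publish("OrderCreated")
publisher.publish("OrderClosed")

queue = []                          # stands in for the message queue
first_run = forward_new_events(store, queue.append)
second_run = forward_new_events(store, queue.append)
```

After both runs the queue has received each event once and the event store still holds the full history.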

Domain driven design and domain events

I'm new to DDD and I'm reading articles now to get more information. One of the articles focuses on domain events (DE). For example sending email is a domain event raised after some criteria is met while executing piece of code.
Code example shows one way of handling domain events and is followed by this paragraph
Please be aware that the above code will be run on the same thread within the same transaction as the regular domain work so you should avoid performing any blocking activities, like using SMTP or web services. Instead, prefer using one-way messaging to communicate to something else which does those blocking activities.
My questions are
Is this a general problem in handling DE? Or is it just a concern of the solution in the mentioned article?
If domain events are raised in a transaction and the system does not handle them synchronously, how should they be handled?
If I decide to serialize these events and let a scheduler (or any other mechanism) execute them, what happens when the transaction is rolled back? (In the article, the event is raised in code executed in a transaction.) Who will cancel them (when they are not persisted to the database)?
Thanks
It's a general problem, period, never mind DDD.
In general, in any system which is required to respond in a performant manner (e.g. a web server), any long-running activities should be handled asynchronously to the triggering process.
This means a queue.
Rolling back your transaction should remove the item from the queue.
Of course, you now need additional mechanisms to handle the situation where the item on the queue fails to process - i.e the email isn't sent - you also need to allow for this in your triggering code - having a subsequent process RELY on the earlier process having already occurred is going to cause issues at some point.
In short, your queueing mechanism should itself be transactional and allow for retries and you need to think about the whole chain of events as a workflow.
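What "transactional" means for the queue can be shown with a small in-memory Python sketch. `TransactionalQueue` and `run_in_transaction` are hypothetical and only model the behaviour: items enqueued during a unit of work become visible to consumers on commit and disappear on rollback. Retries and the workflow side are not covered here.

```python
class TransactionalQueue:
    """Queue whose enqueued items become visible only on commit."""

    def __init__(self):
        self.visible = []    # what consumers can read
        self._staged = []    # items added in the current transaction

    def enqueue(self, item):
        self._staged.append(item)

    def commit(self):
        self.visible.extend(self._staged)
        self._staged.clear()

    def rollback(self):
        self._staged.clear()


def run_in_transaction(queue, work):
    """Run work(queue); commit on success, roll back on any exception."""
    try:
        work(queue)
    except Exception:
        queue.rollback()
        return False
    queue.commit()
    return True


def register_user(queue):
    queue.enqueue("send welcome email")


def failing_registration(queue):
    queue.enqueue("send welcome email")
    raise RuntimeError("simulated failure in the domain work")


queue = TransactionalQueue()
ok = run_in_transaction(queue, register_user)
failed = run_in_transaction(queue, failing_registration)
```

Only the email from the successful registration ever reaches consumers, so nothing has to be cancelled after a rollback.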

Resources