Design choice for a microservice event-driven architecture - domain-driven-design

Let's suppose we have the following:
DDD aggregates A and B, A can reference B.
A microservice managing A that exposes the following commands:
create A
delete A
link A to B
unlink A from B
A microservice managing B that exposes the following commands:
create B
delete B
A successful creation, deletion, link or unlink always results in the emission of a corresponding event by the microservice that performed the action.
What is the best way to design an event-driven architecture for these two microservices so that:
A and B will always eventually be consistent with each other. By consistency, I mean A should not reference B if B doesn't exist.
The events from both microservices can easily be projected in a separate read model on which queries spanning both A and B can be made
Specifically, the following examples could lead to transient inconsistent states, but consistency must in all cases eventually be restored:
Example 1
Initial consistent state: A exists, B doesn't, A is not linked to B
Command: link A to B
Example 2
Initial consistent state: A exists, B exists, A is linked to B
Command: delete B
Example 3
Initial consistent state: A exists, B exists, A is not linked to B
Two simultaneous commands: link A to B and delete B
I have two solutions in mind.
Solution 1
Microservice A only allows linking A to B if it has previously received a "B created" event and no "B deleted" event.
Microservice B only allows deleting B if it has not previously received an "A linked to B" event, or if that event was followed by an "A unlinked from B" event.
Microservice A listens to "B deleted" events and, upon receiving such an event, unlinks A from B (for the race condition in which B is deleted before it has received the A linked to B event).
Solution 2:
Microservice A always allows linking A to B.
Microservice B listens for "A linked to B" events and, upon receiving such an event, verifies that B exists. If it doesn't, it emits a "link to B refused" event.
Microservice A listens for "B deleted" and "link to B refused" events and, upon receiving such an event, unlinks A from B.
EDIT: Solution 3, proposed by Guillaume:
Microservice A only allows linking A to B if it has not previously received a "B deleted" event.
Microservice B always allows deleting B.
Microservice A listens to "B deleted" events and, upon receiving such an event, unlinks A from B.
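To make the comparison below concrete, here is a minimal sketch of the compensation logic that Solutions 2 and 3 share (handler and field names are hypothetical, and I assume at-least-once delivery, so the handlers are idempotent):

```python
# Minimal sketch of the compensating handlers in microservice A (Solutions 2/3).
# Names (AState, handle_b_deleted, ...) are hypothetical.

class AState:
    def __init__(self):
        self.linked_b_ids = set()            # B ids this A currently references

    def link(self, b_id):
        self.linked_b_ids.add(b_id)          # Solutions 2/3: always accept the link

    def unlink(self, b_id):
        self.linked_b_ids.discard(b_id)      # idempotent: unlinking twice is harmless


def handle_b_deleted(a: AState, event: dict):
    # "B deleted" always wins: drop any reference to the deleted B.
    a.unlink(event["b_id"])


def handle_link_refused(a: AState, event: dict):
    # Solution 2 only: B reported that the link was invalid when it was made.
    a.unlink(event["b_id"])
```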
The advantage I see for solution 2 is that the microservices don't need to keep track of past events emitted by the other service. In solution 1, each microservice basically has to maintain a read model of the other one.
A potential disadvantage for solution 2 could maybe be the added complexity of projecting these events in the read model, especially if more microservices and aggregates following the same pattern are added to the system.
Are there other (dis)advantages to one or the other solution, or even an anti-pattern I'm not aware of that should be avoided at all costs?
Is there a better solution than the two I propose?
Any advice would be appreciated.

Microservice A only allows linking A to B if it has previously received a "B created" event and no "B deleted" event.
There's a potential problem here; consider a race between two messages, link A to B and B Created. If the B Created message happens to arrive first, then everything links up as expected. If B Created happens to arrive second, then the link doesn't happen. In short, you have a business behavior that depends on your message plumbing.
Udi Dahan, 2010: "A microsecond difference in timing shouldn't make a difference to core business behaviors."
A potential disadvantage for solution 2 could maybe be the added complexity of projecting these events in the read model, especially if more microservices and aggregates following the same pattern are added to the system.
I don't like that complexity at all; it sounds like a lot of work for not very much business value.
Exception Reports might be a viable alternative. Greg Young talked about this in 2016. In short: having a monitor that detects inconsistent states, together with a way to remediate those states, may be enough.
Adding automated remediation comes later. Rinat Abdullin described this progression really well.
The automated version ends up looking something like solution 2; but with separation of the responsibilities -- the remediation logic lives outside of microservice A and B.
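As a rough illustration of that separation of responsibilities, here is a sketch of what such an external remediation process might look like; all names are hypothetical, and send_command stands in for whatever command dispatch the system actually uses:

```python
# Sketch of an external remediation process (an "exception report" that also
# auto-remediates). It lives outside microservices A and B and watches both
# event streams via its own read models.

def find_inconsistencies(a_read_model, b_read_model):
    """Yield (a_id, b_id) pairs where an A references a B that no longer exists."""
    for a_id, linked_b_ids in a_read_model.items():
        for b_id in linked_b_ids:
            if b_id not in b_read_model:
                yield a_id, b_id


def remediate(a_read_model, b_read_model, send_command):
    for a_id, b_id in find_inconsistencies(a_read_model, b_read_model):
        send_command("UnlinkAFromB", {"a_id": a_id, "b_id": b_id})


# Example: A1 still links to B9, but B9 has been deleted.
remediate({"A1": {"B9"}}, {"B2": {}}, lambda name, payload: print(name, payload))
```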

Your solutions seem OK but there are some things that need to be clarified:
In DDD, aggregates are consistency boundaries. An Aggregate is always in a consistent state, no matter what command it receives and whether that command succeeds or not. But this does not mean that the whole system is, at every moment, in a state that the business permits. There are moments when the system as a whole is in a non-permitted state. This is OK as long as it eventually transitions into a permitted state. This is where Sagas/Process managers come in: their role is exactly to bring the system back into a valid state. They can be deployed as separate microservices.
Another type of component/pattern that I have used in my CQRS projects is the eventually-consistent command validator. It validates a command against a private read model (and rejects it if it is not valid) before the command reaches the Aggregate. These components minimize the situations in which the system enters an invalid state, and they complement the Sagas. They should be deployed inside the microservice that contains the Aggregate, as a layer on top of the domain layer (the aggregate).
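For instance, a validator for the "link A to B" command might look like the following sketch, where the private read model is fed by B's events; the names are placeholders, and the guard is only probabilistic, as explained below:

```python
# Sketch of an eventually-consistent command validator inside microservice A.
# known_bs is a private read model maintained from "B created"/"B deleted" events.

class BReadModel:
    def __init__(self):
        self.known_bs = set()

    def apply(self, event_type, b_id):
        if event_type == "BCreated":
            self.known_bs.add(b_id)
        elif event_type == "BDeleted":
            self.known_bs.discard(b_id)


def validate_link_command(read_model: BReadModel, b_id: str):
    # Reject early if, as far as A knows, B does not exist. This is only a
    # probabilistic guard: the read model may lag behind B's actual state.
    if b_id not in read_model.known_bs:
        raise ValueError(f"Cannot link: B {b_id} is not known to exist")
```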
Now, back to Earth. Your solutions are a combination of Aggregates, Sagas and Eventually-consistent command validations.
Solution 1
Microservice A only allows linking A to B if it has previously received a "B created" event and no "B deleted" event.
Microservice A listens to "B deleted" events and, upon receiving such an event, unlinks A from B.
In this architecture, Microservice A contains Aggregate A and a command validator, and Microservice B contains Aggregate B and a Saga. It is important to understand that the validator does not prevent the system from reaching an invalid state; it only reduces the probability of it.
Solution 2:
Microservice A always allows linking A to B.
Microservice B listens for "A linked to B" events and, upon receiving such an event, verifies that B exists. If it doesn't, it emits a "link to B refused" event.
Microservice A listens for "B deleted" and "link to B refused" events and, upon receiving such an event, unlinks A from B.
In this architecture, Microservice A contains Aggregate A and a Saga, and Microservice B contains Aggregate B and also a Saga. This solution could be simplified if the Saga in B verified the existence of B and sent an "unlink A from B" command to A instead of emitting an event.
In any case, in order to apply the SRP, you could extract the Sagas to their own microservices. In this case you would have a microservice per Aggregate and per Saga.

I will start with the same premise as #ConstantinGalbenu but follow with a different proposition ;)
Eventual consistency means that the whole system will eventually converge to a consistent state.
If you add to that "no matter the order in which messages are received", you've got a very strong statement by which your system will naturally tend to an ultimate coherent state without the help of an external process manager/saga.
If you make as many operations as possible commutative from the receiver's perspective, e.g. it doesn't matter whether link A to B arrives before or after create A (both orders lead to the same resulting state), you're pretty much there. That's basically the first bullet point of Solution 2 generalized to as many events as possible, but without the second bullet point.
Microservice B listens for "A linked to B" events and, upon receiving such an event, verifies that B exists. If it doesn't, it emits a "link to B refused" event.
You don't need to do this in the nominal case. You'd do it in the case where you know that A didn't receive a B deleted message, but then it shouldn't be part of your normal business process; that's delivery-failure management at the messaging platform level. I wouldn't have the microservice where the original data came from systematically double-check everything, because things get way too complex. It looks as if you're trying to put some immediate consistency back into an eventually consistent setup.
That solution might not always be feasible, but at least from the point of view of a passive read model that doesn't emit events in response to other events, I can't think of a case where you couldn't manage to handle all events in a commutative way.
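As a sketch of what "commutative from the receiver's perspective" can mean in practice (names are hypothetical), the read model below records link requests and deletions separately and derives the effective links from both sets, so "A linked to B" and "B deleted" can arrive in either order:

```python
# Sketch of a commutative receiver: handling "A linked to B" and "B deleted"
# in either order converges to the same state.

class Links:
    def __init__(self):
        self.requested = set()   # (a_id, b_id) links that were requested
        self.deleted_bs = set()  # B ids known to be deleted

    def on_linked(self, a_id, b_id):
        self.requested.add((a_id, b_id))

    def on_b_deleted(self, b_id):
        self.deleted_bs.add(b_id)

    def effective_links(self):
        # The deletion always wins, regardless of arrival order.
        return {(a, b) for (a, b) in self.requested if b not in self.deleted_bs}


# Either order yields the same result:
x, y = Links(), Links()
x.on_linked("A1", "B1"); x.on_b_deleted("B1")
y.on_b_deleted("B1"); y.on_linked("A1", "B1")
assert x.effective_links() == y.effective_links() == set()
```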

Related

CQRS Aggregate and Projection consistency

An Aggregate can use a View; this fact is described in Vaughn Vernon's book:
Such Read Model Projections are frequently used to expose information to various clients (such as desktop and Web user interfaces), but they are also quite useful for sharing information between Bounded Contexts and their Aggregates. Consider the scenario where an Invoice Aggregate needs some Customer information (for example, name, billing address, and tax ID) in order to calculate and prepare a proper Invoice. We can capture this information in an easy-to-consume form via CustomerBillingProjection, which will create and maintain an exclusive instance of CustomerBillingView. This Read Model is available to the Invoice Aggregate through the Domain Service named IProvideCustomerBillingInformation. Under the covers this Domain Service just queries the document store for the appropriate instance of the CustomerBillingView.
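Here is roughly how I picture the arrangement from the quote; apart from CustomerBillingView and the billing-information domain service, the details are placeholders of mine:

```python
# Sketch of an Aggregate consuming a Read Model through a Domain Service,
# following the quote above. Storage details are assumptions.

class CustomerBillingView:
    def __init__(self, name, billing_address, tax_id):
        self.name, self.billing_address, self.tax_id = name, billing_address, tax_id


class ProvideCustomerBillingInformation:
    """Domain service (IProvideCustomerBillingInformation in the quote)."""
    def __init__(self, document_store: dict):
        self._store = document_store  # e.g. a document database keyed by customer id

    def billing_info_for(self, customer_id) -> CustomerBillingView:
        return self._store[customer_id]


class Invoice:
    def __init__(self, customer_id, billing_info_service):
        view = billing_info_service.billing_info_for(customer_id)
        # The aggregate copies what it needs; it does not hold a live reference.
        self.bill_to = (view.name, view.billing_address, view.tax_id)
```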
Let's imagine our application should allow creating many users, but with unique names. Commands/events flow:
CreateUser{Alice} command sent
UserAggregate checks UsersListView, since there are no users with name Alice, aggregate decides to create user and publish event.
UserCreated{Alice} event published // By UserAggregate
UsersListProjection processed UserCreated{Alice} // for simplicity let's think UsersListProjection just accumulates users names if receives UserCreated event.
CreateUser{Bob} command sent
UserAggregate checks UsersListView, since there are no users with name Bob, aggregate decides to create user and publish event.
UserCreated{Bob} event published // By UserAggregate
CreateUser{Bob} command sent
UserAggregate checks UsersListView, since there are no users with name Bob, aggregate decides to create user and publish event.
UsersListProjection processed UserCreated{Bob} .
UsersListProjection processed UserCreated{Bob} .
The problem is that UsersListProjection did not have time to process the event and contained stale data, and the aggregate used that stale data. As a result, two users with the same name were created.
how to avoid such situations?
how to make aggregates and projections consistent?
how to make aggregates and projections consistent?
In the common case, we don't. Projections are consistent with the aggregate at some time in the past, but do not necessarily have all of the latest updates. That's part of the point: we give up "immediate consistency" in exchange for other (higher leverage) benefits.
The duplication that you refer to is usually solved a different way: by using conditional writes to the book of record.
In your example, we would normally design the system so that the second attempt to write Bob to our data store fails because of a conflict. Also, we prevent duplicates from propagating by ensuring that the write to the data store happens-before any events are made visible.
What this gives us, in effect, is a "first writer wins" write strategy. The writer that loses the data race has to retry/fail/etc.
(As a rule, this depends on the idea that both attempts to create Bob write that information to the same place, using the same locks.)
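A conditional write can be as small as the following sketch; the in-memory dictionary is a stand-in for a book of record with a uniqueness guarantee (a unique constraint or compare-and-set in a real store):

```python
# Sketch of a conditional write to the book of record: "first writer wins".

class DuplicateUser(Exception):
    pass


class UserStore:
    def __init__(self):
        self._users = {}

    def create(self, name):
        # Conditional write: succeed only if no record exists for this name yet.
        if name in self._users:
            raise DuplicateUser(name)
        self._users[name] = {"name": name}
        return self._users[name]


store = UserStore()
store.create("Bob")        # first writer wins
try:
    store.create("Bob")    # second writer loses the race and must fail/retry
except DuplicateUser:
    pass
```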
A common design to reduce the probability of conflict is to NOT use the separate "read model" for this check, but to instead have the aggregate use its own data in the data store. That doesn't necessarily eliminate all data races, but it reduces the width of the window.
Finally, we fall back on Memories, Guesses and Apologies.
It's important to remember in CQRS that every write model is also a read model for the reads that are required to validate a command. Those reads are:
checking for the existence of an aggregate with a particular ID
loading the latest version of an entire aggregate
In general a CQRS/ES implementation will provide that read model for you. The particulars of how that's implemented will depend on the implementation.
Those are the only reads a command-handler ever needs to perform, and if a query can be answered with no more than those reads, the query can be expressed as a command (e.g. GetUserByName{Alice}) which when handled does not emit events. The benefit of such read-only commands is that they can be strongly consistent because they are limited to a single aggregate. Not all queries, of course, can be expressed this way, and if the query can tolerate eventual consistency, it may not be worth paying the coordination tax for strong consistency that you typically pay by making it a read-only command. (Command handling limited to a single aggregate is generally strongly consistent, but there are cases, e.g. when the events form a CRDT and an aggregate can live in multiple datacenters where even that consistency is loosened).
So with that in mind:
CreateUser{Alice} received
user Alice does not exist
persist UserCreated{Alice}
CreateUser{Alice} acknowledged (e.g. HTTP 200, ack to *MQ, Kafka offset commit)
UserListProjection updated from UserCreated{Alice}
CreateUser{Bob} received
user Bob does not exist
persist UserCreated{Bob}
CreateUser{Bob} acknowledged
CreateUser{Bob} received
user Bob already exists
command-handler for an existing user rejects the command and persists no events (it may log that an attempt to create a duplicate user was made)
CreateUser{Bob} ack'd with failure (e.g. HTTP 409 Conflict, ack to *MQ, Kafka offset commit)
UserListProjection updated from UserCreated{Bob}
Note that while the UserListProjection can answer the question "does this user exist?", the fact that the write-side can also (and more consistently) answer that question does not in and of itself make that projection superfluous. UserListProjection can also answer questions like "who are all of the users?" or "which users have two consecutive vowels in their name?" which the write-side cannot answer.
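Tying the flow above together, a command handler might look roughly like this sketch; the store, the event shape and the synchronous projection update are simplifications of mine:

```python
# Sketch of the CreateUser command handler described in the flow above.
# event_store and projection are simplified stand-ins.

def handle_create_user(name, event_store: dict, projection: list):
    # Write-side read: does an aggregate (stream) for this user already exist?
    if name in event_store:
        return {"ok": False, "reason": "user already exists"}  # e.g. HTTP 409

    event = {"type": "UserCreated", "name": name}
    event_store[name] = [event]        # persist before acknowledging
    projection.append(name)            # in reality, updated asynchronously
    return {"ok": True}                # e.g. HTTP 200 / offset commit


store, user_list = {}, []
assert handle_create_user("Bob", store, user_list)["ok"]
assert not handle_create_user("Bob", store, user_list)["ok"]
```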

DDD - How to modify several AR (from different bounded contexts) throughout single request?

I would like to present a little scenario which is still at the paper stage and which, with regard to DDD principles, seems a bit tedious to accomplish.
Let's say I have an application for hosting account management. Basically, the application comprises several bounded contexts such as Web account management, FTP account management, Mail account management... each of them represented by its own AR (they can live standalone).
Now, let's imagine I want to provide a UI with an HTML form that composes one fieldset for each bounded context, for instance to update limits and/or features. How exactly should I proceed to update all the ARs without breaking the single-transaction-per-request principle? Can I create a kind of "outer" AR, let's say a ClientHostingProperties AR, which would hold references to the other ARs and update them as part of a single transaction, using its own repository? Or should I rather create an AR that emits messages for listeners provided by the bounded contexts to react to, in which case I should probably think about ES?
Thanks.
How exactly should I proceed to update all the ARs without breaking the single-transaction-per-request principle?
You are probably looking for a process manager.
Basic sketch: persisting the details from the submitted form is a transaction unto itself (you are offered an opportunity to accrue business value; step 1 is to capture that opportunity).
That gives you a way to keep track of whether or not this task is "done": you compare the changes in the task to the state of the system, and fire off commands (to run in isolated transactions) to make changes.
Processes, in my mind, end up looking a lot like state machines: these commands are done, these commands are not done, these commands have failed, now what? Eventually the process reaches a state where there are no additional changes to be made, and this instance of the process is "done".
Short answer: You don't.
An aggregate is a transactional boundary, which means that if you would update multiple aggregates in one "action", you'd have to use multiple transactions. The reason for an aggregate to be equivalent to one transaction is that this allows you to guarantee consistency.
This means that you have two options:
You can make your aggregate larger. Then you can actually guarantee consistency, but your ability to handle concurrent requests gets worse. So this is usually what you want to avoid.
You can live with the fact that it's two transactions, which means you are eventually consistent. If so, you usually use something such as a process manager or a flow to handle updating multiple aggregates. In its simplest form, a flow is nothing but a simple if this event happens, run that command rule. In its more complex form, it has its own state.
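In code, the simplest form of such a flow can be little more than a rule table; the event and command names below are made up for illustration:

```python
# Sketch of a stateless "flow": when this event happens, run that command.
# Event/command names are hypothetical.

def order_placed_to_reserve_inventory(event):
    # Rule: an OrderPlaced event triggers a ReserveInventory command.
    return {"command": "ReserveInventory",
            "order_id": event["order_id"],
            "items": event["items"]}


FLOW_RULES = {"OrderPlaced": order_placed_to_reserve_inventory}


def on_event(event, send_command):
    rule = FLOW_RULES.get(event["type"])
    if rule:
        send_command(rule(event))
```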
Hope this helps 😊

How to avoid concurrency on aggregates status using Rebus in a server cluster

I have a web service that uses Rebus as a service bus.
Rebus is configured as explained in this post.
The web service is load balanced across a two-server cluster.
These services are for a production environment, and each production machine sends commands to save the quantities it has produced and/or to update its state.
In the BL I've modelled an Aggregate Root for each machine and it executes the commands emitted by the real machine. To preserve the correct status, the Aggregate needs to receive the commands in the same sequence as they were emitted, and, since there is no concurrency for that machine, that is the same order they are saved on the bus.
E.g.: machine XX sends a command 'add new piece done' and then the command 'set stop for maintenance'. Executing these commands in sequence should leave Aggregate XX in state 'Stop', but with multiple server/worker roles both commands could be executed at the same time on the same version of the Aggregate. This means that, depending on which one saves the aggregate first, I can end up with Aggregate XX in state 'Stop' or 'Producing pieces'... which is not the same thing.
I've introduced a service bus to allow scaling out as the number of machines grows, and for resilience (if a server fails I only get a slowdown in processing commands).
Actually I'm using the name of the aggregate like a "topic" or "destinationAddress" with the IAdvancedApi, so the name of the aggregate is saved as the recipient of the transport. Then I've created a custom Transport class that:
1. does not remove the messages in progress but sets them to an InProgress state;
2. when retrieving messages, selects only those in a recipient that has none InProgress.
I'm wondering: is this the best way to guarantee that the bus executes the commands for an aggregate in the same sequence as they arrived?
The solution would be to have some kind of locking of your aggregate root, which needs to happen at the data store level.
E.g. by using optimistic locking (probably implemented with some kind of revision number or something like that), you would be sure that you would never accidentally overwrite another node's edits.
This would allow for your aggregate to either
a) accept the changes in either order (which is generally preferable – makes your system more tolerant), or
b) reject an invalid change
If the aggregate rejects the change, this could be implemented by throwing an exception. And then, in the Rebus handler that catches this exception, you can e.g. await bus.Defer(TimeSpan.FromSeconds(5), theMessage) which will cause it to be delivered again in five seconds.
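A language-agnostic sketch of that optimistic locking (this is not the Rebus API; the store and the defer callback are assumptions) could look like this:

```python
# Sketch of optimistic locking with a revision number (not Rebus-specific).

class ConcurrencyConflict(Exception):
    pass


class AggregateStore:
    def __init__(self):
        self._data = {}   # aggregate_id -> (revision, state)

    def load(self, aggregate_id):
        return self._data.get(aggregate_id, (0, {"status": "Producing"}))

    def save(self, aggregate_id, expected_revision, new_state):
        current_revision, _ = self.load(aggregate_id)
        if current_revision != expected_revision:
            # Another node saved a newer revision first.
            raise ConcurrencyConflict(aggregate_id)
        self._data[aggregate_id] = (expected_revision + 1, new_state)


def handle_command(store, aggregate_id, apply_command, defer):
    revision, state = store.load(aggregate_id)
    try:
        store.save(aggregate_id, revision, apply_command(state))
    except ConcurrencyConflict:
        defer()   # e.g. re-deliver the message a few seconds later


store = AggregateStore()
handle_command(store, "XX", lambda s: {**s, "status": "Stop"}, defer=lambda: None)
```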
You should never rely on message order in a service bus / queuing / messaging environment.
When you do find yourself in this position you may need to re-think your design. Firstly, a service bus is most certainly not an event store and attempting to use it like one is going to lead to pain and suffering :) --- not that you are attempting this but I thought I'd throw it in there.
As for your design, in order to manage this kind of state you may want to look at a process manager. If you are not generating those commands then even this will not help.
However, given your scenario it seems as though the calls are sequential but perhaps it is just your example. In any event, as mookid8000 said, you either want to:
discard invalid changes (with the appropriate feedback),
allow any order of messages as long as they are valid,
ignore out-of-sequence messages till later.
Hope that helps...
"exactly the same sequence as they were saved on the bus"
Just... why?
Would you rely on your HTTP server logs to know which command actually reached an aggregate first? No, because that is totally unreliable, just as it is with at-least-once delivery guarantees, and it's also irrelevant.
It is your event store and/or normal persistence state that should be the source of truth when it comes to knowing the sequence of events. The order of commands shouldn't really matter.
Assuming optimistic concurrency, if the aggregate is not allowed to transition from A to C then it should guard this invariant, and when a TransitionToStateC command hits it in the A state it will simply get rejected.
If, on the other hand, A->C->B transitions are valid and that is the order received by your aggregate, well, that is what happened from the domain perspective. It really shouldn't matter which command was published first on the bus, just like it doesn't matter which user executed the command first from the UI.
"In my scenario the calls for a specific aggregate are absolutely
sequential and I must guarantee that are executed in the same order"
Why are you executing them asynchronously and potentially concurrently by publishing on a bus then? What you are basically saying is that calls are sequential and cannot be processed concurrently. That means everything should be synchronous because there is no potential benefit from parallelism.
Why:
executeAsync(command1)
executeAsync(command2)
executeAsync(command3)
When you want:
execute(command1)
execute(command2)
execute(command3)
You should have a single command message and the handler of this message executes multiple commands against the aggregate. Then again, in this case I'd just create a single operation on the aggregate that performs all the transitions.
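For example (with made-up names), the machine sends one command carrying the whole unit of work, and the handler applies it to the aggregate as a single operation:

```python
# Sketch: one command message carrying the whole unit of work, applied by the
# aggregate as a single operation (and therefore a single transaction).

class MachineAggregate:
    def __init__(self):
        self.pieces_done = 0
        self.status = "Producing"

    def report_and_stop(self, pieces_done, reason):
        # One operation performing both transitions, in order.
        self.pieces_done += pieces_done
        self.status = f"Stopped ({reason})"


def handle_report_and_stop(aggregate: MachineAggregate, command: dict):
    aggregate.report_and_stop(command["pieces_done"], command["reason"])


machine = MachineAggregate()
handle_report_and_stop(machine, {"pieces_done": 1, "reason": "maintenance"})
```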

What are the best practices to introduce a new BC to a DDD app?

This is a theoretical question about the introduction of new BCs in a system we use ES and CQRS with DDD. So there won't be concrete examples.
Introducing new BCs, which communicate with the old ones by receiving and publishing domain events, can lead to interesting problems. The root of these problems is that we already have domain events in the event store. When the new BC reacts to those old domain events, it will do so out of sync and/or out of sequence.
For example, we have an old BC A and we introduce a new BC B. Both publish domain events, which we call a and b. In the new system the order matters: for example, b1 must always come after a1 but before a2. What can we do when we already have the a1, a2, a3 sequence in the event store? Should we inject b1 after a1 and so on? Is this a viable solution with a huge event store? It will certainly take a long time to replay all the old events one by one and react to them. How can we prevent sending an email to the customer when handling the newly created b1 event, which reacts to a 5-year-old topic? Is there a pattern to prevent these kinds of problems?
Problem Analysis
The root of these problems is that we already have domain events in the event store.
If you introduce a new BC B to an existing system, that means the system was functional without B. This is clear from the above statement and has the following consequences:
Events that B would have produced in response to events from A do not need to be published. No other system should take action based on these events, because they are artificial.
You can go live with B at any time you choose. The only thing that you need to do beforehand is getting B in sync with the current state of the system.
Getting B in Sync
This is not difficult if you design B accordingly.
First, you need a replay-mode mechanism to import all domain events into B without publishing events from B in response. You need to keep B's events internally, of course, if you use event sourcing, but do not publish them. Also, make sure B does not modify the state of the world by other means while in replay mode, e.g. don't send emails.
Then, switch B over to live mode. Now B consumes the new events from the system and also publishes its own.
The problem you mention with event ordering is only a problem when you use a unified event store for all domain events and also use that store to publish events from. If this is the case, then you need to mark B's events as "internal" during the replay phase and hide them from the publishing mechanism.
Note: If B is a purely reactive BC (this could be the case for a very simple BC), then you don't even need the replay stuff. But most BC's probably do.
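A sketch of that replay/live switch (the names are mine, not from any particular framework): while replaying, B keeps its own events and updates its state, but publishing and other side effects are suppressed.

```python
# Sketch of a new BC "B" that can be fed historical events in replay mode
# without publishing its own events or sending emails.

class BoundedContextB:
    def __init__(self, publish, send_email):
        self.replay_mode = True
        self._publish = publish
        self._send_email = send_email
        self.internal_events = []

    def go_live(self):
        self.replay_mode = False

    def handle(self, event):
        produced = self._react_to(event)
        self.internal_events.extend(produced)    # always keep B's own events
        if not self.replay_mode:
            for e in produced:
                self._publish(e)                 # publish only in live mode
            self._send_email(event)              # side effects only in live mode

    def _react_to(self, event):
        return [{"type": "b-reacted", "to": event["type"]}]


b = BoundedContextB(publish=print, send_email=lambda e: None)
b.handle({"type": "a1"})   # replay: nothing is published
b.go_live()
b.handle({"type": "a2"})   # live: the reaction to a2 is published
```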
First of all, DDD does not require event sourcing.
we have an old BC A and we introduce a new BC B. Both publish domain events, which we call a and b. In the new system the order matters: for example, b1 must always come after a1 but before a2.
Events can be out of order, even in the same component (bounded context). Transactional integrity is only guaranteed within aggregates.
when we already have the a1, a2, a3 sequence in the event store?
Doesn't matter. By the way you don't have this guarantee with SQL databases unless you work in SERIALIZABLE isolation (or its vendor specific equivalent). Protip: It's so taxing on performance that it's never enabled by default; therefore you are not using it.
Pay special attention to this part in the above link:
Other transactions cannot insert new rows with key values that would
fall in the range of keys read by any statements in the current
transaction until the current transaction completes.
Furthermore, though an event store shouldn't have multiple copies of an event, events (and other messages such as commands) may arrive multiple times between components.
Should we inject b1 after a1 and so on?
Since your components should be able to handle out-of-order (and duplicate) events: no.
What can we do,
Depending on the technology used to integrate components, and the semantics of the messages:
If you are reading events from a web service, feed, or DB table (such that the event never goes away), you might be able to ignore an event until it is relevant.
Equivalently, you might be able to put an event back on the message queue it came from until it is relevant.
You may use the pattern known as Saga/Process Manager.
Is there a real race condition, at all?

Domain Driven Design - Atomic transaction across multiple bounded context

In DDD, I understand that Events can decouple the Bounded Contexts when they communicate with each other. Assume an atomic transaction contains two database operations on separate bounded contexts A and B. When the operation on A finishes, it sends an event which is handled by B, which performs the second operation. However, how does the operation on A roll back if the operation on B fails?
For example, I am currently designing a system using Domain-Driven Design. It contains a Membership and an Inventory bounded context. In order to decouple the contexts, I use Events: when an order is paid, the Inventory context reduces the quantity of the sold product and sends a Product_Sold event. The event is then handled by the Membership context, which subtracts the user's balance based on the price of the sold product.
However, if the user balance update fails due to a database failure, how does the Inventory context know, so that it can roll back the previously reduced product quantity?
There's actually a pattern for this called Saga.
http://vasters.com/clemensv/2012/09/01/Sagas.aspx
http://nservicebus.com/Sagas.aspx
As you use events to communicate between contexts, simply publish a Product_NotSold event and roll back the previous change when you get this event.
However, you cannot provide an 'atomic' transaction this way. It is more of a long-running process (a.k.a. saga). If you really want atomicity, you need to use two-phase commit and abandon events.
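A sketch of that compensation, reusing the Product_Sold / Product_NotSold events from the question (the handler shapes are assumptions): the Membership context publishes the failure event when it cannot complete the debit, and the Inventory context restores the quantity when it sees it.

```python
# Sketch of the compensation flow between the Inventory and Membership
# contexts using the Product_Sold / Product_NotSold events.

def membership_on_product_sold(event, balances, publish):
    try:
        # Attempt the second step: debit the member's balance.
        balances[event["user_id"]] -= event["price"]
    except Exception:
        # Could not complete the debit (e.g. a persistent storage failure):
        # publish the compensating event instead of leaving the sale half-done.
        publish({"type": "Product_NotSold",
                 "product_id": event["product_id"],
                 "quantity": event["quantity"]})


def inventory_on_product_not_sold(event, stock):
    # Compensating action in the Inventory context: restore the quantity
    # that was reduced when Product_Sold was emitted.
    stock[event["product_id"]] += event["quantity"]
```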

Resources