DDD handling Aggregate updates over time

Using Event Sourcing, I have a domain in which aggregates should be updated from time to time. When I create an aggregate, I have an expiry time (this can be arbitrary) on it, and after that time I have to update some properties of the entity. (This can be forced using an UpdateCommand too.) I have a few processes in mind:
After the aggregate creation, I store the aggregate ID and the expiry time in an RDBMS.
In a cron job I query the database for expired aggregates, and submit an UpdateCommand
Others include emitting UpdateCommands (or events?) from the read side.
Using a saga to coordinate updates, this is similar to the first. But either way, I have to store the expiry times.
So, I have to store the events and write into a database on the write side transactionally. However, I am not sure if creating a read-side for the write-side (?) is the correct solution in the DDD world, or is it applicable? What are the recommended solutions?

I also need to run some commands after some time expires.
For example, I need to emit a ContractExpiredEvent after 1 year (the ContractAggregate decides when, but usually it is 1 year). The problem is that the Aggregate must be the one that decides when and what command to execute, so this is a Domain concern more than an Infrastructure one.
How did I do that? I was inspired by Udi Dahan's video in which he introduces the term Timeout. Long story short, the Aggregate requests that a command be sent to itself after a period of time passes. It does that by yielding it from a command handler. The underlying CQRS framework gets that scheduled command and persists it in a special repository. Then, a cron job processes all scheduled commands when their time comes.

ES and DDD are well compatible.
However, I am not sure if creating a read-side for the write-side (?) is the correct solution in the DDD world, or is it applicable?
Yes, it's a part of domain aggregate in your case (if you talk about storing expiry times on write-side).
So, I have to store the events and write into a database on the write side transactionally.
I suggest using a saga for writing into the database.

John Carmack, 1998:
If you don't consider time an input value, think about it until you do -- it is an important concept
The pattern you should be looking for is that the real world (where time is) tells the aggregate the current time, and the aggregate decides whether or not to expire itself.
With that pattern in place, you can use any strategy you like for scheduling when the real world tells the aggregate what time it is.
You don't need immediately consistent scheduling in the aggregate, you just need some idempotent message handling and an "at least once" delivery process.
the aggregate has a method which can cause an update if it is necessary based on the current time, not blindly. At some time I have to fetch the right aggregate from the store, call that method and store the changes back (if any), or retry later, right?
Yes, that's the right idea.
Notice that if you call that method twice after the expiration time, the first call will load the history, append the expiration events, and store the updated history. The second call loads the history, can see that the aggregate is already expired, and returns without making any change to the history.
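To make the "time is an input" idea concrete, here is a minimal sketch in Python. The Contract class, the check_expiry method and the event shape are all hypothetical names chosen for illustration; they are not from any particular framework. The caller passes the current time in, the aggregate decides whether to expire, and a second call after expiry changes nothing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Contract:
    """Hypothetical event-sourced aggregate; all names are illustrative."""
    expires_at: datetime
    expired: bool = False
    history: list = field(default_factory=list)

    def apply(self, event: dict) -> None:
        if event["type"] == "ContractExpired":
            self.expired = True
        self.history.append(event)

    def check_expiry(self, now: datetime) -> list:
        """The caller supplies the current time; the aggregate decides."""
        if self.expired or now < self.expires_at:
            return []  # not yet due, or already expired: no new events
        event = {"type": "ContractExpired", "at": now.isoformat()}
        self.apply(event)
        return [event]


created = datetime(2024, 1, 1)
contract = Contract(expires_at=created + timedelta(days=365))

too_early = contract.check_expiry(datetime(2024, 6, 1))
first = contract.check_expiry(datetime(2025, 1, 2))
second = contract.check_expiry(datetime(2025, 1, 3))  # same message redelivered
```

Because the second call is a no-op, whatever scheduler tells the aggregate the time (cron, saga, queue) only needs at-least-once delivery.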

You can also use bi-temporal event sourcing. When events are stored, there are two dates:
the date when the event is added to the database (createdAt)
the date when the event has to be applied (validFrom)
The events are then applied in the order defined by validFrom property.
Using this, you can:
"fix the past" by adding a new event (createdAt = now and validFrom = now - x)
schedule events in the future by adding a new event (createdAt = now and validFrom = now + y)
I suggest watching this great talk by Thomas Pierrain at DDD Europe 2018: https://www.youtube.com/watch?v=xzekp1RuZbM
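A rough sketch of the two-date idea in Python, assuming an in-memory list as the store. The Event class, the events_in_effect helper and the tie-break on created_at are assumptions made for this example, not something taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    name: str
    created_at: datetime  # when the event was added to the store
    valid_from: datetime  # when the event has to be applied


def events_in_effect(events, as_of):
    """Events already valid at `as_of`, ordered by valid_from.

    Ties are broken by created_at, which is an assumption of this sketch.
    """
    due = [e for e in events if e.valid_from <= as_of]
    return sorted(due, key=lambda e: (e.valid_from, e.created_at))


now = datetime(2024, 5, 1, 12, 0)
store = [
    Event("ContractSigned", now - timedelta(days=10), now - timedelta(days=10)),
    Event("ExpiryScheduled", now, now + timedelta(days=30)),   # scheduled in the future
    Event("AddressCorrected", now, now - timedelta(days=20)),  # "fix the past"
]

current = [e.name for e in events_in_effect(store, now)]
later = [e.name for e in events_in_effect(store, now + timedelta(days=31))]
```

The future event is already stored but stays invisible to the aggregate until its validFrom date has passed.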

Related

CQRS Aggregate and Projection consistency

An Aggregate can use a View; this is described in Vaughn Vernon's book:
Such Read Model Projections are frequently used to expose information to various clients (such as desktop and Web user interfaces), but they are also quite useful for sharing information between Bounded Contexts and their Aggregates. Consider the scenario where an Invoice Aggregate needs some Customer information (for example, name, billing address, and tax ID) in order to calculate and prepare a proper Invoice. We can capture this information in an easy-to-consume form via CustomerBillingProjection, which will create and maintain an exclusive instance of CustomerBillingView. This Read Model is available to the Invoice Aggregate through the Domain Service named IProvideCustomerBillingInformation. Under the covers this Domain Service just queries the document store for the appropriate instance of the CustomerBillingView.
Let's imagine our application should allow creating many users, but with unique names. Commands/Events flow:
CreateUser{Alice} command sent
UserAggregate checks UsersListView, since there are no users with name Alice, aggregate decides to create user and publish event.
UserCreated{Alice} event published // By UserAggregate
UsersListProjection processed UserCreated{Alice} // for simplicity let's think UsersListProjection just accumulates users names if receives UserCreated event.
CreateUser{Bob} command sent
UserAggregate checks UsersListView, since there are no users with name Bob, aggregate decides to create user and publish event.
UserCreated{Bob} event published // By UserAggregate
CreateUser{Bob} command sent
UserAggregate checks UsersListView, since there are no users with name Bob, aggregate decides to create user and publish event.
UsersListProjection processed UserCreated{Bob} .
UsersListProjection processed UserCreated{Bob} .
The problem is that UsersListProjection did not have time to process the event and contained stale data, and the aggregate used this stale data. As a result, 2 users with the same name were created.
how to avoid such situations?
how to make aggregates and projections consistent?
how to make aggregates and projections consistent?
In the common case, we don't. Projections are consistent with the aggregate at some time in the past, but do not necessarily have all of the latest updates. That's part of the point: we give up "immediate consistency" in exchange for other (higher leverage) benefits.
The duplication that you refer to is usually solved a different way: by using conditional writes to the book of record.
In your example, we would normally design the system so that the second attempt to write Bob to our data store would fail because of a conflict. Also, we prevent duplicates from propagating by ensuring that the write to the data store happens-before any events are made visible.
What this gives us, in effect, is a "first writer wins" write strategy. The writer that loses the data race has to retry/fail/etc.
(As a rule, this depends on the idea that both attempts to create Bob write that information to the same place, using the same locks.)
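A minimal sketch of such a conditional write, using Python's built-in sqlite3 as a stand-in for the book of record. The users table and the create_user function are hypothetical; the point is only that a uniqueness constraint in the store decides the race, not a projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The uniqueness constraint is what makes the write conditional.
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")


def create_user(name: str) -> bool:
    """Return True if this writer won, False if the name was already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        return False  # lost the race: retry, fail, or apologise
    # Only after the write is accepted would UserCreated be made visible.
    return True


results = [create_user("Alice"), create_user("Bob"), create_user("Bob")]
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The second CreateUser{Bob} fails at the store, regardless of how far behind UsersListProjection is.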
A common design to reduce the probability of conflict is to NOT use the "read model" of the aggregate itself, but to instead use its own data in the data store. That doesn't necessarily eliminate all data races, but you reduce the width of the window.
Finally, we fall back on Memories, Guesses and Apologies.
It's important to remember in CQRS that every write model is also a read model for the reads that are required to validate a command. Those reads are:
checking for the existence of an aggregate with a particular ID
loading the latest version of an entire aggregate
In general a CQRS/ES implementation will provide that read model for you. The particulars of how that's implemented will depend on the implementation.
Those are the only reads a command-handler ever needs to perform, and if a query can be answered with no more than those reads, the query can be expressed as a command (e.g. GetUserByName{Alice}) which when handled does not emit events. The benefit of such read-only commands is that they can be strongly consistent because they are limited to a single aggregate. Not all queries, of course, can be expressed this way, and if the query can tolerate eventual consistency, it may not be worth paying the coordination tax for strong consistency that you typically pay by making it a read-only command. (Command handling limited to a single aggregate is generally strongly consistent, but there are cases, e.g. when the events form a CRDT and an aggregate can live in multiple datacenters where even that consistency is loosened).
So with that in mind:
CreateUser{Alice} received
user Alice does not exist
persist UserCreated{Alice}
CreateUser{Alice} acknowledged (e.g. HTTP 200, ack to *MQ, Kafka offset commit)
UserListProjection updated from UserCreated{Alice}
CreateUser{Bob} received
user Bob does not exist
persist UserCreated{Bob}
CreateUser{Bob} acknowledged
CreateUser{Bob} received
user Bob already exists
command-handler for an existing user rejects the command and persists no events (it may log that an attempt to create a duplicate user was made)
CreateUser{Bob} ack'd with failure (e.g. HTTP 409, ack to *MQ, Kafka offset commit)
UserListProjection updated from UserCreated{Bob}
Note that while the UserListProjection can answer the question "does this user exist?", the fact that the write-side can also (and more consistently) answer that question does not in and of itself make that projection superfluous. UserListProjection can also answer questions like "who are all of the users?" or "which users have two consecutive vowels in their name?" which the write-side cannot answer.

How to process Read Model in CQRS

We want to implement CQRS in our new design. We have some doubts about processing in the command handler and the read model. We understand that while processing commands we should take an optimistic lock on the aggregateId. But what approach should be taken while processing read models? Should we take a lock on the entire read model, on the aggregateId, or never take a lock while processing the read model?
Case 1: take a lock on the entire read model. It is the safest, but it is not good in terms of speed.
Case 2: take a lock on the aggregateId. Here two issues may arise. If we lock per aggregateId, what happens if the read model server restarts? It does not know where to start again.
Case 3: never take a lock. With this approach, I think the data may end up in a corrupted state. For example, say an order inserted event is generated and, through some workflow/saga, an order updated event takes place as well. What if the order updated event comes first and the order inserted event is not yet processed?
I hope I have explained my issue clearly.
If you do not process events concurrently in the Readmodel then there is no need for a lock. This is the case when you have a single instance of the Readmodel, possibly in a microservice, that polls for events and processes them sequentially.
If you have a synchronous Readmodel (i.e. in the same process as the Writemodel/Aggregate) then most probably you will need locking.
An important thing to keep in mind is that a Readmodel most probably differs from the Writemodel. There could be many Writemodel types whose events are projected into the same Readmodel. For example, in an ecommerce shop you could have a ListOfProducts that projects events from Vendor and from Product Aggregates. This means that when we speak about a Readmodel we cannot simply refer to "the Aggregate", because there is no single Aggregate involved. In the case of ecommerce, when we say "the Aggregate" we might refer to the Product Aggregate or the Vendor Aggregate.
But what to lock? That depends on the database technology. You should lock the smallest affected read entity or collection that can be locked. In a Readmodel that consists of a list of products (read entities, not aggregates!), when an event affects only one product (e.g. ProductTitleRenamed), you should lock only that product.
If an event affects more than one product then you should lock the entire collection. For example, VendorWasBlocked affects all the products (it should remove all the products from that vendor).
You need the locking for events that have non-idempotent side effects, for the case where the Readmodel's updater fails while processing an event and you want to retry/resume from where it left off. If the event has idempotent side effects then it can be retried safely.
In order to know from where to resume in case of a failed Readmodel, you could store inside the Readmodel the sequence of the last processed event. In this case, if the entity update succeeds then the last processed event's sequence is also saved. If it fails then you know that the event was not processed.
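Here is one way this could look, sketched with Python's sqlite3. The products and checkpoint tables and the handle function are made up for the example. The entity update and the last processed sequence are written in the same transaction, so either both are saved or neither is.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id TEXT PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE checkpoint (last_seq INTEGER NOT NULL)")
db.execute("INSERT INTO checkpoint (last_seq) VALUES (0)")
db.commit()


def last_processed() -> int:
    return db.execute("SELECT last_seq FROM checkpoint").fetchone()[0]


def handle(seq: int, event: dict) -> bool:
    """Apply one event; skip it if it was already processed."""
    if seq <= last_processed():
        return False  # duplicate delivery after a retry or restart
    with db:  # one transaction: entity update and checkpoint succeed or fail together
        db.execute(
            "INSERT OR REPLACE INTO products (id, title) VALUES (?, ?)",
            (event["product_id"], event["title"]),
        )
        db.execute("UPDATE checkpoint SET last_seq = ?", (seq,))
    return True


applied = [
    handle(1, {"product_id": "p1", "title": "Old title"}),
    handle(2, {"product_id": "p1", "title": "New title"}),
    handle(2, {"product_id": "p1", "title": "New title"}),  # redelivered
]
title = db.execute("SELECT title FROM products WHERE id = 'p1'").fetchone()[0]
```

After a restart the updater reads last_seq and asks the event source for everything after it.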
For example, say an order inserted event is generated and, through some workflow/saga, an order updated event takes place as well. What if the order updated event comes first and the order inserted event is not yet processed?
Read models are usually easier to reason about if you think about them polling for ordered sequences of events, rather than reacting to unordered notifications.
A single read model might depend on events from more than one aggregate, so aggregate locking is unlikely to be your most general answer.
That also means, if we are polling, that we need to keep track of the position of multiple streams of data. In other words, our read model probably includes meta data that tells us what version of each source was used.
The locking is likely to depend on the nature of your backing store / cache. But an optimistic approach
read the current representation
compute the new representation
compare and swap
is, again, usually easy to reason about.
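The three steps above can be sketched in Python roughly as follows. VersionedStore is an in-memory stand-in invented for this example; a real backing store would offer its own conditional-update primitive, and the retry limit is an arbitrary choice.

```python
import threading


class VersionedStore:
    """In-memory stand-in for a store that supports compare-and-swap."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_swap(self, key, expected_version, value) -> bool:
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone else wrote first
            self._data[key] = (expected_version + 1, value)
            return True


def update(store, key, compute, max_attempts=5):
    for _ in range(max_attempts):
        version, current = store.read(key)   # read the current representation
        new_value = compute(current)         # compute the new representation
        if store.compare_and_swap(key, version, new_value):  # compare and swap
            return new_value
    raise RuntimeError("too much contention")


store = VersionedStore()
update(store, "order-1", lambda cur: {"status": "inserted"})
update(store, "order-1", lambda cur: {**cur, "status": "updated"})
stale_write = store.compare_and_swap("order-1", 1, {"status": "stale"})
final = store.read("order-1")
```

The stale write is rejected because it was computed against version 1 while the stored representation is already at version 2.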

How are the missing events replayed?

I am trying to learn more about CQRS and Event Sourcing (Event Store).
My understanding is that a message queue/bus is not normally used in this scenario - a message bus can be used to facilitate communication between Microservices, however it is not typically used specifically for CQRS. However, the way I see it at the moment - a message bus would be very useful guaranteeing that the read model is eventually in sync hence eventual consistency e.g. when the server hosting the read model database is brought back online.
I understand that eventual consistency is often acceptable with CQRS. My question is: how does the read side know it is out of sync with the write side? For example, let's say there are 2,000,000 events created in Event Store on a typical day and 1,999,050 are also written to the read store. The remaining 950 events are not written because of a software bug somewhere or because the server hosting the read model is offline for a few seconds, etc. How does eventual consistency work here? How does the application know to replay the 950 events that are missing at the end of the day or the x events that were missed because of the downtime ten minutes ago?
I have read questions on here over the last week or so which talk about messages being replayed from the event store, e.g. this one: CQRS - Event replay for read side. However, none talk about how this is done. Do I need to set up a scheduled task that runs once per day and replays all events that were created since the date the scheduled task last succeeded? Is there a more elegant approach?
I've used two approaches in my projects, depending on the requirements:
Synchronous, in-process Readmodels. After the events are persisted, in the same request lifetime, in the same process, the Readmodels are fed with those events. In case of a Readmodel's failure (bug or catchable error/exception) the error is logged and that Readmodel is just skipped and the next Readmodel is fed with the events and so on. Then follow the Sagas, that may generate commands that generate more events and the cycle is repeated.
I use this approach when the impact of a Readmodel's failure is acceptable by the business, when the readiness of a Readmodel's data is more important than the risk of failure. For example, they wanted the data immediately available in the UI.
The error log should be easily accessible on some admin panel so someone would look at it in case a client reports inconsistency between write/commands and read/query.
This also works if you have your Readmodels coupled to each other, i.e. one Readmodel needs data from another canonical Readmodel. Although this seems bad, it's not, it always depends. There are cases when you trade updater code/logic duplication with resilience.
Asynchronous, in-another-process Readmodel updater. I use this when I want total separation of a Readmodel from the other Readmodels, when a Readmodel's failure must not bring the whole read side down, or when a Readmodel needs another language, different from the monolith. Basically this is a microservice. When something bad happens inside a Readmodel, it is necessary that some authoritative higher-level component is notified, i.e. an Admin is notified by email or SMS or whatever.
The Readmodel should also have a status panel, with all kinds of metrics about the events that it has processed, whether there are gaps, whether there are errors or warnings; it should also have a command panel where an Admin could rebuild it at any time, preferably without system downtime.
In any approach, the Readmodels should be easily rebuildable.
How would you choose between a pull approach and a push approach? Would you use a message queue with a push (events)
I prefer the pull based approach because:
it does not use another stateful component like a message queue, another thing that must be managed, that consumes resources and that can (so it will) fail
every Readmodel consumes the events at the rate it wants
every Readmodel can easily change at any moment what event types it consumes
every Readmodel can easily be rebuilt at any time by requesting all the events from the beginning
the order of events is exactly the same as in the source of truth because you pull from the source of truth
There are cases when I would choose a message queue:
you need the events to be available even if the Event store is not
you need competing/parallel consumers
you don't want to track what messages you consume; as they are consumed they are removed automatically from the queue
This talk from Greg Young may help.
How does the application know to replay the 950 events that are missing at the end of the day or the x events that were missed because of the downtime ten minutes ago?
So there are two different approaches here.
One is perhaps simpler than you expect - each time you need to rebuild a read model, just start from event 0 in the stream.
Yeah, the scale on that will eventually suck, so you won't want that to be your first strategy. But notice that it does work.
For updates with not-so-embarrassing scaling properties, the usual idea is that the read model tracks metadata about the stream position used to construct the previous model. Thus, the query from the read model becomes "What has happened since event #1,999,050?"
In the case of event store, the call might look something like
EventStore.ReadStreamEventsForwardAsync(stream, 1999050, 100, false)
Application doesn't know it hasn't processed some events due to a bug.
First of all, I don't understand why you assume that the number of events written on the write side must equal number of events processed by read side. Some projections may subscribe to the same event and some events may have no subscriptions on the read side.
In case of a bug in a projection or in the infrastructure that resulted in a certain projection being invalid, you might need to rebuild this projection. In most cases this would be a manual intervention that resets the projection's checkpoint to 0 (the beginning of time), so the projection picks up all events from the event store from scratch and reprocesses them.
The event store should have a global sequence number across all events starting, say, at 1.
Each projection has a position tracking where it is along the sequence number. The projections are like logical queues.
You can clear a projection's data and reset the position back to 0 and it should be rebuilt.
In your case the projection fails for some reason, like the server going offline, at position 1,999,050 but when the server starts up again it will continue from this point.
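A small Python sketch of a projection that tracks its position. A plain list stands in for the event store's global sequence (index i holds the event with sequence number i + 1), and the Projection class is hypothetical. Catching up after downtime and rebuilding from 0 use the same code path.

```python
class Projection:
    """Hypothetical projection that counts events per type."""

    def __init__(self):
        self.position = 0  # global sequence number of the last event applied
        self.counts = {}

    def catch_up(self, event_log, batch_size=100):
        """Pull everything after our position, one batch at a time."""
        while True:
            batch = event_log[self.position:self.position + batch_size]
            if not batch:
                return
            for event_type in batch:
                self.counts[event_type] = self.counts.get(event_type, 0) + 1
                self.position += 1

    def rebuild(self, event_log):
        """Clear the data, reset the position to 0 and replay everything."""
        self.position, self.counts = 0, {}
        self.catch_up(event_log)


log = ["OrderInserted"] * 3
projection = Projection()
projection.catch_up(log)
position_before_outage = projection.position

log += ["OrderUpdated", "OrderInserted"]  # written while the read side was offline
projection.catch_up(log)                  # resumes from position 3, not from 0
counts_after_catch_up = dict(projection.counts)

projection.rebuild(log)
```

Nothing has to detect the "missing" events: the gap is simply the difference between the projection's position and the end of the log.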

EventSourcing race condition

Here is a nice article that describes what ES is and how to deal with it.
Everything is fine there, but one image is bothering me. Here it is
I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... How do we ensure that we don't book more seats than available? This is especially a problem if there are many concurrent requests.
It may happen that n aggregates are populated with the same amount of reserved seats, and all of these aggregate instances allow reservations.
I understand that in distributed event-based systems we are able to achieve eventual consistency only. Anyway ... How do we prevent booking more seats than we have, especially with many concurrent requests?
All events are private to the command running them until the book of record acknowledges a successful write. So we don't share the events at all, and we don't report back to the caller, without knowing that our version of "what happened next" was accepted by the book of record.
The write of events is analogous to a compare-and-swap of the tail pointer in the aggregate history. If another command has changed the tail pointer while we were running, our swap fails, and we have to mitigate/retry/fail.
In practice, this is usually implemented by having the write command to the book of record include an expected position for the write. (Example: ES-ExpectedVersion in GES).
The book of record is expected to reject the write if the expected position is in the wrong place. Think of the position as a unique key in a table in a RDBMS, and you have the right idea.
This means, effectively, that the writes to the event stream are actually consistent -- the book of record only permits the write if the position you write to is correct, which means that the position hasn't changed since the copy of the history you loaded was written.
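As an illustration, here is an in-memory Python sketch of an append with an expected position. EventStream, ConcurrencyError and reserve_seat are names invented for this example; a real store would perform the check and the write atomically on its side.

```python
class ConcurrencyError(Exception):
    pass


class EventStream:
    """In-memory stand-in for one aggregate's history in the book of record."""

    def __init__(self):
        self.events = []

    def append(self, new_events, expected_version):
        # A real store must perform this check and the write atomically.
        if len(self.events) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(self.events)}"
            )
        self.events.extend(new_events)


def reserve_seat(stream, capacity):
    history = list(stream.events)  # load the history
    version = len(history)
    reserved = sum(1 for e in history if e == "SeatReserved")
    if reserved >= capacity:
        raise ValueError("sold out")
    return ["SeatReserved"], version  # the decision and the version it was based on


stream = EventStream()
# Two concurrent commands load the same (empty) history for the last seat.
events_a, version_a = reserve_seat(stream, capacity=1)
events_b, version_b = reserve_seat(stream, capacity=1)

stream.append(events_a, version_a)  # first writer wins
try:
    stream.append(events_b, version_b)
    conflict = False
except ConcurrencyError:
    conflict = True  # the loser must reload and retry

try:
    reserve_seat(stream, capacity=1)  # the retry sees the reservation
    sold_out = False
except ValueError:
    sold_out = True
```

Both commands passed validation against the same history, but only one of them could write at position 0, so the seat is booked once.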
It's typical for commands to read event streams directly from the book of record, rather than the eventually consistent read models.
It may happen that n-AggregateRoots will be populated with the same amount of reserved seats, it means having validation in the reserve method won't help, though. Then n-AggregateRoots will emit the event of successful reservation.
Every bit of state needs to be supervised by a single aggregate root. You can have n different copies of that root running, all competing to write to the same history, but the compare and swap operation will only permit one winner, which ensures that "the" aggregate has a single internally consistent history.
There are going to be a couple of ways to deal with such a scenario.
First off, an event stream would have as its current version the version of the last event added. This means that you would not, or should not, be able to persist the event stream if it is no longer at the version it had when loaded. Since the very first write would cause the version of the event stream to be increased, the second write would not be permitted. Since events are not emitted, per se, but are rather a result of the event sourcing, we would not have the type of race condition in your example.
Well, if your commands are processed behind a queue any failures should be retried. Should it not be possible to process the request you would enter the normal "I'm sorry, Dave. I'm afraid I can't do that" scenario by letting the user know that they should try something else.
Another option is to start the processing by issuing an update against some table row to serialize any calls to the aggregate. Probably not the most elegant but it does cause a system-wide block on the processing.
I guess, to a large extent, one cannot really trust the read store when it comes to transactional processing.
Hope that helps :)

Why limit commands and events to one aggregate? CQRS + ES + DDD

Please explain why modifying many aggregates at the same time is a bad idea when doing CQRS, ES and DDD. Are there any situations where it could still be OK?
Take for example a command such as PurgeAllCompletedTodos. I want this command to lead to one event that updates the state of each completed Todo aggregate by setting IsActive to false.
Why is this not good?
One reason I could think of:
When updating the domain state it's probably good to limit the transaction to a well-defined part of the entire state, so that only this part needs to be write-locked during the update. Doing so would allow many writes on different aggregates in parallel, which could boost performance in some extremely heavy scenarios.
The answer to the question lies in the meaning of "aggregate".
First of all, I would say that you are not modifying 'n' aggregates, you are modifying 'n' entities.
An aggregate contains more than one entity, and it is just a transaction concept: the aggregate (pattern) is used when you need to modify the state of more than one entity in your application transactionally (all are modified or none).
Now, why would you modify more than one aggregate with one command?
If you feel this need, before doing anything else check your aggregate boundaries to see whether you can change them to remove the need for 1 command -> 'n' aggregates.
An aggregate can contain many entities of the same type, so for your command PurgeAllCompletedTodos you could also think about expanding the transaction boundary from a single Todo to a UserTodosAggregate that contains all of the user's todos, and let it handle all the commands for the todos of a single user.
In this way you can modify all the todos of a user in a single transaction.
If this still doesn't solve your problem because, let's say, you need to purge all completed todos of every user in the application, you would still need to send a command to 'n' aggregates. The aggregate boundary doesn't help, so we can think of having an AllApplicationTodosAggregate that handles the command.
This probably isn't the best solution because, as you said, that command would block ALL the todos of the application, but always check whether it can be a good trade-off (this blocking is explained very well in both the Blue Book and the Red Book of DDD).
What if I need to modify some entities and can't have them in a single aggregate?
As said above, a command that modifies more than one aggregate is bad because of transactions. What if you modify 3 aggregates, the first succeeds, and then the server is shut down?
In this case you have a lot of single modifications that need to be managed to prevent the system from becoming inconsistent.
This can be done using a process manager, whose responsibilities are to modify all the aggregates by sending them the right commands and to manage failures if they happen.
An aggregate still receives its own command, but the process manager is in charge of sending them in whatever way it chooses (one at a time, all in parallel, 5 at a time, whatever you want).
So you can have a strategy to manage a failure between two transactions, and make decisions like: "if something fails, roll back all the modifications done until now" (sending a rollback command to each aggregate), or "if an operation fails, repeat it 3 times every 30 minutes and if it still doesn't work then roll back", or "if something fails, create a notification for the system admin".
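A very reduced Python sketch of such a process manager. The PurgeCompletedTodos class, the send callable and the todo IDs are hypothetical, the waiting time between retries is left out, and a real one would persist its own state so it can resume after a crash.

```python
class PurgeCompletedTodos:
    """Hypothetical process manager: one command per aggregate, with retries."""

    def __init__(self, send, max_attempts=3):
        self.send = send  # callable(todo_id) that may raise RuntimeError
        self.max_attempts = max_attempts
        self.done, self.failed = [], []

    def run(self, todo_ids):
        for todo_id in todo_ids:  # one at a time; could also be batched
            for attempt in range(1, self.max_attempts + 1):
                try:
                    self.send(todo_id)  # each aggregate gets its own command
                    self.done.append(todo_id)
                    break
                except RuntimeError:
                    if attempt == self.max_attempts:
                        # Give up: compensate or notify an admin from here.
                        self.failed.append(todo_id)
        return self.done, self.failed


attempts = {}


def send(todo_id):
    """Fake command bus: todo-2 fails once, todo-3 always fails."""
    attempts[todo_id] = attempts.get(todo_id, 0) + 1
    if todo_id == "todo-3" or (todo_id == "todo-2" and attempts[todo_id] == 1):
        raise RuntimeError("aggregate unavailable")


done, failed = PurgeCompletedTodos(send).run(["todo-1", "todo-2", "todo-3"])
```

Each Todo is still changed in its own transaction; the process manager only keeps track of which ones succeeded and which ones need a follow-up.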
(sorry for the long post, at least hope it helps)

Resources