In Event-sourcing how to deal with failure in production?

Suppose I want to build an e-commerce system using an event-driven architecture with event sourcing. Say a user wants to buy a product priced at 1$, but I miscalculate and the price becomes 2$, so the user loses 2$ from his wallet instead of 1$. If this were CRUD, I could simply fix the bug, connect to the database host, and correct the user's wallet (and offer him an apology). But in event sourcing, as far as I know, we should not edit or delete events (only append), since the event log is the single source of truth. So how should I deal with this kind of failure? One thing I can think of is to create an admin page that can publish any kind of event, and fix it like this:
AccountCreatedEvent { userId: 1, balance: 3 }
ProductPurchasedEvent { userId: 1, price: 2 } // miscalculated: price should be 1$
DepositMoneyEvent { userId: 1, amount: 1 } // manually fixed by admin
I know it seems weird, but if I really have to fix the bug and also correct the data, how do we achieve that in event sourcing?

A common answer is that you look to the domain. For example, what business processes exist to mitigate the contingency that a customer is overcharged?
Does our business have a process for refunds?
The "right" answer is to implement that process. The resulting history will contain an event that overcharges the customer and, later, another event that refunds the overcharged amount to make things right.
This is, of course, exactly your fix it and apologize approach; the main difference being that we treat the error correction as part of the system, rather than being something we improvise.
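As a rough illustration of that history, here is a minimal Python sketch. The event names mirror the ones in the question, except `PurchaseRefunded`, which is a hypothetical name for whatever your refund process produces; the `balance_of` fold is likewise just an assumption about how the wallet is projected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountCreated:
    user_id: int
    balance: int


@dataclass(frozen=True)
class ProductPurchased:
    user_id: int
    price: int


@dataclass(frozen=True)
class PurchaseRefunded:  # hypothetical event name, not from the question
    user_id: int
    amount: int
    reason: str


def balance_of(events):
    """Fold the append-only history into the current wallet balance."""
    balance = 0
    for event in events:
        if isinstance(event, AccountCreated):
            balance = event.balance
        elif isinstance(event, ProductPurchased):
            balance -= event.price
        elif isinstance(event, PurchaseRefunded):
            balance += event.amount
    return balance


history = [
    AccountCreated(user_id=1, balance=3),
    ProductPurchased(user_id=1, price=2),  # the bug: should have been 1
]
# The fix is appended; nothing already in the stream is edited or deleted.
history.append(PurchaseRefunded(user_id=1, amount=1, reason="overcharged by pricing bug"))
```

The difference from a generic admin "deposit" is only in the name and the recorded reason, but that is what lets later readers of the stream tell a refund from an ordinary top-up.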
Memories, Guesses, and Apologies (Pat Helland, 2007) is a good starting point.
Another example would be a fault where the system did the right thing, but wrote down the wrong information. A common pattern here is to process this mistake the way it would be done in accounting - an event is produced to "reverse" the transcription error, and a new event is created to introduce the correct accounting.
Again, notice that this correction process is part of the domain of accounting. Our job here is to faithfully re-create the error correcting processes that already exist.
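The reverse-and-re-enter pattern might look roughly like the sketch below. The `Entry` shape, the `reverses` field and the helper names are all assumptions made for illustration; a real ledger would carry far more detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    entry_id: int
    account: str
    amount: int                      # signed amount; negative means money left the account
    reverses: Optional[int] = None   # id of the entry this one cancels, if any


def reverse_and_correct(ledger, wrong_id, correct_amount):
    """Append a reversing entry and a corrected entry; the original stays untouched."""
    wrong = next(e for e in ledger if e.entry_id == wrong_id)
    next_id = max(e.entry_id for e in ledger) + 1
    reversal = Entry(next_id, wrong.account, -wrong.amount, reverses=wrong.entry_id)
    corrected = Entry(next_id + 1, wrong.account, correct_amount)
    return ledger + [reversal, corrected]


def account_total(ledger, account):
    return sum(e.amount for e in ledger if e.account == account)


ledger = [Entry(1, "wallet:1", -2)]   # written down as 2, the real charge was 1
ledger = reverse_and_correct(ledger, wrong_id=1, correct_amount=-1)
```

After the correction the ledger holds three entries: the original mistake, its reversal, and the corrected charge.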
The basic pattern is the same throughout; we add more events to the system to correct the mistakes (and more events to correct mistakes in the corrections; it's turtles all the way down).
When you've got processes that are triggered by the events that appear in the stream, you may end up playing "chase".
We overcharged the customer, but this meant that the customer was automatically enrolled in the VIP discount program. When we fix the error, do we also need to remove their VIP status? What happens to the discounted purchases they made before the error was discovered?

Related

How to name an event describing the acknowledgment of the existence of an entity in an event sourced system?

I am new to Event Sourcing and I am considering using it for an industrial application to track events happening in a production facility.
Since the book of record is the production facility itself and not the system, and also because not everything is automated, workers will need to report at a given point in time (the recorded time) what they did at another point in time (the effective time). Therefore, I will be using events such as: TankFilledRecorded, TankOutputConnectedToPipeInputRecorded, ContainerMovedToFacilityAreaRecorded, etc. where these events refer to entities such as a tank, a pipe, or a facility area for example. These events will have both a recorded time and an effective time. Note that there is no submission or approval process for a record to be considered legit.
Domain-driven design (DDD) encourages designing events that are representative of what happens in the domain (like the ones above).
However, in my domain, I don’t care so much about how a tank, a pipe or a facility area came into existence. I just need to know that something exists from a particular point in time, and I also need to know if it is no longer there after a particular point in time. The main objective of the software is to track liquids and powders flowing in a circuit made of these pipes, tanks and other components. It is not an asset management system and should not become one.
Therefore, what would be the correct DDD way to design an event that represents the fact that there is a tank, a pipe or an area in the production facility?
It is a subtle question but language is important, particularly in DDD.
Here is what I came up with:
1. EntityExistenceAcknowledgmentRecorded
TankExistenceAcknowledgmentRecorded
PipeExistenceAcknowledgmentRecorded
FacilityAreaExistenceAcknowledgmentRecorded
TankDisappearanceAcknowledgmentRecorded
PipeDisappearanceAcknowledgmentRecorded
FacilityAreaDisappearanceAcknowledgmentRecorded
It seems awful to use this in the ubiquitous language. I don’t see myself talking in these terms or providing a UI with such vocabulary. But it does represent exactly what happens though.
2. EntityRegistered
TankRegistered
PipeRegistered
FacilityAreaRegistered
TankUnregistered
PipeUnregistered
FacilityAreaUnregistered
It seems much simpler and it also seems to be meaningful except for one thing. “Registered” conveys the existence of the representation of an entity in the system with immediate effect, without the possibility of saying now that the entity existed 2 days ago. Think about a UserRegistered event in a website that would indicate that the user “existed” from 10 days ago. What would that even mean?
Events are facts and you cannot change the past. However, I do need a way for my users to invalidate a record in which they made a mistake such as a typo. They can record now that they acknowledged the existence of a facility area a week ago, and might realize later that there was something wrong, such as a typo in the name of the entity. They would invalidate the record and create a new one. But invalidating something that has been “registered” does not sound right.
3. Keep looking
Try to dig more in the domain (event storming) and find the real events that brought the entities into existence even if these events are of no use in the problem that needs to be solved.
TankBuiltRecorded
PipeBuiltRecorded, PipeDeliveredRecorded
FacilityArea<something_meaningful>Recorded
TankDestroyedRecorded, TankDecommissionedRecorded
PipeDecommissionedRecorded
FacilityArea<something_meaningful>Recorded

A caution
TankFilled
TankFilledReported
TankFilledReportSubmitted
TankFilledReportSubmissionReceived
Think carefully about whether the increased precision is motivated by business value.
Therefore, what would be the correct DDD way to design an event that represents the fact that there is a tank, a pipe or an area in the production facility?
What is the business doing today? Is there already a process in place for tracking the lifetime of the hardware in the plant (a maintenance log, perhaps?) There's likely to be vocabulary in that place that gives you ideas as to what spellings would make sense in the code.
Events are facts and you cannot change the past.
That's true - but you can back date events. The effective date of the information is often distinct from the reported date of information.
I do need a way for my users to invalidate a record in which they made a mistake such as a typo.
Yes - error correction is an important part of the process that you are modeling.
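To make the two points above concrete (back-dating and invalidation), here is a small Python sketch. The `Record` shape, the `kind` values and the `known_entities` query are assumptions for illustration only, not a recommended vocabulary for your domain.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Record:
    record_id: int
    kind: str                          # "Registered" or "Invalidated" (hypothetical kinds)
    entity: str
    effective_at: datetime             # when it was true in the plant
    recorded_at: datetime              # when the worker reported it
    invalidates: Optional[int] = None  # id of an earlier record this one retracts


def known_entities(log, effective, as_known_at):
    """Entities existing at `effective`, using only what had been reported by `as_known_at`."""
    visible = [r for r in log if r.recorded_at <= as_known_at]
    retracted = {r.invalidates for r in visible if r.invalidates is not None}
    return {
        r.entity
        for r in visible
        if r.kind == "Registered"
        and r.record_id not in retracted
        and r.effective_at <= effective
    }


log = [
    # Reported on March 8, effective a week earlier, with a typo in the name.
    Record(1, "Registered", "Aera 51", datetime(2024, 3, 1), datetime(2024, 3, 8)),
    # The next day the worker retracts the record and files a corrected one.
    Record(2, "Invalidated", "Aera 51", datetime(2024, 3, 1), datetime(2024, 3, 9), invalidates=1),
    Record(3, "Registered", "Area 51", datetime(2024, 3, 1), datetime(2024, 3, 9)),
]
```

Nothing is overwritten: asking what was known on March 8 still returns the misspelled area, while asking on March 10 returns only the corrected one.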
You should probably review Greg Young's talk Answering a Question, which was based on this thread. It's a discussion of capturing and modeling of temporality.
Here's the good news: you are running into the right problem. Because you are capturing information about an external system, there are going to be opportunities for errors and conflicts, and you need to (a) figure out the protocols for addressing them, and then (b) model that process correctly. That might include exception reports generated by the system when it observes conflicting information, or compensating events, or even automated conflict resolution (for the easy cases -- see also Stop Over Engineering).

Modeling one-to-many relations using Domain Driven Design

This question is more of a general question about how to model simple one-to-many relations using collections: should a change in a list item be reflected in the version of the aggregate containing it?
The domain is about meeting scheduling (like in Outlook).
I have a Meeting entity, which can have multiple Participants.
A participant can accept/decline meeting requests.
Rescheduling a meeting nullifies all of the participants' confirmations.
I thought of two ways to model this.
Option 1
The Meeting aggregate will contain a list of Participants where each Participant has a ParticipantId and a Status (accepted/denied).
The problem here is that every Accept or Deny command, for a specific participant, increments the Meeting's version, which means two participants will enter a race condition if trying to Accept the meeting request based on the same original version.
Although this could be solved by re-reading the Meeting's document and retrying the Accept command, it's quite annoying considering how often this could happen.
Another approach is to ignore the meeting's version when executing the Accept command, but this introduces a new problem: what happens if, after sending the meeting requests, the meeting has been rescheduled? In this case we can't afford to ignore the Meeting's version, because this time the version DOES represent a real version that should be considered.
BTW, is it at all a good practice to ignore the version in some of the commands and not in others?
Option 2
Extract a Participation aggregate out of Meeting.
Participation will have MeetingId, ParticipantId, and Status.
It will also have its own version.
This way, when participant X Accepts the meeting request, only the relevant Participation will be modified, and the rest will be left intact.
And, when rescheduling the meeting, a "Meeting Rescheduled" event will be published and an event handler will respond to it by resetting all of the Participations' statuses to "NotAccepted" regardless of their current version.
On the one hand this sounds logical in the sense that a meeting's version shouldn't be incremented just because someone accepted/denied its request.
On the other hand, modeling Participation as a standalone aggregate doesn't sound quite right to me, because it has no meaning outside of the context of the meeting.
Anyway, would love to get feedback on this and see the various approaches to this problem.

Although this could be solved by re-reading the Meeting's document and retrying the Accept command, it's quite annoying considering how often this could happen.
This looks like a modeling error. You should keep in mind that the meeting aggregate is not the book of record for the participants' availability - the real world is. So the message shouldn't be AcceptInvitation, but instead InvitationAccepted. There shouldn't be a conflict about this, because the domain model doesn't get to veto events outside of its authority boundary.
You might, depending on your implementation, end up with a concurrent modification exception in your plumbing, but that's something that you should be handling automatically (ie: expected version any, or a retry).
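A sketch of the "retry" variant of that plumbing, in Python. `InMemoryStream` is a made-up stand-in for whatever event store you use, and `record_with_retry` is a hypothetical helper; the only point is that the version check lives in the infrastructure and the reply is never rejected on business grounds.

```python
class ConcurrencyError(Exception):
    pass


class InMemoryStream:
    """Hypothetical append-only stream with an optimistic version check."""

    def __init__(self):
        self.events = []

    def read(self):
        return list(self.events), len(self.events)

    def append(self, event, expected_version):
        if expected_version != len(self.events):
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(self.events)}"
            )
        self.events.append(event)


def record_with_retry(stream, make_event, max_attempts=3):
    """Re-read the stream and try again whenever another writer got there first."""
    for _ in range(max_attempts):
        history, version = stream.read()
        event = make_event(history)
        try:
            stream.append(event, expected_version=version)
            return event
        except ConcurrencyError:
            continue
    raise ConcurrencyError(f"gave up after {max_attempts} attempts")


meeting = InMemoryStream()
record_with_retry(meeting, lambda history: ("InvitationAccepted", "alice"))
record_with_retry(meeting, lambda history: ("InvitationAccepted", "bob"))
```

With an "expected version any" append the loop disappears entirely; which of the two you pick depends on what your store offers.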
Another approach is to ignore the meeting's version when executing the Accept command, but this introduces a new problem: what happens if, after sending the meeting requests, the meeting has been rescheduled?
The solution here is to model more carefully. Yes, sometimes you will get a message that accepts or declines an invitation that has expired.
Put another way: race conditions don't exist.
A microsecond difference in timing shouldn’t make a difference to core business behaviors.
What happens to Alice, who replied instantly to the invitation, when the meeting is rescheduled? Why wouldn't the same thing happen to Bob, when his reply arrives just after the meeting is rescheduled?
Participation as a standalone aggregate doesn't sound quite right to me, because it has no meaning outside of the context of the meeting.
I find that heuristic isn't particularly effective. It's much more important to understand whether entities can change state independently, or if their changes need to be coordinated.
Actually, the Meeting aggregate is used to track the participants' availability. That's what its purpose is. Unless I didn't fully understand you...
It's a bit subtle, and I didn't spell it out very well.
Suppose the model says that I'm available, but an emergency in the real world calls me away. What happens? Am I blocked from going to the hospital because the model says I have to go to a meeting? Can somebody cancel my emergency by changing the invitation I've submitted?
Furthermore, if I'm away on an emergency, are you available for a meeting that is scheduled for the same time as the meeting you and I were going to have?
In this space, the real world is the authority for whether or not somebody is available. The model is just looking at a cached copy of a message describing whether or not somebody was available in the past.
The cached information being used by the model is not guaranteed to be complete. See Greg Young on warehouse systems and exception reports.
which makes me think that perhaps the Meeting aggregate should have two version fields: one will be a strong version which, when incremented, represents a breaking change, and another soft version for non-breaking changes. Does this make any sense?
Not really. Version is not, as far as I know, a term taken from the ubiquitous language of scheduling meetings. It's metadata, if it exists at all, and the business rules in your model should not depend upon metadata.
I agree, but a Meeting ID (or any ID for that matter) is also not part of the ubiquitous language, yet I might pass it back and forth between my domain world and external worlds.

How to resolve Order and Warehouse bounded contexts dependency?

I am working on a DDD project and I am currently focused on two bounded contexts, Orders and Warehouse.
What confuses me is the following situation:
Orders keeps track of all the placed orders, and Warehouse keeps track of all the available inventory. If a user places an order for one item of a certain product, that means one less item of that product in the warehouse. I am oversimplifying this process, so please bear with me.
Since the two domain models live in different BCs, I am currently relying on eventual consistency, i.e. after an item has been sold, it is eventually removed from the warehouse.
Unfortunately, that leads to a problem scenario where another user could simultaneously order the same item, and it would appear as available until eventual consistency kicks in. The domain expert considers that unacceptable.
So IMO I am stuck with two options
1. Merge Orders and Warehouse (at least the part regarding product inventory, i.e. the units available in the warehouse) into one BC.
2. Make the Orders BC (or microservice, if you wish) depend on the Warehouse BC (microservice) in order to pull the live number of product units available.
Which option actually follows the DDD pattern? Is there another way out?

You could use a reservation system with a timeout.
Using a messaging analogy: With a broker-style queuing mechanism (such as RabbitMQ) you get a message from the queue and you have control over it until you either acknowledge that it can be removed from the queue or you release it back to the queue.
You could do the same thing in your ordering process: you reserve the items on your order. When you add an item it gets a status of, say, reserving, and a message is sent to reserve it. When the response comes back you can decide how to proceed. Perhaps you could put any items that cannot be reserved on back order, or try again later.
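A minimal sketch of such a reservation with a timeout, in Python. The `Stock` class, its method names and the 900-second default are all assumptions for illustration; time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Stock:
    """Hypothetical Warehouse-side model for a single product."""

    on_hand: int
    holds: dict = field(default_factory=dict)  # order_id -> (quantity, expires_at)

    def available(self, now):
        self._release_expired(now)
        return self.on_hand - sum(qty for qty, _ in self.holds.values())

    def reserve(self, order_id, quantity, now, ttl=900):
        """Hold items for `ttl` seconds; return False if not enough are free."""
        if quantity > self.available(now):
            return False
        self.holds[order_id] = (quantity, now + ttl)
        return True

    def confirm(self, order_id, now):
        """Turn a live hold into a real stock decrement (the 'acknowledge' step)."""
        self._release_expired(now)
        if order_id not in self.holds:
            return False
        quantity, _ = self.holds.pop(order_id)
        self.on_hand -= quantity
        return True

    def _release_expired(self, now):
        # Holds that were never confirmed go back to the pool, like an unacked message.
        self.holds = {o: h for o, h in self.holds.items() if h[1] > now}


stock = Stock(on_hand=1)
first = stock.reserve("order-A", 1, now=0)     # A holds the last item
second = stock.reserve("order-B", 1, now=10)   # B is refused while A's hold is live
```

If order A never confirms, its hold lapses after the timeout and order B can reserve the item on a later attempt.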
There are going to be different ways to approach this. Depending on your business case it may be acceptable to only check availability when someone really accepts the order.
If your domain expert reckons it is totally unacceptable to have this resolved at the end of the process, then you could move it to the start. The issue is of course that user A could reserve and never buy, thereby losing user B as a customer; whereas only applying the real "taking" of the item at the end of the process is closer to ensuring a purchase. But that is a business decision.

This issue is a really great example of where reality actually is eventually consistent. Is it really the best thing to decline an order if there is no inventory currently in the warehouse - even if there was a replenishment due in the next 20 minutes?
What if the item was actually on the shelf, but the operator hadn't yet keyed it into the system?
Sometimes designers and domain experts assume that people want 100% consistency, when really, users might be willing to accept a delay in confirmation of their order, if it increased the chance that their order would be accepted rather than rejected.
In the case above, why make it the user's job to retry their order N minutes later? In an eventually consistent system, you can accommodate such timing flexibility by including a timeout to retry the attempt to fulfill the order for a period of time before confirming to the client that it really wasn't possible.
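The retry-until-timeout idea can be sketched without any real clock by feeding in a series of stock observations. Everything here (the function name, the shape of the readings, the result tuples) is a hypothetical simplification; a real system would react to inventory events or poll on a schedule.

```python
def fulfil_with_patience(order_qty, stock_readings, deadline):
    """
    Keep trying to fulfil the order until `deadline`.

    stock_readings: iterable of (time, units_available) pairs in time order.
    Returns ("confirmed", time) for the first reading that covers the order,
    or ("rejected", deadline) if none does before the deadline.
    """
    for at, available in stock_readings:
        if at > deadline:
            break
        if available >= order_qty:
            return ("confirmed", at)
    return ("rejected", deadline)


# Nothing on the shelf at t=0 and t=10; a replenishment shows up at t=20.
readings = [(0, 0), (10, 0), (20, 5)]
patient = fulfil_with_patience(1, readings, deadline=30)
impatient = fulfil_with_patience(1, readings, deadline=15)
```

With the longer deadline the order is confirmed once the replenishment arrives; with the shorter one the customer is told it was not possible.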
There are technical solutions that will give you 100% consistency, but I think this is really not a technical challenge but a cultural/mindset one: changing people's minds about what is possible and acceptable, in order to achieve what is actually a better outcome.

IMO you can build a PlaceOrderSaga which will ask for inventory availability before placing the order.

Aggregate Root including tremendous number of children

I wonder how to model a Calendar using DDD and CQRS. My problem is the ever-increasing number of events. I consider Calendar to be an Aggregate Root which contains Events (Calendar Events). I don't want to use the read side in my commands, but I need a way to check for event collisions at the domain level.

I wonder how to model a Calendar using DDD and CQRS. My problem is the ever-increasing number of events.
The most common answer to "long lived" aggregates is to break that lifetime into episodes. An example of this would be the temporary accounts that an accountant will close at the end of the fiscal year.
In your specific case, probably not "the Calendar" so much as "the February calendar", "the March calendar", and so on, at whatever grain is appropriate in your domain.
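One way to picture the episodes is to derive the aggregate's identity from the period it covers. The sketch below assumes a monthly grain and a made-up id format; both are illustrative choices, not something the question prescribes.

```python
from datetime import date


def calendar_stream_id(owner, day):
    """One aggregate per owner per month; the monthly grain is an assumption."""
    return f"calendar-{owner}-{day.year:04d}-{day.month:02d}"


february = calendar_stream_id("alice", date(2024, 2, 14))
march = calendar_stream_id("alice", date(2024, 3, 1))
```

Each episode then only ever loads one month of Calendar Events. An event that spans a period boundary needs a rule of its own, which is a domain decision rather than a technical one.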
I'm not sure if I'm right about the DDD approach in terms of validation. I believe the point is not to allow the model to enter into an invalid state
Yes, but invalid state is a tricky thing to define. Udi Dahan offered this observation:
A microsecond difference in timing shouldn’t make a difference to core business behaviors.
More succinctly: if processing command A followed by command B produces a valid state, then you should also end up in a valid state when you process command B first, and then A.
Let's choose your "event collisions" example. Suppose we handle two commands scheduleMeeting(A) and scheduleMeeting(B), and the domain model understands that A and B collide. Riddle: how do we make sure the calendar stays in a valid state?
Without loss of generality, we can flip a coin to decide which command arrives first. My coin came up tails, so command B arrives first.
on scheduleMeeting(B):
    publish MeetingScheduled(B)
Now the command for meeting A arrives. If your valid calendars do not permit conflicts, then your implementation needs to look something like
on scheduleMeeting(A):
    throw DomainException(A conflicts with B)
On the other hand, if you embrace the idea that the order in which the commands arrive shouldn't influence the outcome, then you need to consider another approach. Perhaps
on scheduleMeeting(A):
    publish MeetingScheduled(A)
    publish ConflictDetected(A,B)
That is, the Calendar aggregate is modeled to track not only the scheduled events, but also the conflicts that have arisen.
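Here is a small Python sketch of that second approach, showing that both arrival orders end in the same state. The `Meeting` and `Calendar` classes, the hour-based times and the tuple-shaped events are assumptions made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Meeting:
    name: str
    start: int  # hour of day, kept as an int for simplicity
    end: int


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


@dataclass
class Calendar:
    meetings: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def schedule(self, meeting):
        # Scheduling never fails; collisions are recorded instead of rejected.
        self.events.append(("MeetingScheduled", meeting.name))
        for other in self.meetings:
            if overlaps(meeting, other):
                pair = tuple(sorted((meeting.name, other.name)))
                self.events.append(("ConflictDetected", pair))
        self.meetings.append(meeting)

    def conflicts(self):
        return {payload for kind, payload in self.events if kind == "ConflictDetected"}


a = Meeting("A", 9, 11)
b = Meeting("B", 10, 12)

b_first = Calendar()
b_first.schedule(b)
b_first.schedule(a)

a_first = Calendar()
a_first.schedule(a)
a_first.schedule(b)
```

Whichever command wins the coin flip, the calendar ends up holding both meetings and one recorded conflict between them; resolving that conflict becomes a separate step.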
See also: aggregates and RFC 2119

An Event could also be an Aggregate Root. I don't know your business constraints, but I think that if two Events collide you could notify the user somehow to take manual action. Otherwise, if you really, really need them not to collide, you could use snapshots to speed up the enormous Calendar AR.
I don't want to use the read side in my commands, but I need a way to check for event collisions at the domain level.
You cannot query the read model inside the aggregate command handler. For the collision detection I would create a special DetectCollisionSaga that subscribes to the EventScheduled event, checks (possibly asynchronously, if there are many Events) whether a collision has occurred, and notifies the user somehow.

Should I use Command to implement a domain derivations in CQRS

I'm using CQRS in an air booking application. One use case is helping customers cancel their tickets. But before the actual cancellation, the customer wants to know the penalty.
The penalty is calculated based on air rules. Some of our providers can calculate the penalty by exposing a web service, while the others don't (they publish a paper explaining the algorithm instead). So I defined a domain service:
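public interface AirTicketService {
    // ticket demand methods

    // Returns the penalty the customer would pay for cancelling this ticket.
    MonetaryAmount penalty(String ticketNumber);

    // Cancels the ticket, applying the given penalty.
    void cancel(String ticketNumber, MonetaryAmount penalty);
}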
My question is: which side (command or query) is responsible for invoking this domain service and returning the result in a CQRS-style application?
I want to use a Command, CalculatePenaltyCommand. This way it's easy to reuse the domain model, but it's a little odd because this command does not modify state.
Or, if this is a query, should I retrieve a read model of the ticket? But then the same domain service is needed on both the command and the query side, which is odd too.
Is domain derivation a query?

There is no need to shoehorn everything in to the command-query pipeline. You could query this service independently from the UI without issuing a command or asking the read-model.

There is nothing wrong with satisfying a query using an existing model if it "fits" both the terminology and the structure of that model. No need to build up a separate read model for that purpose. It's not without risk, since the semantics and the context of the query should be closely tied to the model that is otherwise used for write purposes only. The risk I allude to is the fact that the write and read concerns could drift apart (and we're back at square one, i.e. the reason why people pick CQRS in the first place). So you must keep paying attention as new requirements come in.
Queries that fit this model really well are what I call "simulators", where you want to run a simulation using current state, e.g. to give feedback to an end user. On more than one occasion I've found that the simulation logic could be reused both as a feedback mechanism and as an execution (of a write operation/command) steering mechanism. The difference is in what we do with the outcome of the simulation. Again, this is not without risk and requires careful judgement.
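A rough Python sketch of the simulator idea applied to the penalty question. The `Ticket` class and the tiered penalty rule are invented for illustration; real air rules are obviously more involved. What matters is that the query and the command share one side-effect-free calculation and differ only in what they do with its result.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Ticket:
    number: str
    fare: Decimal
    cancelled: bool = False


def simulate_cancellation(ticket, hours_before_departure):
    """Pure function, no state change. The two-tier rule is made up for this sketch."""
    rate = Decimal("0.10") if hours_before_departure >= 24 else Decimal("0.50")
    return (ticket.fare * rate).quantize(Decimal("0.01"))


def quote_penalty(ticket, hours_before_departure):
    """Feedback use: just show the customer the simulated outcome."""
    return simulate_cancellation(ticket, hours_before_departure)


def cancel(ticket, hours_before_departure, quoted_penalty):
    """Execution use: rerun the simulation, then act on it."""
    penalty = simulate_cancellation(ticket, hours_before_departure)
    if penalty != quoted_penalty:
        raise ValueError("penalty changed since it was quoted")
    ticket.cancelled = True
    return penalty


ticket = Ticket("T1", Decimal("200.00"))
quote = quote_penalty(ticket, hours_before_departure=48)
```

Quoting leaves the ticket untouched; only `cancel` changes state, and it refuses to proceed if the penalty no longer matches what the customer was shown.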

You could argue that a Calculate Penalty Command is not odd at all.
The user asks the system to do something - command enough.
You can even have a Penalty Calculation Requested event in your domain, and it would feel right. Because, at some point, you may be interested in, let's say, unsure clients: ones that want to cancel tickets but change their mind every time, etc. The calculation may be performed asynchronously, too - you can provide the result (penalty cost) to the user in various ways afterwards...
Or, in some other way: on your ticket booked event, store the cancellation penalty, too. Then you can make that value accessible at any time, without the need to recompute it... But this may be wrong (?) because the penalty would largely depend on time, right (the later you cancel your ticket, the more you pay)?
If all this looks like over-complication, then I guess I agree with rmac's answer, too :)
