How to handle projection errors by event sourcing and CQRS? - domain-driven-design

I want to use event sourcing and CQRS, and so I need projections (I hope I use the proper term) to update my query databases. How can I handle database errors?
For example one of my query cache databases is not available, but I already updated the others. So the not-available database won't be in snyc with the others when it comes back to business. How will it know that it have to run for instance the last 10 domain events from the event storage? I guess I have to store information about the current state of the databases, but what if that database state storage fails? Any ideas, best practices how to solve this kind of problems?

In either case, you must tell your messaging bus that the processing failed and it should redeliver the event later, in the hope that the database will be back online then. This is essentially why we are using message bus systems with an "at least once"-delivery guarantee.
For transactional query databases, you should also rollback the transaction, of course. If your query database(s) do not support transactions, you must make sure on the application side that updates are idempotent - i.e., if your event arrives on the next delivery attempt, your projection code and/or database must be designed such that the repeated processing of the event does not harm the state of the database. This is sometimes trivial to achieve (e.g., when the event leads to a changed person's name in the projection), but often not-so-trivial (e.g., when the projection simply increments view counts). But this is what you pay for when you are using non-transactional databases.

Related

Why do we need command model in CQRS

So I was watching this video Event Sourcing You are doing it wrong by David Schmitz at the 15:17 he was talking about eventual consistency in event-sourcing. At first, I was like oh I got it so this is why CQRS is helpful with Event-sourcing because we can validate this things through command model before publishing an event right? but after I did a few research I was wrong. So I wonder why command model even exists since we can just retrieve the request body (suppose it's http request) put some business logic and then publish event.
With Event Sourcing we store events. A model exists in your application to support applying your business logic before deciding to save new event(s). To be able to make that decision, it must be possible to consistently read/write events to materialize your model before applying your business logic on it when a command is processed.
You need to store the events consistently to be able to make decisions further on. If you only publish your events to other parts of the system your model cannot use them in a consistent way.
The publishing of events to other systems is something that potentially can happen as a side-effect to use these events to also create read-models/projections or to react to them in other ways.
It does not have to be a push/publish though. It is perfectly valid to have a pull-based solution where downstream systems poll for events.
For example, in Serialized we store events in Aggregates. Feeds are used to provide a poll-based (eventually-consistent) downstream view of these events, but there is no publishing at all.

Determine timestamp of replication of documents during live sync in Cloudant

Right now I have transaction syncs with clients that go offline and online frequently. This means that creation of a transaction document (when it goes into pouch) doesn't align with the point that it is entered into Couch.
Is there a way for me to tag these documents with a timestamp on confirmation of replication? I see there are advanced replication schedulers but the completed flag does not apply to live replication which is what we are using.
I have tried tagging the document before syncing it, but this doesn't account for issues of network delay or backend delay of replication. It simply is the time I started the sync of that document, there's no promise that it arrived at that point in CouchDB.
You would need to use an add-on like spiegel (using on_change documents to call back to an update function) or another (pouchdb?) client to observe the changes feed and add the timestamp of when it was available to that client from the couchdb (which might be a little delayed).
Such a client would be in danger of creating an infinite loop as #Flimzy indicated in the comments, unless it uses a rule to not re-update docs with existing timestamps so that it does not write when it is re-triggered by itself and therefore stops retriggering itself. Spiegel has support for such a rule and/or stopping an infinite loop could be part of an update function.

DomainEventPublisher consistency

Having just read Vaughn Vernon's effective aggregate design, I'm wondering about failures related to event publishing.
In the given example at page 9 (page 3 of the PDF), we call DomainEventPublisher.publish(). The event being published allows other aggregates to execute their behaviours.
What I'm wondering is: What happens if DomainEventPublisher.publish() fails ? What happens if DomainEventPublisher.publish() succeeds, but the transaction fails ?
How implementations handle these two cases ?
DomainEventPublisher.publish() is synchronous. You'd setup a generic handler (handles all events) which stores the events in the same database transaction as the business process, which means your event storage must have the ability to be transactionnal with whatever other storage mechanism you rely on to store the state of your aggregates.
Once events have been written on disk transactionnaly, you can then put them on a message queue for asynchronous delivery.
Are there other known ways to do it?
Well, rather than using a static DomainEventPublisher you could record events in a collection on the AR, just like in event sourcing and then implement a centralised mechanism to store them (e.g. transaction hooks, using aspects, etc.).
What happens if DomainEventPublisher.publish() succeeds, but the
transaction fails?
In this case I am against Vernon approach. I prefer to return the events to the application service. This way I can persist the changes performed by the aggregate using a transaction (if needed) and, if everything is Ok, I will publish the event. This also helps to keep the business layer entirely clean and pure.
In a few words; if the transaction fails then no event is raised.
What happens if DomainEventPublisher.publish() fails?
A domain event never fails, by business rules, because it's a notification of things that happened. If an aggregate said Yes to the operation and return a event expressing the business changes; then nothing in the world should say that this operation can not be done or has to be undone.
If the event fails by infrastructure then you need to have the tools to re-raise it (automatically or manually) when the outage is fixed and eventually archive the consistency in your system. Take a look at NServiceBus. It provides retries, error queues, logs and so on to never loose the events.
If the message system is down you have at least event logs that you can use to re-rise them into the message system.

How to reliably store event to Azure CosmosDB and dispatch to Event Grid exactly once

I'm experimenting with event sourcing / cqrs pattern using serverless architecture in Azure.
I've chosen Cosmos DB document database for Event Store and Azure Event Grid for dispachting events to denormalizers.
How do I achieve that events are reliably delivered to Event Grid exactly once, when the event is stored in Cosmos DB? I mean, if delivery to Event Grid fails, it shouldn't be stored in the Event Store, should it?
Look into Cosmos Db Change Feed. Built in event raiser/queue for each change in db. You can register one or many listeners/handlers. I.e. Azure functions.
This might be exactly what you are asking for.
Some suggest you can go directly to cosmos db and attach eventgrid at the backside of changefeed.
You cannot but you shouldn't do it anyway. Maybe there are some very complicated methods using distributed transactions but they are not scalable. You cannot atomically store and publish events because you are writing to two different persistences, with different transactional boundaries. You can have a synchronous CQRS monolith, but only if you are using the same technology for the events persistence and readmodels persistence.
In CQRS the application is split in Write/Command and Read/Query sides (this long video may help). You are trying to unify the two parts into a single one, a downgrade if you will. Instead you should treat them separately, with different models (see Domain driven design).
The Write side should not depend on the outcome from the Read side. This means, that after the Event store persist the events, the Write side is done. Also, the Write side should contain all the data it needs to do its job, the emitting of events based on the business rules.
If you have different technologies in the Write and Read part then your Read side should be decoupled from the Write side, that is, it should run in a separate thread/process.
One way to do this is to have a thread/process that listens to appends to the Event store, fetch new events then publish them to the Event Grid. If this process fails or is restarted, it should resume from where it left off. I don't know if CosmosDB supports this but MongoDB (also a document database) has the rslog that you can tail to get the new events, in a few milliseconds.

How to control idempotency of messages in an event-driven architecture?

I'm working on a project where DynamoDB is being used as database and every use case of the application is triggered by a message published after an item has been created/updated in DB. Currently the code follows this approach:
repository.save(entity);
messagePublisher.publish(event);
Udi Dahan has a video called Reliable Messaging Without Distributed Transactions where he talks about a solution to situations where a system can fail right after saving to DB but before publishing the message as messages are not part of a transaction. But in his solution I think he assumes using a SQL database as the process involves saving, as part of the transaction, the correlationId of the message being processed, the entity modification and the messages that are to be published. Using a NoSQL DB I cannot think of a clean way to store the information about the messages.
A solution would be using DynamoDB streams and subscribe to the events published either using a Lambda or another service to transformed them into domain-specific events. My problem with this is that I wouldn't be able to send the messages from the domain logic, the logic would be spread across the service processing the message and the Lambda/service reacting over changes and the solution would be platform-specific.
Is there any other way to handle this?
I can't say a specific solution based on DynamoDB since I've not used this engine ever. But I've built an event driven system on top of MongoDB so I can share my learnings you might find useful for your case.
You can have different approaches:
1) Based on an event sourcing approach you can just save the events/messages your use case produce within a transaction. In Mongo when you are just inserting/appending new items to the same collection you can ensure atomicity. Anyway, if the engine does not provide that capability the query operation is so centralized that you are reducing the possibility of an error at minimum.
Once all the events are stored, you can then consume them and project them to a given state and then persist the updated state in another transaction.
Here you have to deal with eventual consistency as data will be stale in your read model until you have projected the events.
2) Another approach is applying the UnitOfWork pattern where you cache all the query operations (insert/update/delete) to save both events and the state. Once your use case finishes, you execute all the cached queries against the database (flush). This way although the operations are not atomic you are again centralizing them quite enough to minimize errors.
Of course the best is to use an ACID database if you require that capability and any other approach will be a workaround to get close to it.
About publishing the events I don't know if you mean they are published to a messaging transportation mechanism such as rabbitmq, Kafka, etc. But that must be a background process where you fetch the events from the DB and publishes them in order to break the 2 phase commit within the same transaction.

Resources