CQS: Who is responsible for data caching and when? - domain-driven-design

Who should be responsible for caching data from API GET requests into a local data store, and when, in a DDD architecture with CQS-based use cases?
First thing that comes to mind:
Initiate a Query to get some data from local data store and if it is empty, fetch required data from API -> cache it into local data store -> return it
This solution does not seem to follow CQS correctly because Queries should not alter data store (or can they?).
Second thing that comes to mind:
Execute a Command to fetch fresh data from API -> update data store -> raise a data updated event -> event handler listens for data updated events and executes new Query to get fresh data
The second solution seems to follow the CQS pattern better, though I am not sure whether either of these is a correct way of handling data caching in a CQS-based architecture.

The first option isn't, to my mind, any bigger a violation of CQS/CQRS than the second. The query isn't altering authoritative state (e.g. a DDD aggregate), it's just copying it into a cache. It does require cache invalidation.
The second is questionable because a query results in a command. (It is sometimes reasonable to treat a query as a read-only command, provided the query limits itself to a single aggregate, in order to have a stronger consistency guarantee.)
A better approach to my mind is to have a hybrid of the two:
Queries are served from the cache if possible
When commands result in updates to aggregates, events are published (ideally listing the aggregates which changed)
Event handler listens for update events and invalidates cache based on which aggregates have changed.
A further evolution of this would be event sourcing, where the updates are the events and the queries only get served by a read model fed by the events.
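The hybrid approach can be sketched roughly as follows. This is a minimal in-memory illustration, not a definitive implementation: `EventBus`, `CachedQueries`, `rename_item`, the `aggregates_updated` event name and the `source` dictionary (standing in for the API or aggregate repository) are all hypothetical names introduced for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


class EventBus:
    """Tiny synchronous publish/subscribe bus (hypothetical helper)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[List[str]], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[List[str]], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, aggregate_ids: List[str]) -> None:
        for handler in self._handlers[event_type]:
            handler(aggregate_ids)


class CachedQueries:
    """Serves queries from the cache, falling back to the authoritative source."""

    def __init__(self, load: Callable[[str], dict]) -> None:
        self._load = load
        self._cache: Dict[str, dict] = {}
        self.misses = 0  # counts how often the source had to be consulted

    def get(self, aggregate_id: str) -> dict:
        if aggregate_id not in self._cache:
            self.misses += 1
            self._cache[aggregate_id] = self._load(aggregate_id)
        return self._cache[aggregate_id]

    def invalidate(self, aggregate_ids: Iterable[str]) -> None:
        # Event handler: drop only the entries whose aggregates changed.
        for aggregate_id in aggregate_ids:
            self._cache.pop(aggregate_id, None)


# Stand-in for the authoritative store (assumption: in practice an API or repository).
source = {"item-1": {"title": "old title"}}
bus = EventBus()
queries = CachedQueries(lambda aggregate_id: dict(source[aggregate_id]))
bus.subscribe("aggregates_updated", queries.invalidate)


def rename_item(aggregate_id: str, title: str) -> None:
    """Command: change authoritative state, then publish which aggregates changed."""
    source[aggregate_id]["title"] = title
    bus.publish("aggregates_updated", [aggregate_id])
```

Note that the query side never writes to the authoritative store; only the command does, and the cache learns about the change through the event.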

Related

How to implement atomicity in node js transactions

I am working on an application in which the client (Android/ReactJS) clicks a button and five operations take place, say:
add a new field
update the old field
upload a photo
upload some text
delete some old fields.
Sometimes, due to a network issue or some other problem, only some of the operations take place and the DB ends up corrupted. So my question is: how can I make all of these operations a single atomic transaction, i.e. either all of them complete or the ones already done are rolled back? And where should I do this: in the client (ReactJS/Android) or in the backend (Node.js) behind an API? I thought of building an API on the backend (since the backend rarely goes down) that keeps track of the operations already done (statelessly, e.g. using arrays). If the transaction is interrupted, it rolls back all the completed operations. But I found this expensive, and it does not cover the risk of a server error. Can you suggest how I can implement/design this?

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. The number of messages is capped for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter tables. How do I ensure correctness if I cannot batch the operation I am trying to perform together with the counter update? Is the counter type really needed if there is basically no race condition on the column, given that it is associated with an individual user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you could rely on plain ints to do the job.
The problem, however, is that you will need to read before you write, which is an anti-pattern in Cassandra. You could work around this as well, e.g. by skipping the read part: cache your ints, perform the updates in memory, and issue only writes. This is viable if you couple your system with a caching server (e.g. Redis).
And thinking about it, you will still need to read these counters at some point: if the number of messages a free user can send is bounded by some value, then you need to perform a check when they log in, try to send a new message, look at the dashboard, etc., and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to count them directly with a SELECT COUNT... type query, even though this could become pretty inefficient very soon in the Cassandra world.

Choosing a NoSQL database for storing events in a CQRS designed application

I am looking for a good, up to date and "decision helping" explanation on how to choose a NoSQL database engine for storing all the events in a CQRS designed application.
I am currently a newcomer to all things NoSQL (but learning): please be clear and do not hesitate to explain your point of view in an (almost too) precise manner. This post may serve other newcomers like me.
This database will:
Be able to insert 2 to 10 rows per update requested by the front-end view (in my case, updates are frequent). Think of thousands of updates per minute: how would it scale?
Critically need to be consistent and failure safe, since events are the source of truth of the application
Not need any link between entities (like RDBMS does) except maybe a user ID/GUID (I don't know if it's critical or needed yet)
Receive events containing 3 to 10 "columns" (a sequence ID, an event name, a datetime, a JSON/binary encoded parameter bag, some context information...). This is not meant to steer you toward a column-oriented database; a document-oriented one is fine if it fits all the other requirements
Be used as a queue, or be written to/read from an external messaging system like RabbitMQ or ZeroMQ (I haven't worked on that part yet, so arguments/explanations are welcome), since view projections will be built upon events
Need some kind of filtering by sequence ID like SELECT * FROM events WHERE sequence_id > last_sequence_id for subscribers (or queue systems) to be able to synchronize from a given point
I heard of HBase for CQRS event storing, but maybe MongoDB could fit? Or even Elasticsearch (would not bet on that one..)? I'm also open to RDBMS for consistency and availability.. but what about the partition tolerance part..?
Really I'm lost, I need arguments to make a pertinent choice.
https://geteventstore.com/ is a database designed specifically for event streams.
They take consistency and reliability of the source of truth (your events) very seriously and I use it myself to read/write thousands of events a second.
I have a working, in production implementation of MongoDB as an Event store. It is used by a CQRS + Event sourcing web based CRM application.
In order to provide a transaction-like guarantee without transactions when persisting multiple events in one go (all events or none of them), I use a MongoDB document as an events commit, with events as nested documents. As you know, MongoDB has document-level locking, and a write to a single document is atomic.
For concurrency I use optimistic locking, using a version property for each Aggregate stream. An Aggregate stream is identified by the pair (Aggregate class, Aggregate ID).
The event store also stores the commits in relative order using a sequence number that is incremented on each commit and protected using optimistic locking.
Each commit contains the following:
aggregateId : string, probably a GUID,
aggregateClass: string,
version: integer, incremented for each aggregateId x aggregateClass,
sequence: integer, incremented for each commit,
createdAt: UTCDateTime,
authenticatedUserId: string or null,
events: list of EventWithMetadata,
Each EventWithMetadata contains the event class/type and the payload as string (the serialized version of the actual event).
The MongoDB collection has the following indexes:
aggregateId, aggregateClass, version as unique
events.eventClass, sequence
sequence
other indexes for query optimization
These indexes are used to enforce the general event store rules (no events are stored for the same version of an Aggregate) and for query optimizations (the client can select only certain events - by type - from all streams).
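To make the commit structure and the optimistic-locking rule concrete, here is a minimal in-memory sketch. It is not the author's MongoDB code: `InMemoryEventStore`, `Commit` and `ConcurrencyError` are hypothetical names, and a plain version comparison stands in for the unique index on (aggregateId, aggregateClass, version) that would reject the second writer in MongoDB.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


class ConcurrencyError(Exception):
    """Raised when another writer already committed the expected version."""


@dataclass(frozen=True)
class Commit:
    aggregate_id: str
    aggregate_class: str
    version: int          # incremented per (aggregate_id, aggregate_class)
    sequence: int         # incremented per commit, across all streams
    created_at: datetime
    authenticated_user_id: Optional[str]
    events: Tuple[dict, ...]  # all events of one commit live in one record


class InMemoryEventStore:
    def __init__(self) -> None:
        self._commits: List[Commit] = []
        self._streams: Dict[Tuple[str, str], List[Commit]] = {}

    def append(self, aggregate_id: str, aggregate_class: str, expected_version: int,
               events: List[dict], user_id: Optional[str] = None) -> Commit:
        stream = self._streams.setdefault((aggregate_id, aggregate_class), [])
        # Optimistic lock: the caller must have seen the latest version.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}")
        commit = Commit(aggregate_id, aggregate_class, expected_version + 1,
                        len(self._commits) + 1, datetime.now(timezone.utc),
                        user_id, tuple(events))
        stream.append(commit)
        self._commits.append(commit)
        return commit

    def load_stream(self, aggregate_id: str, aggregate_class: str) -> List[dict]:
        """All events of one aggregate, in commit order."""
        stream = self._streams.get((aggregate_id, aggregate_class), [])
        return [event for commit in stream for event in commit.events]

    def read_since(self, last_sequence: int) -> List[Commit]:
        """Commits after a given point, for subscribers catching up."""
        return [c for c in self._commits if c.sequence > last_sequence]
```

A writer that loaded an aggregate at version 0 and tries to append after someone else has already committed version 1 gets a `ConcurrencyError` and has to reload and retry.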
You could use sharding by aggregateId to scale if you drop the global ordering of events (the sequence property) and move that responsibility to an event publisher, but this complicates things, as the event publisher needs to stay synchronized (even in case of failure!) with the event store. I recommend doing it only if you need it.
Benchmarks for this implementation (on Intel I7 with 8GB of RAM):
total aggregate write time was: 7.99, speed: 12516 events written per second
total aggregate read time was: 1.43, speed: 35036 events read per second
total read-model read time was: 3.26, speed: 30679 events read per second
I've noticed that MongoDB was slow on counting the number of events in the event store. I don't know why but I don't care as I don't need this feature.
I recommend using MongoDB as an event store.
I have a .NET Core event sourcing implementation project: https://github.com/jacqueskang/EventSourcing
I started with relational databases (SQL Server and MySQL) using Entity Framework Core.
Then I moved to AWS, so I wrote a DynamoDB extension.
My experience is that a relational DB can do the job perfectly well, but it depends on your requirements and your technical stack. If your project is cloud-based, then the best option is probably the cloud provider's NoSQL database, such as AWS DynamoDB or Azure Cosmos DB, which perform well and provide additional features (e.g. DynamoDB can trigger a notification or a Lambda function).

Handle duplicates in batch POST requests to a REST API

The stack
Express.js API server for CRUD operations over data.
MongoDB database.
Mongoose as the schema interface for MongoDB.
The problem
In order to handle duplicates in just one point, I want to do it in the only possible entry point: The API.
Definition: duplicate
A duplicate is an entity which already exists in the database, so the new POST request is either the same entity with exactly the same data, or the same entity with updated data.
The API design is meant to handle the new http2 protocol.
Bulk importers have been written. These programs get the data from a given source, transform it into our specific format, and make POST requests to save it. The importers are designed to handle every entity in parallel.
The API already has a duplication handler which works great when a given entity already exists in the database. The problem comes when the bulk importers make several POST requests for the same entity at the same time, and the entity doesn't exist in the database yet.
....POST/1 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
......POST/2 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
........POST/3 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
.....................POST/N .databaseCheck.......DataBaseResult=false..........DatabaseWrite
This situation produces the creation of the same entity several times, because the database checks haven't finished when the rest of the POST requests arrive.
Only if the number of POST requests is big enough will the first write operation have already finished, so that the databaseCheck of the Nth request returns true.
What would be the correct solution to handle this?
If I'm not wrong, what I'm looking for has the name of transaction, and I don't know if this is something that the database should offer by default, or if it is something that I have to implement.
Solutions I have already considered:
1. Limit the requests, just one each time.
This is the simplest solution, but if the API remains blocked when the bulk importers make several requests, then the frontend client would get very slow, and it is meant to be fast, and multiplayer. So this, in fact, is not a solution.
2. Special bulk API endpoint for each entity.
If an application needs to make bulk requests, then make just one huge POST request with all the data as body request.
This solution doesn't block the API and can handle duplicates very well, but what I don't like is that it goes against the http2 approach, where many small requests are preferred.
Also, the problem persists: other future clients may hit it if they don't notice that a bulk endpoint is available. But maybe this is not a problem.
3. Try to use the possible MongoDB transaction implementation
I've read a little bit about this, but I don't know if it would be possible to handle this problem with the MongoDB and Mongoose tools. I've done some searching, but I haven't found anything, because before trying to insert many documents I need to generate the data for each document, and that data comes inside each POST request.
4. Drop MongoDB and use a transaction friendly database.
This would have a big cost at this point because the whole stack is already finished and we are close to launch. We aren't afraid of refactoring, but I think the considerations of the 3rd solution would apply here too.
5. Own transactions implementation at the API level?
I've designed a solution that may work for every case, and that I call the pool stream.
This is the design:
When a POST request arrives, a timer of a fixed number of milliseconds starts. That amount of time would be big enough to catch several requests, and small enough not to cause a noticeable delay. The requests caught within one timer window form a chunk.
Inside each chunk of requests, the data is processed to merge duplicates before writing to the database. So if n requests have been caught in a chunk, n - m unique candidates are generated (where m is the number of duplicates, 0 <= m < n). A hash function is applied to each candidate so that the hash result can be assigned to each request-response pair. Then the candidates are written to the database in parallel, and the current duplicates handler still applies at write time.
When the writes for the current chunk finish, the response is sent to each request-response pair of the chunk, and then the next chunk is processed. While a chunk is in the queue waiting for its write operation, it could already go through the unique-candidates process, in order to speed up the whole process.
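The merge step of this design could look roughly like the sketch below. It covers only the deduplication inside one chunk, not the timer or the database writes. The names `merge_chunk` and `candidate_hash` are hypothetical, and so is the assumption that an entity is identified by an `externalId` field and that later requests overwrite earlier fields; the real identity rule depends on the data.

```python
import hashlib
import json
from typing import Dict, List, Sequence, Tuple


def candidate_hash(entity: dict, identity_fields: Sequence[str]) -> str:
    """Hash only the fields that identify an entity (assumed: 'externalId')."""
    identity = {name: entity[name] for name in identity_fields}
    encoded = json.dumps(identity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def merge_chunk(
    bodies: List[dict],
    identity_fields: Sequence[str] = ("externalId",),
) -> Tuple[Dict[str, dict], List[str]]:
    """Collapse the request bodies of one chunk into unique candidates.

    Returns the candidates keyed by hash, plus one hash per request (in
    arrival order) so each request-response pair can find its candidate.
    """
    candidates: Dict[str, dict] = {}
    hash_per_request: List[str] = []
    for body in bodies:
        key = candidate_hash(body, identity_fields)
        # Later requests for the same entity overwrite earlier fields.
        candidates.setdefault(key, {}).update(body)
        hash_per_request.append(key)
    return candidates, hash_per_request
```

With n request bodies in and n - m candidates out, each candidate is written once, and the per-request hash list tells the server which write result to send back on which response.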
What do you think?
Thank you.

Rebuild queries from domain events by multiple aggregates

I'm using a DDD/CQRS/ES approach and I have some questions about modeling my aggregate(s) and queries. As an example consider the following scenario:
A User can create a WorkItem, change its title and associate other users to it. A WorkItem has participants (associated users) and a participant can add Actions to a WorkItem. Participants can execute Actions.
Let's just assume that Users are already created and I only need userIds.
I have the following WorkItem commands:
CreateWorkItem
ChangeTitle
AddParticipant
AddAction
ExecuteAction
These commands must be idempotent, so I can't add the same user or action twice.
And the following query:
WorkItemDetails (all info for a work item)
Queries are updated by handlers that handle domain events raised by WorkItem aggregate(s) (after they're persisted in the EventStore). All these events contain the WorkItemId. I would like to be able to rebuild the queries on the fly, if needed, by loading all the relevant events and processing them in sequence. This is because my users usually won't access WorkItems created one year ago, so I don't need to have these queries processed. So when I fetch a query that doesn't exist, I could rebuild it and store it in a key/value store with a TTL.
Domain events have an aggregateId (used as the event streamId and shard key) and a sequenceId (used as the eventId within an event stream).
So my first attempt was to create a large Aggregate called WorkItem that had a collection of participants and a collection of actions. Participant and Actions are entities that live only within a WorkItem. A participant references a userId and an action references a participantId. They can have more information, but it's not relevant for this exercise. With this solution my large WorkItem aggregate can ensure that the commands are idempotent because I can validate that I don't add duplicate participants or actions, and if I want to rebuild the WorkItemDetails query, I just load/process all the events for a given WorkItemId.
This works fine because since I only have one aggregate, the WorkItemId can be the aggregateId, so when I rebuild the query I just load all events for a given WorkItemId.
However, this solution has the performance issues of a large Aggregate (why load all participants and actions to process a ChangeTitle command?).
So my next attempt is to have different aggregates, all with the same WorkItemId as a property but only the WorkItem aggregate has it as an aggregateId. This fixes the performance issues, I can update the query because all events contain the WorkItemId but now my problem is that I can't rebuild it from scratch because I don't know the aggregateIds for the other aggregates, so I can't load their event streams and process them. They have a WorkItemId property but that's not their real aggregateId. Also I can't guarantee that I process events sequentially, because each aggregate will have its own event stream, but I'm not sure if that's a real problem.
Another solution I can think of is to have a dedicated event stream to consolidate all WorkItem events raised by the multiple aggregates. So I could have event handlers that simply append the events fired by the Participant and Action aggregates to an event stream whose id would be something like "{workItemId}:allevents". This would be used only to rebuild the WorkItemDetails query. This sounds like a hack: basically I'm creating an "aggregate" that has no business operations.
What other solutions do I have? Is it uncommon to rebuild queries on the fly? Can it be done when events for multiple aggregates (multiple event streams) are used to build the same query? I've searched for this scenario and haven't found anything useful. I feel like I'm missing something that should be very obvious, but I haven't figured what.
Any help on this is very much appreciated.
Thanks
I don't think you should design your aggregates with querying concerns in mind. The Read side is here for that.
On the domain side, focus on consistency concerns (how small can the aggregate be and the domain still remain consistent in a single transaction?), concurrency (how big can it be and not suffer concurrent access problems / race conditions?) and performance (would we load thousands of objects in memory just to perform a simple command? -- exactly what you were asking).
I don't see anything wrong with on-demand read models. It's basically the same as reading from a live stream, except you re-create the stream when you need it. However, this might be quite a lot of work for not an extraordinary gain, because most of the time, entities are queried just after they are modified. If on-demand becomes "basically every time the entity changes", you might as well subscribe to live changes. As for "old" views, the definition of "old" is that they are not modified any more, so they don't need to be recalculated anyway, regardless of whether you have an on-demand or continuous system.
If you go the multiple small aggregates route and your Read Model needs information from several sources to update itself, you have a couple of options:
Enrich emitted events with additional data
Read from multiple event streams and consolidate their data to build the read model. No magic here, the Read side needs to know which aggregates are involved in a particular projection. You could also query other Read Models if you know they are up-to-date and will give you just the data you need.
See "CQRS events do not contain details needed for updating read model".
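As an illustration of the second option, here is a minimal sketch of a projection that reads several event streams and consolidates them into one read model. Everything here is an assumption made for the example: the event type names (`WorkItemCreated`, `TitleChanged`, `ParticipantAdded`, `ActionAdded`, `ActionExecuted`) are derived from the commands in the question, `load_stream` is a hypothetical function returning the events of one stream in order, and the read side is assumed to already know which stream ids belong to the work item.

```python
from typing import Callable, Iterable, List


def apply_event(details: dict, event: dict) -> None:
    """Fold one event into the WorkItemDetails read model."""
    kind = event["type"]
    if kind in ("WorkItemCreated", "TitleChanged"):
        details["title"] = event["title"]
    elif kind == "ParticipantAdded":
        if event["userId"] not in details["participants"]:
            details["participants"].append(event["userId"])
    elif kind == "ActionAdded":
        details["actions"].setdefault(event["actionId"], {"executed": False})
    elif kind == "ActionExecuted":
        # setdefault tolerates this stream being read before the "added" one.
        details["actions"].setdefault(event["actionId"], {"executed": False})["executed"] = True


def build_work_item_details(
    work_item_id: str,
    stream_ids: Iterable[str],
    load_stream: Callable[[str], List[dict]],
) -> dict:
    """Rebuild the read model on demand from every stream involved."""
    details = {"id": work_item_id, "title": None, "participants": [], "actions": {}}
    for stream_id in stream_ids:
        # Order is preserved within a stream; nothing is assumed across streams.
        for event in load_stream(stream_id):
            apply_event(details, event)
    return details
```

The handlers are written so that the result does not depend on the order in which the streams are read, which sidesteps the lack of a global ordering across aggregates for this particular projection.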
