What data should be contained in the event? Only data that is specific to this event, or some data from the bounded context too?
For example, I have an account with domain and name properties:
account(id, name, domain)
When I change the account name, a NameChanged(id, name) event is created. But when this event is used for a read-side projection (Cassandra DB) I need to fill two tables (the example does not use a materialized view):
accounts(id, name, domain) (primary key only `id`)
accountsByDomain(domain, id, name) (primary key contains `domain` and `id`)
The second table cannot be updated from the event alone, because there is no domain in the event.
Question: should the event be as simple as possible (with the projection calling into the state of the entity to get information that might have been different at the time the event occurred), or should it carry complete information for the read-side projection?
We aren't usually limited to processing an event in isolation - the identifiers in the event are available to allow us to look up the other information we need (which could, for example, be included in other events in the same stream).
Reviewing Greg Young's talk on Polyglot Data may help clarify this idea.
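As an illustration, a minimal sketch of such a projection handler (accountsTable and accountsByDomainTable are hypothetical data-access helpers): the NameChanged event carries only id and name, so the handler reads the existing accounts row to recover the domain before updating the second table.

void on(NameChanged event) {
    AccountRow current = accountsTable.findById(event.id());                      // accounts(id, name, domain)
    accountsTable.updateName(event.id(), event.name());                           // keyed by id
    accountsByDomainTable.updateName(current.domain(), event.id(), event.name()); // keyed by (domain, id)
}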
TL;DR
How can you find "unreachable keys" in a key/value store with a large amount of
data?
Background
In comparison to relational databases that provide ACID guarantees, NoSQL
key/value databases provide fewer guarantees in order to handle "big data".
For example, they only provide atomicity in the context of a single key/value
pair, but they use techniques like distributed hash tables to "shard" the data
across an arbitrarily large cluster of machines.
Keys are often unfriendly for humans. For example, a key for a blob of data
representing an employee might be
Employee:39045e87-6c00-47a4-a683-7aba4354c44a. The employee might also have a
more human-friendly identifier, such as the username jdoe with which the
employee signs in to the system. This username would be stored as a separate
key/value pair, where the key might be EmployeeUsername:jdoe. The value for
key EmployeeUsername:jdoe is typically either an array of strings containing
the main key (think of it like a secondary index, which does not necessarily
contain unique values) or a denormalised version of the employee blob (perhaps
aggregating data from other objects in order to improve query performance).
Problem
Now, given that key/value databases do not usually provide transactional
guarantees, what happens when a process inserts the key
Employee:39045e87-6c00-47a4-a683-7aba4354c44a (along with the serialized
representation of the employee) but crashes before inserting the
EmployeeUsername:jdoe key? The client does not know the key for the employee
data - he or she only knows the username jdoe - so how do you find the
Employee:39045e87-6c00-47a4-a683-7aba4354c44a key?
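For concreteness, a minimal sketch of the two independent writes (kvStore and serialize are hypothetical helpers) and the crash window between them:

String employeeKey = "Employee:" + uuid;
kvStore.put(employeeKey, serialize(employee));              // write 1: the employee blob
// <-- a crash here means no EmployeeUsername:jdoe entry ever points at the blob
kvStore.put("EmployeeUsername:" + username, employeeKey);   // write 2: secondary index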
The only thing I can think of is to enumerate the keys in the key/value store
and once you find the appropriate key, "resume" the indexing/denormalisation.
I'm well aware of techniques like event sourcing, where an idempotent event
handler could respond to the event (e.g., EmployeeRegistered) in order to
recreate the username-to-employee-uuid secondary index, but using event
sourcing over a key/value store still requires enumeration of keys, which could
degrade performance.
Analogy
The more experience I have in IT, the more I see the same problems being
tackled in different scenarios. For example, Linux filesystems store both file
and directory contents in "inodes". You can think of these as key/value pairs,
where the key is an integer and the value is the file/directory contents. When
writing a new file, the system creates an inode and fills it with data THEN
modifies the parent directory to add the "filename-to-inode" mapping. If the
system crashes after creating the file but before referencing it in the parent
directory, your file "exists on disk" but is essentially unreadable. When the
system comes back online, hopefully it will place this file into the
"lost+found" directory (I imagine it does this by scanning the entire disk).
There are plenty of other examples (such as domain name to IP address mappings
in the DNS system), but I specifically want to know how the above problem is
tackled in NoSQL key/value databases.
EDIT
I found this interesting article on manual secondary indexes, but it doesn't cover how to deal with "broken" or "dated" secondary indexes.
The solution I've come up with is to use a process manager (or "saga"),
whose key contains the username. This also guarantees uniqueness across
employees during registration. (Note that I'm using a key/value store
with compare-and-swap (CAS) semantics for concurrency control.)
Create an EmployeeRegistrationProcess with a key of
EmployeeRegistrationProcess:jdoe.
If a concurrency error occurs (i.e., the registration process
already exists) then this is a duplicate username.
When started, the EmployeeRegistrationProcess allocates an
employee UUID. The EmployeeRegistrationProcess attempts to create
an Employee object using this UUID (e.g.,
Employee:39045e87-6c00-47a4-a683-7aba4354c44a).
If the system crashes after starting the
EmployeeRegistrationProcess but before creating the Employee, we
can still locate the "employee" (or more accurately, the employee
registration process) by the username "jdoe". We can then resume the
"transaction".
If there is a concurrency error (i.e., the Employee with the
generated UUID already exists), the RegistrationProcess can flag
itself as being "in error" or "for review" or whatever process we
decide is best.
After the Employee has successfully been created, the
EmployeeRegistrationProcess creates the secondary index
EmployeeUsernameToUuid:jdoe ->
39045e87-6c00-47a4-a683-7aba4354c44a.
Again, if this fails, we can still locate the "employee" by the
username "jdoe" and resume the transaction.
And again, if there is a concurrency error (i.e., the
EmployeeUsernameToUuid:jdoe key already exists), the
EmployeeRegistrationProcess can take appropriate action.
When both commands have succeeded (the creation of the Employee and
the creation of the secondary index), the
EmployeeRegistrationProcess can be deleted.
At all stages of the process, Employee (or
EmployeeRegistrationProcess) is reachable via its human-friendly
identifier "jdoe". Event sourcing the EmployeeRegistrationProcess is
optional.
Note that using a process manager can also help in enforcing uniqueness
across usernames after registration. That is, we can create an
EmployeeUsernameChangeProcess object with a key containing the new
username. "Enforcing" uniqueness at either registration or username
change hurts scalability, so the value identified by
"EmployeeUsernameToUuid:jdoe" could be an array of employee UUIDs.
If you look at the question from the point of view of event-sourced entities, then the responsibility of the event store is to guarantee that an event is saved to storage and published to the bus. From this point of view it is guaranteed that the event will be written completely, and since the database is append-only, there will never be a problem with an invalid, partially written event.
At the same time, it is of course not guaranteed that every command which generates events will be executed successfully - you can only guarantee ordering and protection against repeated execution of the same command, not the whole transaction.
The rest works as follows: the saga intercepts the original command and tries to execute the whole transaction. If any part of the transaction fails, or, for example, does not complete within a preset time, the process is rolled back by generating so-called compensating events. Such events cannot delete an entity, but they bring the system to a consistent state, as if the command had never been executed.
Note: if your specific event database implementation is arranged so that only a single key/value pair can be written atomically, just serialize the event and use the combination of the aggregate root's identifier and version as the key. The aggregate version then acts somewhat like a CAS operation.
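A small sketch of that idea (the kvStore client and its putIfAbsent call are assumptions), using the aggregate id plus expected version as the key so a conditional write behaves like CAS:

// The key encodes aggregate id and expected version, so two writers racing on
// the same version collide instead of silently overwriting each other.
String key = aggregateId + ":" + expectedVersion;        // e.g. "account-42:17"
boolean committed = kvStore.putIfAbsent(key, serialize(event));
if (!committed) {
    // another command already wrote this version: reload the aggregate and retry, or reject
}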
About concurrency you can read this article: http://danielwhittaker.me/2014/09/29/handling-concurrency-issues-cqrs-event-sourced-system/
I am looking for a good, up to date and "decision helping" explanation on how to choose a NoSQL database engine for storing all the events in a CQRS designed application.
I am currently a newcomer to all things NoSQL (but learning): please be clear and do not hesitate to explain your point of view in an (almost overly) precise manner. This post may also serve other newcomers like me.
This database will:
Be able to insert 2 to 10 rows per update issued by the front view (in my case, updates are frequent). Think of thousands of updates per minute; how would it scale?
Critically need to be consistent and failure safe, since events are the source of truth of the application
Not need any link between entities (like RDBMS does) except maybe a user ID/GUID (I don't know if it's critical or needed yet)
Receive events containing 3 to 10 "columns" (a sequence ID, an event name, a datetime, a JSON/binary encoded parameter bag, some context information...). Without steering you towards a column-oriented type of database, it may be document-oriented if it fits all the other requirements
Be used as a queue or sent to/read from an external AMQP system like RabbitMQ or ZeroMQ (I haven't worked on that part yet; if you could also argue/explain..) since view projections will be built upon events
Need some kind of filtering by sequence ID like SELECT * FROM events WHERE sequence_id > last_sequence_id for subscribers (or queue systems) to be able to synchronize from a given point
I heard of HBase for CQRS event storing, but maybe MongoDB could fit? Or even Elasticsearch (would not bet on that one..)? I'm also open to RDBMS for consistency and availability.. but what about the partition tolerance part..?
Really I'm lost, I need arguments to make a pertinent choice.
https://geteventstore.com/ is a database designed specifically for event streams.
They take consistency and reliability of the source of truth (your events) very seriously and I use it myself to read/write thousands of events a second.
I have a working, in production implementation of MongoDB as an Event store. It is used by a CQRS + Event sourcing web based CRM application.
In order to provide a transaction-like guarantee without transactions for persisting multiple events in one go (all events or none of them), I use a MongoDB document as an events commit, with the events as nested documents. As you know, MongoDB has document-level locking.
For concurrency I use optimistic locking, using a version property for each Aggregate stream. An Aggregate stream is identified by the pair (Aggregate class x Aggregate ID).
The event store also stores the commits in relative order using a sequence, incremented on each commit and protected by optimistic locking.
Each commit contains the following:
aggregateId : string, probably a GUID,
aggregateClass: string,
version: integer, incremented for each aggregateId x aggregateClass,
sequence: integer, incremented for each commit,
createdAt: UTCDateTime,
authenticatedUserId: string or null,
events: list of EventWithMetadata,
Each EventWithMetadata contains the event class/type and the payload as string (the serialized version of the actual event).
The MongoDB collection has the following indexes:
aggregateId, aggregateClass, version as unique
events.eventClass, sequence
sequence
other indexes for query optimization
These indexes are used to enforce the general event store rules (no events are stored for the same version of an Aggregate) and for query optimizations (the client can select only certain events - by type - from all streams).
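For illustration, a sketch of one commit document and the unique index with the MongoDB Java driver (variable and helper names such as nextSequence are assumptions; the field names follow the list above):

// One commit document holds all events of a command; MongoDB writes a single
// document atomically, which gives the all-or-nothing behaviour described above.
Document commit = new Document("aggregateId", aggregateId)
        .append("aggregateClass", aggregateClass)
        .append("version", version)             // incremented per aggregate stream
        .append("sequence", nextSequence())     // global order, guarded by optimistic locking
        .append("createdAt", new Date())
        .append("authenticatedUserId", authenticatedUserId)
        .append("events", eventsWithMetadata);  // list of {eventClass, payload}
commits.insertOne(commit);

// Unique index enforcing "no two commits for the same version of an Aggregate".
commits.createIndex(
        Indexes.ascending("aggregateId", "aggregateClass", "version"),
        new IndexOptions().unique(true));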
You could use sharding by aggregateId to scale, if you strip the global ordering of events (the sequence property) and move that responsibility to an event publisher, but this complicates things, as the event publisher needs to stay synchronized (even in case of failure!) with the event store. I recommend doing this only if you need it.
Benchmarks for this implementation (on Intel I7 with 8GB of RAM):
total aggregate write time was: 7.99, speed: 12516 events written per second
total aggregate read time was: 1.43, speed: 35036 events read per second
total read-model read time was: 3.26, speed: 30679 events read per second
I've noticed that MongoDB was slow on counting the number of events in the event store. I don't know why but I don't care as I don't need this feature.
I recommend using MongoDB as an event store.
I have a .NET Core event sourcing implementation project: https://github.com/jacqueskang/EventSourcing
I started with relational databases (SQL Server and MySQL) using Entity Framework Core.
Then I moved to AWS, so I wrote a DynamoDB extension.
My experience is that a relational DB can do the job perfectly, but it depends on your requirements and technical stack. If your project is cloud based then the best option is probably your cloud provider's NoSQL database, like AWS DynamoDB or Azure Cosmos DB, which are powerful in performance and provide additional features (e.g. DynamoDB can trigger a notification or a Lambda function).
Event sourcing works perfectly when we have a particular unique EntityID, but when I try to get information from the event store by anything other than a particular EntityId, I am having a tough time.
I am using CQRS with event sourcing. As part of event sourcing we are storing the events in a SQL table with columns (EntityID (unique key), EventType, EventObject (e.g. UserAdded)).
When storing an EventObject we just serialize the .NET object and store it in SQL, so all the details related to the UserAdded event end up in XML format. My concern is that I want to make sure the userName present in the DB is unique.
So, before issuing the AddUser command I have to query the event store (the SQL DB) to check whether the particular userName is already present. To do that I need to deserialize all the UserAdded/UserEdited events in the event store and check whether the requested username is present.
But in CQRS, commands are not allowed to query, perhaps because of race conditions.
So I tried, before sending the AddUser command, to query the event store, deserialize all the UserAdded events to fetch the usernames, and, if the requested username is unique, send the command; otherwise throw an exception that the userName already exists.
With the above approach we need to query the entire DB, and we may have hundreds of thousands of events per day, so the query/deserialization will take a long time and lead to performance issues.
I am looking for a better approach/suggestion for keeping the username unique, either by getting all userNames from the event store or by any other approach.
So, your client (the thing that issues the commands) should have full faith that the command it sends will be executed, and it must do this by ensuring, before it sends the RegisterUserCommand, that no other user is registered with that email address. In other words, your client must perform the validation, not your domain or even the application services that surround the domain.
From http://cqrs.nu/Faq
This is a commonly occurring question since we're explicitly not
performing cross-aggregate operations on the write side. We do,
however, have a number of options:
Create a read-side of already allocated user names. Make the client
query the read-side interactively as the user types in a name.
Create a reactive saga to flag down and inactivate accounts that were
nevertheless created with a duplicate user name. (Whether by extreme
coincidence or maliciously or because of a faulty client.)
If eventual consistency is not fast enough for you, consider adding a
table on the write side, a small local read-side as it were, of
already allocated names. Make the aggregate transaction include
inserting into that table.
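A sketch of that last option, assuming a relational write side and plain JDBC (table and column names are made up): reserve the user name and append the event in the same transaction, and let a unique constraint on user_names.name reject duplicates.

try (Connection cn = dataSource.getConnection()) {
    cn.setAutoCommit(false);
    try (PreparedStatement reserve = cn.prepareStatement(
                 "INSERT INTO user_names (name, aggregate_id) VALUES (?, ?)");
         PreparedStatement append = cn.prepareStatement(
                 "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)")) {
        reserve.setString(1, userName);
        reserve.setString(2, aggregateId);
        reserve.executeUpdate();                 // fails here if the name is already taken
        append.setString(1, aggregateId);
        append.setInt(2, version);
        append.setString(3, serializedEvent);
        append.executeUpdate();
        cn.commit();
    } catch (SQLException e) {
        cn.rollback();                           // duplicate name or other failure
        throw e;
    }
}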
Querying different aggregates with a repository in a write operation as part of your business logic is not forbidden. You can do that in order to accept the command or reject it due to a duplicate user by using some domain service (a cross-aggregate operation). Greg Young mentions this here: https://www.youtube.com/watch?v=LDW0QWie21s&t=24m55s
In normal scenarios you would just need to query all the UserCreated + UserEdited events.
If you expect to have thousands of these events per day, maybe your events are bloated and you should design more atomically. For example, instead of having a UserEdited event raised every time something happens to a user, consider having UserPersonalDetailsEdited and UserAccessInfoEdited or similar, where the fields that must be unique are treated differently from the rest of the user fields. That way, querying all the UserCreated + UserAccessInfoEdited events prior to accepting or rejecting a command would be a lighter operation.
Personally I'd go with the following approach:
More atomicity in events so that everything that touches fields that should be globally unique is described more explicitly (e.g: UserCreated, UserAccessInfoEdited)
Have projections available in the write side in order to query them during a write operation. So for example I'd subscribe to all UserCreated and UserAccessInfoEdited events in order to keep a queryable "table" with all the unique fields (e.g: email).
When a CreateUser command arrives in the domain, a domain service would query this email table and accept or reject the command.
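A minimal sketch of such a domain service (all names are illustrative), consulting the write-side email projection before the aggregate raises UserCreated:

// The projection is kept up to date from UserCreated / UserAccessInfoEdited events.
class UniqueEmailService {
    private final EmailProjection emails;

    UniqueEmailService(EmailProjection emails) { this.emails = emails; }

    void ensureAvailable(String email) {
        if (emails.isTaken(email)) {
            throw new EmailAlreadyInUseException(email);
        }
    }
}

// In the CreateUser command handler:
uniqueEmailService.ensureAvailable(command.email());   // reject before raising UserCreated
User user = User.create(command);                      // otherwise proceed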
This solution relies a bit on eventual consistency, and there is a possibility that the query tells us the field has not been used and allows the command to succeed, raising a UserCreated event, when actually the projection had not yet been updated from a previous transaction, therefore causing a situation where two fields in the system are not globally unique.
If you want to completely avoid these uncertain situations because your business can't really deal with eventual consistency, my recommendation is to deal with this in your domain by explicitly modeling it as part of your ubiquitous language. For example, you could model your aggregates differently, since it's obvious that your aggregate User is not really your transactional boundary (i.e. it depends on others).
As often, there's no right answer, only answers that fit your domain.
Are you in an environment that really requires immediate consistency? What would be the odds of an identical user name being created between the moment uniqueness is checked by querying (say, at client side) and when the command is processed? Would your domain experts tolerate, for instance, one in a million user name conflicts (which can be compensated afterwards)? Will you have a million users in the first place?
Even if immediate consistency is required, "user names should be unique"... in which scope? A Company? An OnlineStore? A GameServerInstance? Can you find the most restricted scope in which the uniqueness constraint must hold and make that scope the Aggregate Root from which to sprout a new user? Why would the "replay all the UserAdded/UserEdited events" solution be bad after all, if the Aggregate Root keeps these events small and simple?
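As a minimal sketch of that last idea (names are purely illustrative): if the uniqueness scope is, say, a Company, the rule becomes an ordinary invariant inside one aggregate.

class Company {
    private final Set<String> userNames = new HashSet<>();

    UserAdded addUser(String userName) {
        if (!userNames.add(userName)) {
            throw new DuplicateUserNameException(userName);
        }
        return new UserAdded(userName);   // small, simple event, cheap to replay
    }
}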
With GetEventStore (from Greg Young) you can use any string as your aggregateId/StreamId. Use the username as the id of the aggregate instead of GUIDs, or a combination like "mycompany.users.john" as the key, and... voilà! You get user name uniqueness for free!
I'm using a DDD/CQRS/ES approach and I have some questions about modeling my aggregate(s) and queries. As an example consider the following scenario:
A User can create a WorkItem, change its title and associate other users to it. A WorkItem has participants (associated users) and a participant can add Actions to a WorkItem. Participants can execute Actions.
Let's just assume that Users are already created and I only need userIds.
I have the following WorkItem commands:
CreateWorkItem
ChangeTitle
AddParticipant
AddAction
ExecuteAction
These commands must be idempotent, so I can't add the same user or action twice.
And the following query:
WorkItemDetails (all info for a work item)
Queries are updated by handlers that handle domain events raised by WorkItem aggregate(s) (after they're persisted in the EventStore). All these events contain the WorkItemId. I would like to be able to rebuild the queries on the fly, if needed, by loading all the relevant events and processing them in sequence. This is because my users usually won't access WorkItems created one year ago, so I don't need to have these queries processed. So when I fetch a query that doesn't exist, I could rebuild it and store it in a key/value store with a TTL.
Domain events have an aggregateId (used as the event streamId and shard key) and a sequenceId (used as the eventId within an event stream).
So my first attempt was to create a large Aggregate called WorkItem that had a collection of participants and a collection of actions. Participant and Actions are entities that live only within a WorkItem. A participant references a userId and an action references a participantId. They can have more information, but it's not relevant for this exercise. With this solution my large WorkItem aggregate can ensure that the commands are idempotent because I can validate that I don't add duplicate participants or actions, and if I want to rebuild the WorkItemDetails query, I just load/process all the events for a given WorkItemId.
This works fine because since I only have one aggregate, the WorkItemId can be the aggregateId, so when I rebuild the query I just load all events for a given WorkItemId.
However, this solution has the performance issues of a large Aggregate (why load all participants and actions to process a ChangeTitle command?).
So my next attempt is to have different aggregates, all with the same WorkItemId as a property, but only the WorkItem aggregate has it as its aggregateId. This fixes the performance issues, and I can still update the query because all events contain the WorkItemId, but now my problem is that I can't rebuild it from scratch, because I don't know the aggregateIds of the other aggregates, so I can't load their event streams and process them. They have a WorkItemId property, but that's not their real aggregateId. Also, I can't guarantee that I process events sequentially, because each aggregate will have its own event stream, but I'm not sure if that's a real problem.
Another solution I can think of is to have a dedicated event stream that consolidates all WorkItem events raised by the multiple aggregates. So I could have event handlers that simply append the events fired by the Participants and Actions to an event stream whose id would be something like "{workItemId}:allevents". This would be used only to rebuild the WorkItemDetails query. This sounds like a hack... basically I'm creating an "aggregate" that has no business operations.
What other solutions do I have? Is it uncommon to rebuild queries on the fly? Can it be done when events for multiple aggregates (multiple event streams) are used to build the same query? I've searched for this scenario and haven't found anything useful. I feel like I'm missing something that should be very obvious, but I haven't figured what.
Any help on this is very much appreciated.
Thanks
I don't think you should design your aggregates with querying concerns in mind. The Read side is here for that.
On the domain side, focus on consistency concerns (how small can the aggregate be and the domain still remain consistent in a single transaction?), concurrency (how big can it be and not suffer concurrent access problems / race conditions?) and performance (would we load thousands of objects in memory just to perform a simple command? -- exactly what you were asking).
I don't see anything wrong with on-demand read models. It's basically the same as reading from a live stream, except you re-create the stream when you need it. However, this might be quite a lot of work for not an extraordinary gain, because most of the time, entities are queried just after they are modified. If on-demand becomes "basically every time the entity changes", you might as well subscribe to live changes. As for "old" views, the definition of "old" is that they are not modified any more, so they don't need to be recalculated anyway, regardless of whether you have an on-demand or continuous system.
If you go the multiple small aggregates route and your Read Model needs information from several sources to update itself, you have a couple of options:
Enrich emitted events with additional data
Read from multiple event streams and consolidate their data to build the read model. No magic here, the Read side needs to know which aggregates are involved in a particular projection. You could also query other Read Models if you know they are up-to-date and will give you just the data you need. A rough sketch of this consolidation is shown below.
See CQRS events do not contain details needed for updating read model
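A rough sketch of the second option (all names, and the readModelIndex that maps a WorkItemId to its related streams, are assumptions): read every stream that contributes to the view, order the events globally, and fold them into the read model.

WorkItemDetails rebuild(String workItemId) {
    List<DomainEvent> events = new ArrayList<>();
    events.addAll(eventStore.readStream("workitem-" + workItemId));
    for (String streamId : readModelIndex.streamsFor(workItemId)) {
        events.addAll(eventStore.readStream(streamId));
    }
    events.sort(Comparator.comparing(DomainEvent::timestamp));   // or a global sequence
    WorkItemDetails details = new WorkItemDetails(workItemId);
    events.forEach(details::apply);                              // each apply updates the view
    return details;
}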
I am still trying to wrap my head around how to apply DDD and, most recently, CQRS to a real production business application. In my case, I am working on an inventory management system. It runs as a server-based application exposed via a REST API to several client applications. My focus has been on the domain layer with the API and clients to follow.
The command side of the domain is used to create a new Order and allows modifications, cancellation, marking an Order as fulfilled and shipped/completed. I, of course, have a query that returns a list of orders in the system (as read-only, lightweight DTOs) from the repository. Another query returns a PickList used by warehouse employees to pull items from the shelves to fulfill specific orders. In order to create the PickList, there are calculations, rules, etc that must be evaluated to determine which orders are ready to be fulfilled. For example, if all order line items are in stock. I need to read the same list of orders, iterate over the list and apply those rules and calculations to determine which items should be included in the PickList.
This is not a simple query, so how does it fit into the model?
UPDATE
While I may be able to maintain (store) a set of PickLists, they really are dynamic until an employee retrieves the next PickList. Consider the following scenario:
The first Order of the day is received. I can raise a domain event that triggers an AssemblePickListCommand which applies all of the rules and logic to create one or more PickLists for that Order.
A second Order is received. The event handler should now REPLACE the original PickLists with one or more new PickLists optimized across both pending Orders.
Likewise after a third Order is received.
Let's assume we now have two PickLists in the 'queue' because the optimization rules split the lists because components are at opposite ends of the warehouse.
Warehouse employee #1 requests a PickList. The first PickList is pulled and printed.
A fourth Order is received. As before, the handler removes the second PickList from the queue (the only one remaining) and regenerates one or more PickLists based on the second PickList and the new Order.
The PickList 'assembler' will repeat this logic whenever a new Order is received.
My issue with this is that a request must either block while the PickList queue is being updated, or I have an eventual consistency issue that goes against the behavior the customer wants. Each time they request a PickList, they want it optimized based on all of the Orders received up to that point in time.
While I may be able to maintain (store) a set of PickLists, they really are dynamic until an employee retrieves the next PickList. Consider the following scenario:
The first Order of the day is received. I can raise a domain event that triggers an AssemblePickListCommand which applies all of the rules and logic to create one or more PickLists for that Order.
A second Order is received. The event handler should now REPLACE the original PickLists with one or more new PickLists optimized across both pending Orders.
This sounds to me like you are getting tangled trying to use a language that doesn't actually match the domain you are working in.
In particular, I don't believe that you would be having these modeling problems if the PickList "queue" was a real thing. I think instead there is an OrderItem collection that lives inside some aggregate, you issue commands to that aggregate to generate a PickList.
That is, I would expect a flow that looks like
onOrderPlaced(List<OrderItems> items)
    warehouse.reserveItems(List<OrderItems> items)
        // At this point, the items are copied into an unassigned
        // items collection. In other words, the aggregate knows
        // that the items have been ordered, and are not currently
        // assigned to any picklist
        fire(ItemsReserved(items));

onPickListRequested(Id<Employee> employee)
    warehouse.assignPickList(Id<Employee> employee, PickListOptimizer optimizer)
        // PickListOptimizer is your calculation, rules, etc. that know how
        // to choose the right items to put into the next pick list from a
        // given collection of unassigned items. This is a stateless
        // *domain service* -- it provides the query that the warehouse aggregate needs
        // to figure out the right change to make, but it *doesn't* change
        // the state of the aggregate -- that's the aggregate's responsibility
        List<OrderItems> pickedItems = optimizer.chooseItems(this.unassignedItems);
        this.unassignedItems.removeAll(pickedItems);
        // This mockup assumes we can consider PickLists to be entities
        // within the warehouse aggregate. You'd need some additional
        // events if you wanted the PickList to have its own aggregate
        Id<PickList> id = PickList.createId(...);
        this.pickLists.put(id, new PickList(id, employee, pickedItems));
        fire(PickListAssigned(id, employee, pickedItems));

onPickListCompleted(Id<PickList> pickList)
    warehouse.closePicklist(Id<PickList> pickList)
        this.pickLists.remove(pickList);
        fire(PickListClosed(pickList));

onPickListAbandoned(Id<PickList> pickList)
    warehouse.reassign(Id<PickList> pickList)
        PickList list = this.pickLists.remove(pickList);
        this.unassignedItems.addAll(list.pickedItems);
        fire(ItemsReassigned(list.pickedItems));
Not great languaging -- I don't speak warehouse. But it covers most of your points: each time a new PickList is generated, it's being built from the latest state of pending items in the warehouse.
There's some contention - you can't assign items to a pick list AND change the unassigned items at the same time. Those are two different writes to the same aggregate, and I don't think you are going to get around that as long as the client insists upon a perfectly optimized picklist each time. It might be worthwhile to sit down with the domain experts and explore the real cost to the business if the second-best pick list is assigned from time to time. After all, there's already latency between placing the order and its arrival at the warehouse....
I don't really see what your specific question is. But the first thing that comes to mind is that pick list creation is not just a query but a full blown business concept that should be explicitly modeled. It then could be created with AssemblePicklist command for instance.
You seem to have two roles/processes and possibly also two aggregate roots - salesperson works with orders, warehouse worker with picklists.
AssemblePicklistsCommand() is triggered from order processing and recreates all currently unassigned picklists.
Warehouse worker fires an AssignPicklistCommand(userid) which tries to choose the most appropriate unassigned picklist and assign it to him (or does nothing if he already has an active picklist). He could then use GetActivePicklistQuery(userid) to get the picklist, pick items with PickPicklistItemCommand(picklistid, item, quantity) and finally MarkPicklistCompleteCommand() to signal that he's done.
AssemblePicklist and AssignPicklist should block each other (serial processing, optimistic concurrency?) but the relation between AssignPicklist and GetActivePicklist is clean - either you have a picklist assigned or you don't.
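As a rough sketch of that serialization (the repository API and names are assumptions): both command handlers load and save the same "unassigned picklists" aggregate, so an expected-version check on save makes them effectively serial.

void handle(AssignPicklistCommand cmd) {
    UnassignedPicklists aggregate = repository.load(UnassignedPicklists.ID);
    long expectedVersion = aggregate.version();
    Picklist picklist = aggregate.assignTo(cmd.userId());   // no-op if the worker already has one
    repository.save(aggregate, expectedVersion);            // fails if AssemblePicklists ran concurrently
}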