ID duplication when persisting events (event sourcing) - domain-driven-design

While applying domain-driven design I read about event sourcing, which saves a stream of events. The event table in the database has these columns:
EventID, EventDate, AggregateId, EventData
I can save product, category and order events in this table. But the AggregateId may be duplicated across aggregate types; in that situation I would read an order event back as a product event.
How can I prevent this duplication of IDs?

By definition, an Aggregate Root has a globally unique identifier. If you are doing DDD (and I assume you are since you tagged the question with DDD), and you are using event sourcing to capture an Aggregate Root's event stream, then you will need to find a way to ensure uniqueness across different aggregate types.
You can either generate a GUID or use some sort of composite key as others have suggested.

Can't you just make the AggregateID a composite key of type and id, so you can have Product:123456 and Category:123456? Or you could add another column for aggregate type if that's a better solution for you.
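For illustration, a minimal sketch of that composite-key idea (the helper name is hypothetical):

    def aggregate_key(aggregate_type: str, aggregate_id: int) -> str:
        """Build a composite stream key such as 'Product:123456' or 'Category:123456'."""
        return f"{aggregate_type}:{aggregate_id}"

    # The same numeric id no longer collides across aggregate types.
    assert aggregate_key("Product", 123456) != aggregate_key("Category", 123456)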

In addition to using a client side generated GUID, I would recommend adding a few columns to the table.
EventID, EventType, EventDate, AggregateType, AggregateId, MetaData, EventData
I would add the event type and the aggregate type for several reasons.
They will support rehydration of the underlying types, allow additional indexes on the table, and represent rather important information in the system. Finally, they will likely simplify read queries on the table by avoiding a couple of joins.
The metadata will allow tracking user and/or source event information.
Even though this adds the information that would make a composite key possible, I would avoid using one; stick with GUID keys.
If you can't use a client-side GUID, then NEWSEQUENTIALID() on the EventID primary key is the way to go.
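As a rough sketch only, assuming the widened schema above and a client-generated GUID (the SQLite usage and the append_event helper are illustrative, not a prescribed implementation):

    import json
    import sqlite3
    import uuid
    from datetime import datetime, timezone

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE events (
            EventID       TEXT PRIMARY KEY,   -- client-generated GUID
            EventType     TEXT NOT NULL,
            EventDate     TEXT NOT NULL,
            AggregateType TEXT NOT NULL,
            AggregateId   TEXT NOT NULL,
            MetaData      TEXT,
            EventData     TEXT NOT NULL
        )
    """)
    # Extra index so a single aggregate's stream can be loaded without a full scan.
    conn.execute("CREATE INDEX ix_aggregate ON events (AggregateType, AggregateId)")

    def append_event(aggregate_type, aggregate_id, event_type, data, metadata=None):
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                str(uuid.uuid4()),                       # EventID generated client-side
                event_type,
                datetime.now(timezone.utc).isoformat(),  # EventDate
                aggregate_type,
                str(aggregate_id),
                json.dumps(metadata or {}),              # user/source info for auditing
                json.dumps(data),
            ),
        )

    append_event("Order", uuid.uuid4(), "OrderPlaced", {"total": 42}, {"user_id": "u-1"})

The (AggregateType, AggregateId) index is what lets you rehydrate one aggregate's stream cheaply while still keeping a single GUID primary key.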

How to model a relationship using event sourcing

In our scenario, we have a Course entity to represent course content. For each student attending a course, there is a CourseSession entity representing the learning progress of that student in the course, so there is a one-to-many relationship between Course and CourseSession. If we were using a relational database, there would be a course table and a course_session table, in which a course has a unique ID and a course session is uniquely identified by (courseId + studentId). We are trying to model this using event sourcing, and our event table looks like the following:
-----------------------------------------------------
| entity_type | entity_id | event_type | event_data |
-----------------------------------------------------
This is fine for storing Course: there is a courseId we can use as entity_id. But for CourseSession, there isn't an intrinsic id attribute; we have to use the concatenation of (courseId + studentId) as entity_id, which is not quite natural. Is there a better way to model this kind of relationship?
I’m not an expert, so take this answer with a grain of salt
But for CourseSession, there isn't an intrinsic id attribute, we have to use the concatenation of (courseId + studentId) as entity_id, which is not quite natural
It's normal to have a composite ID, and sometimes recommended, to keep your domain model aligned with the domain language.
The composite ID can be modeled as a Value Object: CourseSessionId { courseId: string, studentId: string }.
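A minimal sketch of such a Value Object in Python (the field and method names are only for illustration):

    from dataclasses import dataclass

    @dataclass(frozen=True)  # immutable, compared by value: a Value Object
    class CourseSessionId:
        course_id: str
        student_id: str

        def stream_name(self) -> str:
            # One possible way to derive a stream/entity_id from the composite ID.
            return f"course-session-{self.course_id}-{self.student_id}"

    assert CourseSessionId("math-101", "s-42") == CourseSessionId("math-101", "s-42")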
In addition to this domain-specific ID, you may need to add a surrogate ID to the entity to satisfy some infra requirements:
Some ORMs force you to have a numeric sequence ID
Some key-value stores require a ULID key
Some use cases call for a short, user-friendly ID
The surrogate ID is an infra detail and must be hidden as much as possible from the domain layer.
Is there a better way to model this kind of relationship?
The event sourcing pattern I saw in the DDD context suggests having a stream of events per aggregate.
In DDD, an aggregate can be considered as:
A subsystem within the bounded context
It has boundaries and invariants to protect its state
It’s represented by an entity (aggregate root) and can contain other entities and value-objects.
If you consider that the CourseSession entity belongs to the Course aggregate, then you should keep using the course ID as entity_id (or aggregate_id) for both Course- and CourseSession-related events.
In this case, the write model (main model) can easily build and present the Course / CourseSession relationship by replaying the Course stream.
Otherwise, you must introduce a read model, and define a projector that will subscribe to both Course and CourseSession streams, and build the needed views.
This read model can be queried directly, or by the Course and CourseSession aggregates' command handlers to make decisions, but keep in mind that it is often eventually consistent, and your business should be able to tolerate that.
Event sourcing is a different way of thinking about data. So the 'old' ways of thinking in terms of relationships don't really translate like that.
So the first point is that an event store isn't a table structure. It is a list of things that have happened in your system. The fact that a student spent time on a course is a thing which happened.
If you want or need to access the data in relationships like you describe, the easiest thing to do is to create a projection from the events which produces the data in the table form you are looking for.
However, as the projection is not the source of truth, why not create denormalised tables, so the database won't need to do any joins or other more complex work and your data will already be shaped as you need it for use in your application? This leads to super-fast, highly efficient read models.
Your users will thank you!
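As a sketch of what such a projection could look like, assuming hypothetical event names and an in-memory stand-in for the denormalised table:

    # Hypothetical events; the real names depend on your domain.
    events = [
        {"type": "CourseCreated", "course_id": "math-101", "title": "Algebra"},
        {"type": "CourseSessionStarted", "course_id": "math-101", "student_id": "s-42"},
        {"type": "LessonCompleted", "course_id": "math-101", "student_id": "s-42", "lesson": 1},
    ]

    # Denormalised read model: one row per (course, student), already shaped for queries.
    course_titles = {}
    sessions_by_key = {}

    for event in events:
        if event["type"] == "CourseCreated":
            course_titles[event["course_id"]] = event["title"]
        elif event["type"] == "CourseSessionStarted":
            key = (event["course_id"], event["student_id"])
            sessions_by_key[key] = {
                "course_id": event["course_id"],
                "course_title": course_titles.get(event["course_id"], ""),
                "student_id": event["student_id"],
                "lessons_completed": 0,
            }
        elif event["type"] == "LessonCompleted":
            key = (event["course_id"], event["student_id"])
            sessions_by_key[key]["lessons_completed"] += 1

    print(sessions_by_key)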

Read model with data from multiple aggregate roots (different contexts)

I'm curious how to join data from multiple aggregate roots in a read model for an event-sourced aggregate root. Let's take a simple example:
If I have an aggregate root called Cart which supports the following events in its event stream (properties in parentheses; keep in mind this is a simple example):
AddProductToCart(cartId: Int, productId: Int)
RemoveProductFromCart(cartId: Int, productId: Int)
AddUserLicenseToProduct(cartId: Int, productId: Int, userId: Int)
RemoveUserLicenseFromProduct(cartId: Int, productId: Int, userId: Int)
EmptyCart(cartId: Int)
It's ok when projecting read models with data coming from this event stream. I can for example project a cart object which looks something like this:
Cart(cartId: Int, products: List[Product])
Product(productId: Int, userLicenses: List[UserLicense])
UserLicense(userId: Int)
But how does one join data from another aggregate root in another context into this cart projection? For example, say I wanted to extend the read model with data from the Product aggregate root, which lives in another context: productName and productType.
Take into consideration that we are working in a distributed system, where Product and Cart would live in different services/applications.
I suppose one solution would be to include the data in the commands and events. But that doesn't seem to scale very well if one has larger read models with data from multiple aggregate roots. Also, one has to be able to nuke and rebuild the read model.
I suppose another solution would be to duplicate data from other aggregate roots into the storage of other applications/services/contexts. For example, duplicate the productName and productType data into storage owned by the Cart application, but not have it be a part of the Cart event stream. The Cart application would then have to listen to events (e.g. ProductCreated, ProductNameChanged) to keep the data updated. I guess this might be a viable solution.
Each bounded context should be loosely coupled. We had a similar issue with two of our contexts. The solution we found was to use workflows, putting all the communication between contexts in those files, where we could synchronize the required schemas by subscribing to an event handler. As we use Elixir, the library we have been using is Commanded, which has its own event bus.
In a distributed system you can also use Apache Kafka. At the end of the day, I think the simplest solution is to keep your schemas as clean as possible (which will also help you with GDPR compliance) and to manage all your communication through a separate layer with an event handler.
To see this solution in a "real-life" way, I can recommend a great example repository built with Elixir:
https://leanpub.com/buildingconduit/read
This question also comes up with event-driven architectures and not only event-sourcing. I reckon that you've covered most options in terms of capturing the relevant data from the producer of the event.
Another option would be for an event to contain as little data as possible from the related bounded context. At a minimum that would be an identifier. However, in most cases some of the data should be denormalized to make sense. For instance, having the product description denormalized into the Cart and the eventual Order would be helpful, especially when someone changes the description after I have made my choice. The description may change from Blue pen to Red pen, which would drastically alter what I intended to purchase. In this case the Product in your Shopping BC may be represented by a value object that contains the Id along with the Description.
If you would now like to augment the read-only data, we are left only with the option of retrieving it from the source BC. This can be done in the read model using some API (REST/ACL) and then saving the data. To make it more fault-tolerant, one may opt for a messaging/service-bus infrastructure to handle the retrieval of the additional data and the updating of the relevant read-model record.
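Pulling these ideas together, here is a rough sketch of a Cart-side subscriber keeping a locally denormalised copy of product data up to date from product events; all event and field names are assumptions for illustration:

    # Hypothetical local copy of Product data owned by the Cart service's read side.
    product_lookup = {}                      # product_id -> {"name": ..., "type": ...}
    cart_view = {"cart_id": 1, "lines": {}}  # product_id -> denormalised cart line

    def on_product_created(event):
        product_lookup[event["product_id"]] = {
            "name": event["product_name"],
            "type": event["product_type"],
        }

    def on_product_name_changed(event):
        product_lookup[event["product_id"]]["name"] = event["new_name"]
        # Propagate the change into any cart lines that already reference the product.
        line = cart_view["lines"].get(event["product_id"])
        if line:
            line["product_name"] = event["new_name"]

    def on_product_added_to_cart(event):
        product = product_lookup[event["product_id"]]
        cart_view["lines"][event["product_id"]] = {
            "product_name": product["name"],   # denormalised from the Product context
            "product_type": product["type"],
            "user_licenses": [],
        }

The same handlers could just as well be fed by a service bus or Kafka consumer; the point is that the Cart context owns its copy of the data and refreshes it from the Product context's events.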

Should aggregate model contain metadata?

I would like to clarify how the model of an aggregate should look.
I have a couple of events which contain data that won't ever be used for validation.
For example, metadata like user_id who triggered the action (auditing), correlation_id (observability), labels / flags.
They will be received within the command and sent out as properties of the event. They won't be lost, as each event is persisted. That's clear.
But should the aggregate object contain these values?
The projection will have them and will display them. Having them in the aggregate does not make sense, in my opinion.
Or maybe it does: if you want to create a snapshot, you need the properties of all events.
Thanks for your advice.
Aggregates should contain only as much information as is required to maintain consistency. If your business rules require user_id, then that information should be persisted in the aggregate. Otherwise, it should not.
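A small sketch of that idea, using a hypothetical Account aggregate: the metadata travels on the event (and is persisted with it), but it is never folded into the aggregate's state because no invariant depends on it:

    class Account:
        """Toy aggregate: it keeps only the state its invariants need (the balance)."""

        def __init__(self):
            self.balance = 0
            self.uncommitted_events = []

        def deposit(self, amount, metadata):
            self._record({"type": "MoneyDeposited", "amount": amount, "metadata": metadata})

        def withdraw(self, amount, metadata):
            if amount > self.balance:   # the only rule that needs aggregate state
                raise ValueError("insufficient funds")
            self._record({"type": "MoneyWithdrawn", "amount": amount, "metadata": metadata})

        def _record(self, event):
            self.apply(event)
            self.uncommitted_events.append(event)  # metadata persisted with the event

        def apply(self, event):
            # Only data needed for future decisions is folded into aggregate state;
            # user_id, correlation_id, etc. stay on the event and in the projections.
            if event["type"] == "MoneyDeposited":
                self.balance += event["amount"]
            elif event["type"] == "MoneyWithdrawn":
                self.balance -= event["amount"]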

Stream aggregate relationship in an event sourced system

So I'm trying to figure out the structure behind general use cases of a CQRS+ES architecture and one of the problems I'm having is how aggregates are represented in the event store. If we divide the events into streams, what exactly would a stream represent? In the context of a hypothetical inventory management system that tracks a collection of items, each with an ID, product code, and location, I'm having trouble visualizing the layout of the system.
From what I could gather on the internet, it can be described succinctly as "one stream per aggregate." So I would have an Inventory aggregate, a single stream with ItemAdded, ItemPulled, ItemRestocked, etc. events, each with serialized data containing the item ID, quantity changed, location, etc. The aggregate root would contain a collection of InventoryItem objects (each with its respective quantity, product code, location, etc.). That seems like it would allow for easily enforcing domain rules, but I see one major flaw to this: when applying those events to the aggregate root, you would have to first rebuild that collection of InventoryItem. Even with snapshotting, that seems to be very inefficient with a large number of items.
Another method would be to have one stream per InventoryItem, tracking all events pertaining to only that item. Each stream is named with the ID of that item. That seems like the simpler route, but now how would you enforce domain rules like ensuring product codes are unique or that you're not putting multiple items into the same location? It seems like you would now have to bring in a read model, but isn't the whole point to keep commands and queries separate? It just feels wrong.
So my question is 'which is correct?' Partially both? Neither? Like most things, the more I learn, the more I learn that I don't know...
In a typical event store, each event stream is an isolated transaction boundary. Any time you change the model you lock the stream, append new events, and release the lock. (In designs that use optimistic concurrency, the boundaries are the same, but the "locking" mechanism is slightly different).
You will almost certainly want to ensure that any aggregate is enclosed within a single stream -- sharing an aggregate between two streams is analogous to sharing an aggregate across two databases.
A single stream can be dedicated to a single aggregate, to a collection of aggregates, or even to the entire model. Aggregates that are part of the same stream can be changed in the same transaction -- huzzah! -- at the cost of some contention and a bit of extra work to do when loading an aggregate from the stream.
The most commonly discussed design assigns each logical stream to a single aggregate.
That seems like it would allow for easily enforcing domain rules, but I see one major flaw to this: when applying those events to the aggregate root, you would have to first rebuild that collection of InventoryItem. Even with snapshotting, that seems to be very inefficient with a large number of items.
There are a couple of possibilities; in some models, especially those with a strong temporal component, it makes sense to model some "entities" as a time series of aggregates. For example, in a scheduling system, rather than Bob's Calendar you might instead have Bob's March Calendar, Bob's April Calendar, and so on. Chopping the life cycle into smaller installments can keep the event count in check.
Another possibility is snapshots, with an additional trick to it: each snapshot is annotated with metadata that describes where in the stream the snapshot was made, and you simply read the stream forward from that point.
This, of course, depends on having an implementation of an event stream that supports random access, or an implementation of the stream that allows you to read last-in, first-out.
Keep in mind that both of these are really performance optimizations, and the first rule of optimization is... don't.
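A sketch of the snapshot-plus-read-forward loading described above; the event_store and snapshot_store interfaces here are hypothetical:

    def load_aggregate(event_store, snapshot_store, stream_id, apply):
        """Rebuild state from the latest snapshot, then replay only the newer events.

        Assumed interfaces: snapshot_store.latest(stream_id) returns None or
        (state, version); event_store.read(stream_id, from_version) yields events
        in emission order starting at from_version.
        """
        snapshot = snapshot_store.latest(stream_id)
        if snapshot is None:
            state, version = None, 0
        else:
            # The snapshot's metadata (its version) says where to resume reading.
            state, version = snapshot

        for event in event_store.read(stream_id, from_version=version + 1):
            state = apply(state, event)
            version += 1
        return state, version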
So I'm trying to figure out the structure behind general use cases of a CQRS+ES architecture and one of the problems I'm having is how aggregates are represented in the event store
The event store in a DDD project is designed around event-sourced Aggregates:
it provides efficient loading of all events previously emitted by an Aggregate root instance with a given ID
those events must be retrieved in the order they were emitted
it must not permit concurrent appends of events for the same Aggregate root instance
all events emitted as a result of a single command must be appended atomically; that is, they should all succeed or all fail
The 4th point could be implemented using transactions, but this is not a necessity. In fact, for scalability reasons, if you can, you should choose a persistence mechanism that provides atomicity without the use of transactions. For example, you could store the events in a MongoDB document, as MongoDB guarantees document-level atomicity.
The 3rd point can be implemented using optimistic locking, using a version column with a unique index over (version x AggregateType x AggregateId).
At the same time, there is a DDD rule regarding Aggregates: don't mutate more than one Aggregate per transaction. This rule helps you a lot in designing a scalable system. Break it only if you don't need a scalable system.
So, the solution to all these requirements is something called an event stream, which contains all the events previously emitted by an Aggregate instance.
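As an illustration only, a toy in-memory event stream that satisfies the four points above, with an optimistic-concurrency check standing in for the unique version index:

    class ConcurrencyError(Exception):
        pass

    class InMemoryEventStream:
        """Per-aggregate event stream with optimistic concurrency (sketch only)."""

        def __init__(self):
            self.events = []  # ordered; len(events) is the current version

        def append(self, new_events, expected_version):
            # Point 3: reject a writer that read an older version of the stream.
            if len(self.events) != expected_version:
                raise ConcurrencyError(
                    f"expected version {expected_version}, actual {len(self.events)}"
                )
            # Point 4: the command's events are appended all together or not at all.
            self.events.extend(new_events)

        def read(self):
            # Points 1 and 2: all events of the aggregate, in emission order.
            return list(self.events)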
So I would have an Inventory aggregate
DDD takes higher precedence than the event store. So, if you have business rules that force you to decide that you must have a (big) Inventory aggregate, then yes, it would load ALL the events previously generated by itself. In that case the InventoryItem would be a nested entity that cannot emit events by itself.
That seems like it would allow for easily enforcing domain rules, but I see one major flaw to this: when applying those events to the aggregate root, you would have to first rebuild that collection of InventoryItem. Even with snapshotting, that seems to be very inefficient with a large number of items.
Yes, indeed. The simplest thing would be for us all to have a single Aggregate, with a single instance. Then consistency would be the strongest possible. But this is not efficient, so you need to think harder about the real business requirements.
Another method would be to have one stream per InventoryItem, tracking all events pertaining to only that item. Each stream is named with the ID of that item. That seems like the simpler route, but now how would you enforce domain rules like ensuring product codes are unique or that you're not putting multiple items into the same location?
There is another possibility. You should model the assignment of product codes as a Business Process. For this you could use a Saga/Process manager that would orchestrate the entire process. This Saga could use a collection with a unique index on the product code column in order to ensure that only one product uses a given product code.
You could design the Saga either to permit the allocation of an already-taken code to a product and compensate later, or to reject the invalid allocation in the first place.
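A rough sketch of such a Saga, with a plain dict standing in for the collection with the unique index on the product code column (all names are hypothetical):

    class ProductCodeAllocationSaga:
        """Sketch: reserves product codes so that at most one product owns each code."""

        def __init__(self):
            # Stand-in for a collection with a unique index on the product-code column.
            self.code_owner = {}

        def handle_code_requested(self, product_id, code):
            owner = self.code_owner.get(code)
            if owner is None:
                self.code_owner[code] = product_id
                return {"type": "ProductCodeAllocated", "product_id": product_id, "code": code}
            if owner != product_id:
                # Alternative design: allocate anyway and emit a compensating event later.
                return {"type": "ProductCodeRejected", "product_id": product_id, "code": code}
            return None  # idempotent re-request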
It seems like you would now have to bring in a read model, but isn't the whole point to keep commands and queries separate? It just feels wrong.
The Saga does indeed use private state maintained from the domain events, in an eventually consistent way, just like a read model, but this does not feel wrong to me. It may use whatever it needs in order to (eventually) bring the system as a whole to a consistent state. It complements the Aggregates, whose purpose is to not allow the building blocks of the system to get into an invalid state.

CQRS & event sourcing: can I use an auto-incremented INT as the aggregate ID?

I am working on a legacy project and trying to introduce CQRS in some places where it's appropriate. In order to integrate with all of the legacy which is relational I would like to project my aggregate (or part of it) into a table in the relational database.
I would also like the aggregate ID to be the auto-incremented value on that projected table. I know this seems like going against the grain, since it mixes the read model with the write model. However, I don't want to pollute the legacy schema with GUID foreign keys.
Would this be a complete no-no, and if so, what would you suggest?
Edit: Maybe I could just store the GUID in the projected table; that way, when the events get projected, I can identify the row to update, but still have an auto-incremented column for joining on?
There is nothing wrong with using an id created by the infrastructure layer for your entities. This pattern is commonly used in Vaughn Vernon's 'Implementing DDD' book:
Get the next available ID from the repository.
Create an entity.
Save the entity in the repository.
Your problem is that you want to use an id created in another Bounded Context. That is a huge and complete no-no, not the fact that the id is created by the Infrastructure Layer.
You should create the id in your Bounded Context and use it to reference the aggregate from other Contexts (just as you wrote when you edited your question).
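For illustration, a sketch of a projected table that carries both the write-model GUID (used to find the row when projecting events) and an auto-incremented surrogate column for the legacy joins; the SQLite schema and event shapes are assumptions:

    import sqlite3
    import uuid

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE cart_projection (
            row_id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key for legacy joins
            aggregate_id TEXT NOT NULL UNIQUE,               -- GUID owned by the write model
            total        INTEGER NOT NULL
        )
    """)

    def project_cart_created(event):
        conn.execute(
            "INSERT INTO cart_projection (aggregate_id, total) VALUES (?, ?)",
            (event["aggregate_id"], 0),
        )

    def project_item_added(event):
        # The GUID identifies which projected row to update ...
        conn.execute(
            "UPDATE cart_projection SET total = total + ? WHERE aggregate_id = ?",
            (event["price"], event["aggregate_id"]),
        )

    project_cart_created({"aggregate_id": str(uuid.uuid4())})

... while the auto-incremented row_id is what the legacy relational tables can reference, keeping GUIDs out of the legacy schema.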
