Background
I have a system using CQRS + ES. In this system there are aggregates such as blog posts or issues that are persisted in the event store and send events over to the query side, where projections persist the read model.
The case of an issue or post being created is fairly straightforward:
Client creates a command to create a new issue
A command handler creates a new issue aggregate and saves the changes in the event store
When the aggregate applies the change it fires off an IssueCreatedEvent or similar
A projection on the read side will listen for this and create an issue model and any other denormalised data that it wants (such as a cutdown IssueListItem for querying a list of all issues)
If a change is made to the issue, an appropriate event such as IssueStatusChanged is raised on the write side and handled on the read side accordingly: load up the two denormalised models on the read side, update the status from the event, and save. Easy.
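To make that flow concrete, here is a minimal sketch of such a projection. All of the type names (IssueCreatedEvent, IssueStatusChanged, IssueDetails, IssueListItem) and the in-memory dictionaries are illustrative stand-ins, not part of the actual system:

using System;
using System.Collections.Generic;

// Hypothetical events and denormalised view rows for the sketch.
public class IssueCreatedEvent { public Guid AggregateId; public string Title; }
public class IssueStatusChanged { public Guid AggregateId; public string NewStatus; }
public class IssueDetails { public Guid Id; public string Title; public string Status; }
public class IssueListItem { public Guid Id; public string Title; public string Status; }

public class IssueProjection
{
    // In-memory stand-ins for whatever read store is actually used (SQL tables, documents, ...).
    private readonly Dictionary<Guid, IssueDetails> _details = new Dictionary<Guid, IssueDetails>();
    private readonly Dictionary<Guid, IssueListItem> _list = new Dictionary<Guid, IssueListItem>();

    public void When(IssueCreatedEvent e)
    {
        // One event feeds every denormalised view that needs it.
        _details[e.AggregateId] = new IssueDetails { Id = e.AggregateId, Title = e.Title, Status = "Open" };
        _list[e.AggregateId] = new IssueListItem { Id = e.AggregateId, Title = e.Title, Status = "Open" };
    }

    public void When(IssueStatusChanged e)
    {
        // Load the denormalised models, update the status from the event, save.
        _details[e.AggregateId].Status = e.NewStatus;
        _list[e.AggregateId].Status = e.NewStatus;
    }
}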
How do you handle relations such as comments?
I am implementing a comment system where users can post comments on an issue or a blog post. My first thought was to have these comments added to the issue or post aggregate on the write side for consistency. When I thought about this though I realised that this would likely introduce a lot of unnecessary concurrency issues like when somebody is updating an issue and someone else has come and posted a new comment.
This led me to think that I should model comments themselves as their own aggregate root. This way comments posted to a blog post or issue will not cause conflicts with the issue itself.
So assuming I model comments on the write side as aggregates in this way, I have two questions:
1) Does the issue or post aggregate on the write side still need to store this relationship? The comment aggregate itself already stores which item it was posted to with an id reference.
If so, I was thinking of having the issue aggregate subscribe to the comment created event and add its own reference.
public class Issue : AggregateRoot, IEventHandler<CommentCreatedEvent>
{
    // Ids of comments posted against this issue.
    private readonly ICollection<Guid> _comments = new List<Guid>();

    public void Handle(CommentCreatedEvent @event)
    {
        _comments.Add(@event.AggregateId);
    }
}
Is this sufficient or not needed since the comment already stores a reference to its parent? This data isn't really needed on the write side and is more important on the read side when it is the parent that is loaded up with all comments.
2) On the read side what is the best way to store this data?
Specifically, in order to make this data easy to update I would need to put in another table for comments and join them to the appropriate post or issue. After I have finished with comments I will be implementing a following system where users can follow an item to receive updates. Going down this path however will very quickly lead me back to a highly normalised schema on the read side which defeats the purpose of an optimised, denormalised read model.
I was therefore thinking of adding a single column to the issue table, for example, that stores all comments as a serialised JSON CLOB or something similar. That way, when changes come in to comments I can still pull out one record to load up the issue, make the appropriate changes to the comments (such as updating an existing comment, adding a new comment or removing one) and re-save the record. From a read perspective, the entire issue can still be retrieved in one go.
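As a rough sketch of that idea (the CommentEntry shape and helper names below are hypothetical, not part of the actual system), the projection would deserialise the column, mutate the comment list, and write it back:

using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical shape of one comment inside the serialised column.
public class CommentEntry
{
    public Guid Id { get; set; }
    public string AuthorName { get; set; }
    public string Body { get; set; }
}

public static class IssueCommentsColumn
{
    // Add a comment to the serialised list and return the new column value.
    public static string AddComment(string commentsJson, CommentEntry newComment)
    {
        var comments = string.IsNullOrEmpty(commentsJson)
            ? new List<CommentEntry>()
            : JsonSerializer.Deserialize<List<CommentEntry>>(commentsJson);
        comments.Add(newComment);
        return JsonSerializer.Serialize(comments);
    }

    // Remove a comment by id and return the new column value.
    public static string RemoveComment(string commentsJson, Guid commentId)
    {
        var comments = JsonSerializer.Deserialize<List<CommentEntry>>(commentsJson);
        comments.RemoveAll(c => c.Id == commentId);
        return JsonSerializer.Serialize(comments);
    }
}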
The problem I see with this approach is that if a user changes their profile picture or profile name for example, I would have to load up every single issue and/or post, load up the comments and make the appropriate changes in the comment info.
I also wonder how document databases (something else I have been considering for the read side) get around this issue of updating nested data?
I'm a bit late to the party; however, here's my take on question 2.
The best way to store the read model is in a way that makes it very easy to query. A document DB can be a good technical solution, but it works with an RDBMS as well, provided you have the relevant read model schema defined.
You can store all comments together with the post; however, this is not always what you want, because high-traffic sites load comments separately from the posts via AJAX. So it really depends on the read model use cases.
Question 1: No need to have the relationship in Issue. There is no particular consistency to be protected here.
Question 2: I'm currently reading NoSQL Distilled. It seems a column-family database like Cassandra is suitable for the comments.
| issueId | name                          | comments     |
| 1       | comments persistence solution | {c1, c2, c3} |
You could use the Cassandra API or the Cassandra Query Language to retrieve a subset of the comments or the entire comments column.
UPDATE
Is that comments column just a serialised collection of ids, or the comments in their entirety?
No, the comments are stored as columns in a row. Cassandra supports nested columns, so the comments column may have a structure like this:
| other columns | comments                     |
| ............  | c1   | c2         | c3       |
|               | "+1" | "Nice one" | "+1"     |
You can get and set any single comment on its own in Cassandra, if I'm not mistaken. In this case, you can update any one comment, or you can get the comments column to retrieve all comments.
Question 1:
You mustn't handle events in your aggregate root. It's a bad idea that breaks DDD principles. If Comments live in a different aggregate, then any consequence in the Issue aggregate must eventually be handled by some kind of process manager, domain service or saga in your domain.
If possible in your domain, you should state that Issue doesn't know about Comments (I guess that's the natural way of thinking here), so you shouldn't keep any reference of that kind.
Comments, on the other hand, can keep a reference to the issue they are related to.
Question 2: why don't you keep all the fields you need from the Issue/post in your Comments table (handling the Issue/post updates)? This frees you from joining between these two tables when querying your read model.
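As a sketch of what such a denormalised comments table could look like (the field names are illustrative, not from the question), each comment row carries the issue/post fields the UI needs, so no join is ever required:

using System;

public class CommentReadModel
{
    public Guid CommentId { get; set; }
    public Guid IssueId { get; set; }
    public string IssueTitle { get; set; }   // copied from the issue; updated when issue-change events arrive
    public string AuthorName { get; set; }   // copied from the user profile; updated on profile-change events
    public string Body { get; set; }
    public DateTime PostedAt { get; set; }
}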
Related
I'm working on a MEAN stack project. I use too many collections in my aggregation, so I use a lot of lookups, which negatively impacts performance and makes the aggregation very slow. I was wondering if you have any suggestions. I found that we can reduce lookups by creating, for each collection I need, an array of objects inside a global collection; however, I'm looking for an optimal and secure solution.
For information, I have defined indexes on all collections in Mongo.
Thanks for sharing your ideas!
This is a very involved question. Even if you gave all your schemas and queries, it would take too long to answer, and be very specific to your case (ie. not useful to anyone else coming along later).
Instead for a general answer, I'd advise you to read into denormalization and consider some database redesign if this query is core to your project.
Here is a good article to get you started.
Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
A simple example to outline it:
Say you have a Blog with a comment collection, and a user collection
You want to display the comment with the name of the user, so you have to load the user for every comment.
Instead you could save the username on the comment collection as well as the user collection.
Then you will have a fast query to show comments, as you don't need to load the users too. But if the user changes their name, then you will have to update all of the comments with the new name. This is the main tradeoff.
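A rough sketch of that tradeoff using the MongoDB .NET driver (the collection and field names here are assumptions): reads stay join-free, and a rename pays the fan-out cost once.

using MongoDB.Bson;
using MongoDB.Driver;

public class UserRenameHandler
{
    private readonly IMongoCollection<BsonDocument> _users;
    private readonly IMongoCollection<BsonDocument> _comments;

    public UserRenameHandler(IMongoDatabase db)
    {
        _users = db.GetCollection<BsonDocument>("users");
        _comments = db.GetCollection<BsonDocument>("comments");
    }

    // Denormalisation tradeoff: the rename fans out to every comment
    // that carries a copy of the username.
    public void Rename(string userId, string newName)
    {
        _users.UpdateOne(
            Builders<BsonDocument>.Filter.Eq("_id", userId),
            Builders<BsonDocument>.Update.Set("username", newName));

        _comments.UpdateMany(
            Builders<BsonDocument>.Filter.Eq("userId", userId),
            Builders<BsonDocument>.Update.Set("username", newName));
    }
}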
If a DB redesign is too difficult, I suggest splitting into multiple aggregates and combining them in memory (ie. in your node server side code)
I have two sets of data in the same collection in Cosmos: one is 'posts' and the other is 'users'; they are linked by the posts users create.
Currently my structure is as follows;
// user document
{
    id: 123,
    postIds: ['id1', 'id2']
}

// post documents
{
    id: 'id1',
    ownerId: 123
}
{
    id: 'id2',
    ownerId: 123
}
My main issue with this setup is its fragile nature: code has to enforce the link, and if there's a bug, data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As said by David, it's a long discussion, but it is a very common one, so, since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it's able to store the entire document, and just store the document ALONG WITH all the owner data: name, surname, and all the other data you have about the user who created the document. Yes, I'm saying that you may want to evaluate not having posts and users, but just posts, with the user info inside them. This may actually be very correct, as you will be sure to get the EXACT data for the user as they existed at the moment of post creation. Say, for example, I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
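A small sketch of that "just posts" shape (the property names are illustrative): the author data is snapshotted into the post at creation time, so each post keeps the author exactly as they were at that moment.

// Denormalised post document: author data is copied into the post.
public class PostDocument
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public PostAuthor Author { get; set; }
}

public class PostAuthor
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string Biography { get; set; }   // the biography as it was when the post was written
}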
Of course you may also want to display a biography on an author page. In this case you'll have a problem: which one will you use? Probably the latest one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want an author to write their biography and be listed in your system even before they write a blog post.
In such a case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When the author updates their own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let's assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check whether there are any blog posts by him/her that would otherwise be left orphaned.
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
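A sketch of such a tracking document (names are assumptions): one per author, listing their post ids, updated by a trigger or by application code whenever a post is created or deleted.

using System.Collections.Generic;

// One document per author, kept up to date alongside the posts themselves.
public class AuthorPostsIndex
{
    public string Id { get; set; }            // e.g. "author-posts-<authorId>"
    public string AuthorId { get; set; }
    public List<string> PostIds { get; set; } = new List<string>();
}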
PERFORMANCE
Performance COULD be an issue, but you don't usually model in order to support performance in the first place. You model in order to make sure your model can represent and store the information you need from the real world, and then you optimize it in order to get decent performance with the database you have chosen to use. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old "logical" vs "physical" modeling discussion.
In Cosmos DB's case, you should not have queries that go cross-partition, as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to be clear about the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will only be one if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your values will be spread across 5 partitions. Each partition will have a top limit of 2,000 RU. If all your queries use just one partition, your real maximum throughput is 2,000 RU and not 10,000.
I really hope this helps you start to figure out the answer. And I really hope it helps foster and grow a discussion (how to model for a document database) that I think is really due and mature now.
Hi, it's my first time with DDD/CQRS. I've read multiple sources of knowledge and I'm still a bit confused; maybe someone could help :)
Let's assume a simple case in which we have products and clients (possibly different bounded contexts).
A client can buy a product and he wants to see all products that he purchased.
In this case I realize I need a UserPurchasesView view model with:
purchaseId (which is a mongo primary key)
userId,
product: {id, name, image, shortDescription, [maybe some others]}
price
timestamp
Now... the problem is that my domain produces an event like UserPurchasedProduct(userId, productId). I could enrich the event with a price, product name or maybe something else, but not all fields. I'm getting to the point where enriching seems to be wrong.
At this point I realize I need something like ProductDetailsView:
productId (primary key)
price
name
shortDescription
logo
This view is maintained by events like: ProductCreated, ProductRenamed, ProductImageChanged
And now we have 2 options ...
Look into the ProductDetailsView when a UserPurchasedProduct event comes in, take all the needed product details and save them in the UserPurchasesView for faster reads. This solution doesn't look that bad, but it introduces some extra coupling, and it seems to me these views cannot be scaled well when needed. Also, both views must be rebuilt together when replaying all events from the event store (rebuilding is also more tricky in that case).
Keep only the productId in the UserPurchasesView and read multiple views when the user queries his purchases. This is some extra processing that would have to be done somewhere: in the frontend, in the backend controller or in some high-level read model API. UPDATE: I also realized that I would need to keep at least the price and maybe the name of the product in the UserPurchasesView (in case it changes), but sometimes you need the value from the time of the purchase and sometimes you need the current value. The scenario depends on the business, but we can imagine both.
None of these solutions looks perfect to me. Am I wrong, am I missing something or is it just the way to do it? Thanks!
You understand well.
So you have to choose between coupling between the read models and coupling between the UI and individual read models.
One of the main advantages of CQRS/ES is the possibility to create blazing fast read models (views if you like), without any joins, the perfect cache as I have seen it called. I have personally chosen the first approach every time, with full data denormalisation. The views are very fast and the models very clean and clear. This is the perfect solution if you want to optimize the read side of your application (and I think you should).
By listening to the right events you can keep these read models in sync with the rest of the application.
There is a 3rd option:
The projection responsible for the UserPurchasesView view not only listens to UserPurchasedProduct events, but also to ProductCreated, ProductRenamed, ProductImageChanged - any product related events that affect the UserPurchasesView. Now, as well as the UserPurchasesView collection for the read model that it is responsible for, it also needs a private collection to maintain the bits of products it is interested in: ({id, name, image, shortDescription, [maybe some others]}), so that when a new purchase event comes in, you have somewhere to get the initial state of those product fields from. Since your UserPurchasesView needs to listen to some of those product events anyway in order to keep up to date when a product changes, this isn't really much extra work, and avoids any dependency on another projection (ProductDetailsView). The cross-projection dependency also has a potential problem due to eventual consistency - what if the product isn't even in the product details view yet when the UserPurchasedProduct event comes through?
To avoid any concurrency issues, it's simplest to have each projection managed only by a single process and a single thread. That way, as long as the projection can receive events in-order across streams (so that it is guaranteed to see the product creation before the product purchase), you won't have issues with seeing a purchase before the product exists. If you introduce sharding or any other multi-threading to your projection, it gets more complicated.
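A compact sketch of that third option (all type names are hypothetical, not from the question): the projection keeps its own private copy of the product fields it needs, fed by the product events, and reads from it when a purchase comes in.

using System;
using System.Collections.Generic;

// Hypothetical events and view rows for the sketch.
public class ProductCreated { public Guid ProductId; public string Name; public decimal Price; public string Image; }
public class ProductRenamed { public Guid ProductId; public string NewName; }
public class UserPurchasedProduct { public Guid PurchaseId; public Guid UserId; public Guid ProductId; public DateTime At; }
public class ProductInfo { public string Name; public decimal Price; public string Image; }
public class PurchaseRow { public Guid PurchaseId; public Guid UserId; public string ProductName; public decimal Price; public DateTime At; }

public class UserPurchasesProjection
{
    // Private lookup owned by this projection, built from product events.
    private readonly Dictionary<Guid, ProductInfo> _products = new Dictionary<Guid, ProductInfo>();
    // The read model this projection is responsible for.
    private readonly List<PurchaseRow> _purchases = new List<PurchaseRow>();

    public void When(ProductCreated e)
    {
        _products[e.ProductId] = new ProductInfo { Name = e.Name, Price = e.Price, Image = e.Image };
    }

    public void When(ProductRenamed e)
    {
        _products[e.ProductId].Name = e.NewName;
    }

    public void When(UserPurchasedProduct e)
    {
        // The product is already in the private lookup as long as the projection
        // receives events in order across streams (creation before purchase).
        var product = _products[e.ProductId];
        _purchases.Add(new PurchaseRow
        {
            PurchaseId = e.PurchaseId,
            UserId = e.UserId,
            ProductName = product.Name,
            Price = product.Price,
            At = e.At
        });
    }
}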
This is a "language agnostic" question.
I started to study the CQRS pattern.
I have a simple question: am I right in supposing I need two different storage layers, one relational for the commands (MySQL etc.) and one NoSQL (Mongo, Cassandra, etc.) for the queries?
Let me explain a little example:
1) As a user I want to insert a "Todo task"
Command: "Create Task" and will insert a new task into a database which have the User and the Todo tables.
2) As a user I'm able to see a list of created task
Query: "GetTasks" that will return a "view" with a collection of task taken from a non sql table named "UserTasks" which have a user and a list of created task.
Is the right approach? I'm sorry if the language is poor, it's just a little example.
If it seems a good approach (again, don't consider details) what is the best approach to keep updated the data stores?
I'm thinking to raise an event like "TaskCreated" and take the new task and insert those information in the nosql storage.
Thanks!
I can't really understand what you're looking for, but... typically, a command would be something that results in side effects. Queries don't cause side effects. GetTasks wouldn't really be a command, but a query.
Your "CreateTask" would be a command, which would result in the task added to the relevant data store(s). Your GetTasks query would retrieve that information from a datastore. It doesn't really matter if you're using a SQL or NoSQL store for this.
The "CommandStore" is typically the store that has just enough data to enforce invariants. In your case, what data is required for that? Is some information required to decide whether or not a task can be registered? For example, say, you have a requirement that a user can have at most 3 "todo"s. In this case, a table in the "Command Store" storing (UserId, Todo Count) is enough. You could also use (UserId, [TodoId]) - ie. store a list of todo ids so that you can gain idempotence. All other information about the user and tasks would be query data, and would be in the query store.
Hope that makes sense.
While there are times when you may wish to store commands, you generally don't. Rather, a popular approach is to store the domain events that occur as a result of the commands. This is referred to as Event Sourcing. This would make 'STOREA' a store of events, or to put it another way, an event stream. 'STOREB' is typically referred to as the Read Model. It has a de-normalised structure optimised for read speed. It is kept up to date via de-normalisers which respond to specific events. A key point to note here is that there is often a lag between the event being raised and the read model being updated. This in my opinion is a good thing, but it needs to be thought about when designing the UI.
For more info take a look at CQRS – A Step-by-Step Guide to the Flow of a typical Application
I hope that helps
We have about 55,000,000 documents in our Elasticsearch instance. We have CSV files with user_ids; the biggest CSV has 9M entries. Our documents have user_id as the key, so this is convenient.
I am posting the question because I want to discuss what the best option is to get this done, as there are different ways to address this problem. We need to add the new "label" to a document if the user document doesn't have it yet, e.g. tagging the user with "stackoverflow" or "github".
There is the classic partial update endpoint. This sounds very slow, as we would need to iterate over 9M user_ids and issue an API call for each of them.
There is the bulk request, which provides better performance but is limited to roughly 1,000-5,000 documents per call, and knowing when a batch is too large is know-how we would need to learn on the go.
Then there is the official open issue for the /update_by_query endpoint, which has lots of traffic but no confirmation that it was implemented in the standard release.
On this open issue there is a mention of an update_by_query plugin, which should provide better handling, but there are old and open issues where users complain of performance problems and memory issues.
I am not sure if it's doable in ES, but I thought I would load all the CSV entries into a separate index, somehow join the two indexes, and apply a script that adds the tag if it doesn't exist yet.
So the question remains: what's the best way to do this? And if some of you have done this in the past, please share your numbers/performance and what you would do differently this time.
While waiting for update by query support, I have opted for:
Use the scan/scroll API to loop over the document IDs you want to tag (related answer).
Use the bulk API to perform partial updates to set the tag on every matching doc.
Additionally I store the tag data (your CSV) in a separate doc type, and query from that and tag all new docs as they are created, i.e., to not have to first index and then update.
Python snippet to illustrate the approach:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()
# myquery, myindex, tags and args are defined elsewhere in the script
# (e.g. tags comes from the CSV, args from argparse).

def actiongen():
    # Scroll over all matching documents, fetching only their ids.
    docs = helpers.scan(es, query=myquery, index=myindex, fields=['_id'])
    for doc in docs:
        # Emit one partial-update action per document, setting the tags field.
        yield {
            '_op_type': 'update',
            '_index': doc['_index'],
            '_type': doc['_type'],
            '_id': doc['_id'],
            'doc': {'tags': tags},
        }

helpers.bulk(es, actiongen(), index=args.index, stats_only=True)
Using the aforementioned update-by-query plugin, you would simply call:
curl -XPOST localhost:9200/index/type/_update_by_query -d '{
    "query": {
        "filtered": {
            "filter": {
                "not": {"term": {"tag": "github"}}
            }
        }
    },
    "script": "ctx._source.label = \"github\""
}'
The update-by-query plugin only accepts a script, not partial documents.
As for performance and memory issues, I guess the best thing is to give it a try.
I'd go with the bulk API with the caveat that you should try to update each document the minimal number of times. Updates are just atomic deletes and adds and leave behind the deleted document as a tombstone until it can be merged out.
Sending a Groovy script to execute the update probably makes the most sense here, so you don't have to fetch the document first.
Could you create a Parent/Child relationship, whereby you add a 'tags' type which references your 'posts' type as its parent? This way you wouldn't need to perform a full reindex of your data - simply index each of the appropriate tags against the appropriate post ID.
A very old thread. I landed here through the GitHub page for implementing "update by query", to see if it was implemented in 2.0, but unluckily it was not. Thanks to the plugin from Teka, if the update is small it is very much doable from Sense, but our use case was to update millions of documents daily based on certain complex queries. In the end, we moved to the es-hadoop connector. Although the infrastructure is a big, big overhead here, parallelizing the process of fetching/updating/inserting documents through Spark helped us anyhow. If anyone has discovered any other suggestion :) in the past year, I would love to hear about it.