Domain Objects containing lots of Data

Our Domain has a need to deal with large amounts (possibly more than 1000 records' worth) of objects as domain concepts. This is largely historical data that Domain business logic needs to use. Normally this kind of processing depends on a Stored Procedure or some other service to do the work, but since it is all intimately Domain-related, and we want to maintain the validity of the Model, we'd like to find a solution that allows the Aggregate to manage all of the business logic and rules required to work with the data.
Essentially, we're talking about past transaction data. Our idea was to build a lightweight class and create an instance for each transaction we need to work with from the database. We're uncomfortable with this because of the volume of objects we'd be instantiating and the potential performance hit, but we're equally uncomfortable with offloading this Domain logic to a stored procedure since that would break the consistency of our Model.
Any ideas on how we can approach this?

"1000" isn't really that big a number when it comes to simple objects. I know that a given thread in the system I work on may be holding on to tens of thousands of domain objects at a given time, all while other threads are doing the same at the same time. By the time you consider all of the different things going on in a reasonably complicated application, 1000 objects is kind of a drop in the bucket.
YMMV depending on what sort of resources those objects are holding on to, system load, hard performance requirements, or any number of other factors, but if, as you say, they're just "lightweight" objects, I'd make sure you actually have a performance problem on your hands before you try getting too fancy.

Lazy loading is one technique for mitigating this problem, and most of the popular object-relational mapping (ORM) solutions implement it. It has detractors (for example, see this answer to Lazy loading - what's the best approach?), but others consider lazy loading indispensable.
Pros
Can reduce the memory footprint of your aggregates to a manageable level.
Lets your ORM infrastructure manage your units of work for you.
In cases where you don't need a lot of child data, it can be faster than fully materializing ("hydrating") your aggregate root.
Cons
Chattier than materializing your aggregates all at once. You make a lot of small trips to the database.
Usually requires architectural changes to your domain entity classes, which can compromise your own design. (For example, NHibernate just requires you to expose a default constructor and make your entities' members virtual to take advantage of lazy loading - but I've seen other solutions that are much more intrusive.)
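To make the mechanics concrete, here is a minimal sketch of the same idea using JPA/Hibernate annotations in Java (NHibernate is the ORM named above; the PurchaseOrder/OrderLine entities are invented for the example, not taken from the question):

```java
import jakarta.persistence.*;
import java.util.List;

@Entity
public class PurchaseOrder {
    @Id
    @GeneratedValue
    private Long id;

    // FetchType.LAZY tells the ORM to defer loading the child collection;
    // the aggregate root can be materialized without its 1000+ children.
    @OneToMany(mappedBy = "order", fetch = FetchType.LAZY)
    private List<OrderLine> lines;

    protected PurchaseOrder() {} // default constructor required by the ORM

    public List<OrderLine> getLines() {
        return lines; // first access here triggers the actual SELECT
    }
}

@Entity
class OrderLine {
    @Id
    @GeneratedValue
    private Long id;

    @ManyToOne
    private PurchaseOrder order;

    protected OrderLine() {}
}
```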
By contrast, another approach would be to create multiple classes to represent each entity. These classes would essentially be partial aggregates tailored to specific use cases. The main drawback to this is that you risk inflating the number of classes and the amount of logic that your domain clients need to deal with.

When you say 1000 records worth, do you mean 1000 tables or 1000 rows? How much data would be loaded into memory?

It all depends on the memory footprint of your objects. Lazy loading can indeed help, if the objects in question reference other objects which are not of interest in your process.
If you end up with a performance hog, you must ask yourself (or perhaps your client) if the process must run synchronously, or if it can be offloaded to a batch process somewhere else.
Using DDD, How Does One Implement Batch Processing?


DDD: creating multiple aggregates with a shared life-cycle in a single transaction

I'm aware of the general rule that only a single aggregate should be modified per transaction, mostly for concurrency and transactional consistency reasons.
I have a use case where I want to create multiple aggregates in a single transaction: a RestaurantManager, a Restaurant, and a Menu. They seem like a single aggregate because their life-cycles begin and end together: it doesn't make sense within the domain to create a RestaurantManager without a Restaurant, or vice versa; the same goes for a Restaurant and a Menu. Further, if the Restaurant or the RestaurantManager is deleted (unregistered), they should all be deleted together.
However, I've split them into separate aggregates because, once created, they are updated separately, maintain their own invariants, and I don't want to load them all into memory just to update one property on the Restaurant, for example.
The only thing that ties them together is their life-cycle.
My question is whether this represents a case where it is okay to go against the "rule" that each transaction should only operate on a single aggregate.
I'd also like to know if I should enforce their shared life-cycle in the domain model by having each aggregate root hold the identifier of the aggregate root it depends on, i.e. by having Restaurant require a MenuId as a constructor parameter, and likewise for Menu and RestaurantId, so that neither can be created without the other. However, this still wouldn't enforce that they should be saved together by the application service anyway, since it could create them all in memory, then only save the Menu, for example.
Your requirement is a pretty normal use case in DDD, IMHO. There are always multiple aggregates working in tandem to support the application, and they are interlinked in their lifecycles. But the modeling concepts still stand true. Let me attempt to explain what your model would look like with the help of a few DDD rules:
Aggregates are transaction boundaries
Aggregates ensure that no business invariants are broken at any point. This means that if you have multiple aggregates strung together as part of one transaction, you have to load all of them into memory for the validation.
This is especially a problem when your application is data-rich and stores data in a database cluster - partitioned and distributed (think Mongo or Elasticsearch). You will have the problem of loading data from potentially different clusters as part of a single transaction.
Aggregates are loaded in entirety
Aggregates and their associated data objects are loaded in entirety into memory. This means that unnecessary objects (say the restaurant's schedule for the upcoming month, for example) for the transaction may be loaded into memory. By itself, this is not a problem. But when multiple aggregates get together, the amount of data loaded into memory needs to be considered.
Aggregates refer to each other by their unique identifiers
This one is straightforward and means that each aggregate stores its referenced aggregates by their identifiers instead of enclosing the other aggregate's data within it.
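In code, this rule might look something like the following minimal sketch (the identifier types are invented for illustration):

```java
// Identity value objects; aggregates point at each other only through these.
record RestaurantId(String value) {}
record MenuId(String value) {}

class Restaurant {
    private final RestaurantId id;
    private final MenuId menuId; // reference by identifier, not a Menu object

    Restaurant(RestaurantId id, MenuId menuId) {
        this.id = id;
        this.menuId = menuId;
    }

    // Anyone who needs the actual Menu loads it separately via its repository.
    MenuId menuId() {
        return menuId;
    }
}
```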
State changes across Aggregates are handled through Domain Events
In cases where you want a state change in one aggregate to have side-effects on other aggregates, you publish a domain event, and a subscriber handles the change on other aggregates in the background. This is how you would want to handle your requirement for cascade deletes.
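A minimal sketch of that event-driven cascade, assuming a simple in-process subscriber (all names here are invented for the example):

```java
// The event records the fact; it carries only the identifier it needs.
record RestaurantUnregistered(String restaurantId) {}

interface MenuRepository {
    void deleteByRestaurantId(String restaurantId);
}

// Runs in the background after the Restaurant transaction has committed,
// deleting the dependent Menu eventually rather than in the same transaction.
class MenuCleanupSubscriber {
    private final MenuRepository menus;

    MenuCleanupSubscriber(MenuRepository menus) {
        this.menus = menus;
    }

    void on(RestaurantUnregistered event) {
        menus.deleteByRestaurantId(event.restaurantId());
    }
}
```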
By following these rules, you are essentially zooming in on a single aggregate at a time and ensuring that the complexity remains low. When you string up multiple aggregates, though it may be clear and understandable on day one, the application eventually tends towards becoming a big ball of mud, as dependencies and invariants start crisscrossing each other.
"only a single aggregate should be modified per transaction"
Contention at creation doesn't matter as much. You can create many ARs in a single transaction without problem because the only other operation that could conflict is another duplicate creation process.
Another reason to avoid involving many ARs in a single transaction is coupling between modules, though you could always keep things loosely coupled using synchronously dispatched domain events.
As for the deletion, it's probably less problematic to make it eventually consistent. Does it really matter that Restaurant is closed while RestaurantManager remains registered for a short period of time?
The fact you are asking this question tells me your system is not distributed? If your system is running with a single DB server and used by a few people, it may be that eventual consistency makes things more complex for scalability you don't actually need.
Start simple and refactor as needed, but crossing AR boundaries is not something that should be done consistently or else your boundaries are clearly wrong.
Furthermore, if you want to communicate that a RestaurantManager can't be spawned from nowhere and associated with an invalid RestaurantId by mistake, you may want to look at your ubiquitous language for guidance.
e.g.
"A RestaurantManager is registered for a given Restaurant": not sure it truly aligns with your UL, but it's just for the sake of the example.
RestaurantManager manager = restaurant.registerManager(...);
This obviously increases coupling and could affect performance, but it aligns well with the UL and makes it more difficult to misuse the model. Also note that with a single DB, you could enforce referential integrity, which takes care of these uninteresting referential constraints.
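A sketch of how that factory-style creation might be fleshed out (the constructor shapes here are assumptions, not the poster's actual code):

```java
class Restaurant {
    private final String id;

    Restaurant(String id) {
        this.id = id;
    }

    // The only way to obtain a manager is through an existing Restaurant,
    // so a RestaurantManager can never hold an invalid RestaurantId.
    RestaurantManager registerManager(String name) {
        return new RestaurantManager(name, this.id);
    }
}

class RestaurantManager {
    private final String name;
    private final String restaurantId;

    // Package-private on purpose: outside callers must go through the factory.
    RestaurantManager(String name, String restaurantId) {
        this.name = name;
        this.restaurantId = restaurantId;
    }
}
```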
As pointed out by @plalx, contention doesn't matter as much when creating aggregates in terms of transactions, since they don't yet exist and so can't be involved in contention.
As for enforcing the mutual life cycle of multiple aggregates in the domain, I've come to think that this is the responsibility of the application layer (i.e. an application service, or use case).
Maybe my thinking is closer to Clean or Hexagonal architecture, but I don't think it's possible or even sensible to try and push every single business rule down into the "domain model". The point of the domain model for me is to partition the problem domain into small chunks (aggregates), which encapsulate common business data/operations that change together, but it's the application layer's responsibility to use these aggregates properly in order to achieve the business' end goal (which is the application as a whole), including mediating operations between the aggregates and controlling their life cycles.
As such, I think this stuff belongs in an application service. That being said, frequently updating multiple aggregates in each use case could be a sign of incorrect domain boundaries.
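A minimal sketch of that application-service responsibility, assuming the aggregate factories (Restaurant.open, Menu.emptyFor), repository interfaces, and a unit-of-work abstraction, none of which come from the question:

```java
interface TransactionRunner {
    void inTransaction(Runnable work); // begin/commit/rollback around the work
}

interface RestaurantRepository { void save(Restaurant restaurant); }
interface MenuRepository { void save(Menu menu); }
interface ManagerRepository { void save(RestaurantManager manager); }

class RegisterRestaurantService {
    private final RestaurantRepository restaurants;
    private final MenuRepository menus;
    private final ManagerRepository managers;
    private final TransactionRunner tx;

    RegisterRestaurantService(RestaurantRepository restaurants, MenuRepository menus,
                              ManagerRepository managers, TransactionRunner tx) {
        this.restaurants = restaurants;
        this.menus = menus;
        this.managers = managers;
        this.tx = tx;
    }

    // The service, not the aggregates, guarantees the shared life-cycle:
    // all three are created and saved together, or not at all.
    void registerRestaurant(String restaurantName, String managerName) {
        tx.inTransaction(() -> {
            Restaurant restaurant = Restaurant.open(restaurantName);
            Menu menu = Menu.emptyFor(restaurant.id());
            RestaurantManager manager = restaurant.registerManager(managerName);
            restaurants.save(restaurant);
            menus.save(menu);
            managers.save(manager);
        });
    }
}
```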

What persistence problems are solved with CQRS?

I've read a few posts relating to this, but I still can't quite grasp how it all works.
Let's say, for example, I was building a site like Stack Overflow, with two pages => one listing all the questions, another where you ask/edit a question. A simple, CRUD-based web application.
If I used CQRS, I would have a separate system for the reads/writes, separate DBs, etc. Great.
Now, my issue comes down to how to update the read state (which is, after all, in a DB of its own).
The flow I assume is something like this:
WebApp => User submits question
WebApp => System raises 'Write' event
WriteSystem => 'Write' event is picked up and saves to 'WriteDb'
WriteSystem => 'UpdateState' event raised
ReadSystem => 'UpdateState' event is picked up
ReadSystem => System updates it's own state ('ReadDb')
WebApp => Index page reads data from 'Read' system
Assuming this is correct, how is this significantly different from a CRUD system reading/writing from the same DB? Putting aside CQRS advantages like separate read/write system scaling, rebuilding state, separation of domain boundaries, etc., what problems are solved from a persistence standpoint? Is lock contention avoided?
I could achieve a similar advantage by using queues to achieve single-threaded saves in a multi-threaded web app, or by simply replicating data between a read DB and a write DB, could I not?
Basically, I'm just trying to understand why, if I was building a CRUD-based web application, I would care about CQRS from a pragmatic standpoint.
Thanks!
Assuming this is correct, how is this significantly different from a CRUD system reading/writing from the same DB? Putting aside CQRS advantages like separate read/write system scaling, rebuilding state, separation of domain boundaries, etc., what problems are solved from a persistence standpoint? Is lock contention avoided?
The problem here is:
"Putting aside CQRS advantages …"
If you take away its advantages, it's a little bit difficult to argue what problems it solves ;-)
The key to understanding CQRS is that you separate reading data from writing data. This way you can optimize each database as needed: your write database is highly normalized, and hence you can easily ensure consistency. Your read database, in contrast, is denormalized, which makes your reads extremely simple and fast: they all effectively become SELECT * FROM ….
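To make the SELECT * FROM … point concrete, here is a rough Java/JDBC sketch of a read side for the Stack Overflow-style example from the question (the table, event shape, and names are invented for illustration):

```java
import java.sql.*;
import java.util.*;

class QuestionListProjection {
    private final Connection readDb;

    QuestionListProjection(Connection readDb) {
        this.readDb = readDb;
    }

    // When the write side records that a question was asked, flatten it into
    // one pre-joined row in the read database, ready to render as-is.
    void onQuestionAsked(String id, String title, String author, int answerCount)
            throws SQLException {
        try (PreparedStatement ps = readDb.prepareStatement(
                "INSERT INTO question_list (id, title, author, answer_count) VALUES (?, ?, ?, ?)")) {
            ps.setString(1, id);
            ps.setString(2, title);
            ps.setString(3, author);
            ps.setInt(4, answerCount);
            ps.executeUpdate();
        }
    }

    // The index page's query needs no joins at all.
    List<String> frontPageTitles() throws SQLException {
        List<String> titles = new ArrayList<>();
        try (Statement st = readDb.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM question_list ORDER BY id DESC")) {
            while (rs.next()) {
                titles.add(rs.getString("title"));
            }
        }
        return titles;
    }
}
```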
Under the assumption that a website such as Stack Overflow is read from far more often than it is written to, this makes a lot of sense, as it allows you to optimize the system for fast responses and a great user experience, without sacrificing consistency at the same time.
Additionally, if combined with event-sourcing, this approach has other benefits, but for CQRS alone, that's it.
Shameless plug: My team and I have created a comprehensive introduction to CQRS, DDD and event-sourcing, maybe this helps to improve understanding as well. See this website for details.
A good starting point would be to review Greg Young's 2010 essay, where he tries to clarify the limited scope of the CQRS pattern.
CQRS is simply the creation of two objects where there was previously only one.... This separation however enables us to do many interesting things architecturally, the largest is that it forces a break of the mental retardation that because the two use the same data they should also use the same data model.
The idea of multiple data models is key, because you can now begin to consider using data models that are fit for purpose, rather than trying to tune a single data model to every case that you need to support.
Once we have the idea that these two objects are logically separate, we can start to consider whether they are physically separate. And that opens up a world of interesting trade offs.
what problems are solved from a persistence standpoint?
The opportunity to choose fit for purpose storage. Instead of supporting all of your use cases in your single read/write persistence store, you pull documents out of the key value store, and run graph queries out of the graph database, and full text search out of the document store, events out of the event stream....
Or not! If the cost-benefit analysis tells you the work won't pay off, you have the option of serving all of your cases from a single store.
It depends on your application's needs.
A good overview and links to more resources here: https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
When to use this pattern:
Use this pattern in the following situations:
Collaborative domains where multiple operations are performed in parallel on the same data. CQRS allows you to define commands with enough granularity to minimize merge conflicts at the domain level (any conflicts that do arise can be merged by the command), even when updating what appears to be the same type of data.
Task-based user interfaces where users are guided through a complex process as a series of steps or with complex domain models. Also useful for teams already familiar with domain-driven design (DDD) techniques. The write model has a full command-processing stack with business logic, input validation, and business validation to ensure that everything is always consistent for each of the aggregates (each cluster of associated objects treated as a unit for data changes) in the write model. The read model has no business logic or validation stack and just returns a DTO for use in a view model. The read model is eventually consistent with the write model.
Scenarios where performance of data reads must be fine-tuned separately from performance of data writes, especially when the read/write ratio is very high, and when horizontal scaling is required. For example, in many systems the number of read operations is many times greater than the number of write operations. To accommodate this, consider scaling out the read model, but running the write model on only one or a few instances. A small number of write model instances also helps to minimize the occurrence of merge conflicts.
Scenarios where one team of developers can focus on the complex domain model that is part of the write model, and another team can focus on the read model and the user interfaces.
Scenarios where the system is expected to evolve over time and might contain multiple versions of the model, or where business rules change regularly.
Integration with other systems, especially in combination with event sourcing, where the temporal failure of one subsystem shouldn't affect the availability of the others.
This pattern isn't recommended in the following situations:
Where the domain or the business rules are simple.
Where a simple CRUD-style user interface and the related data access operations are sufficient.
For implementation across the whole system. There are specific components of an overall data management scenario where CQRS can be useful, but it can add considerable and unnecessary complexity when it isn't required.

Synchronizing Query-side Data in CQRS - won't there still be contention?

I have a general question about the CQRS paradigm in general.
I understand that a CommandBus and EventBus will decouple the domain model from our Query-side datastore, the merits of eventual consistency, and being able to denormalize the storage on the Query side to optimize reads, etc. That all sounds great.
But I wonder as I begin to expand the number of the components on the Query side responsible for updating the Query datastore, if they wouldn't start to contend with one another to perform their updates?
In other words, if we tried to use a pub/sub model for the EventBus, and there were a lot of different subscribers for a particular event type, couldn't they start to contend with one another over updating various bits of denormalized data? Wouldn't this put us in the same boat as we were before CQRS?
As I've heard it explained, it sounds like CQRS is supposed to do away with this contention all together, but is this just an ideal, and in reality we're only really minimizing it? I feel like I could be missing something here, but can't put my finger on it.
It all depends on how you have designed the infrastructure. Strictly speaking, CQRS in itself doesn't say anything about how the Query models are updated. Using events is just one of the options you have. CQRS doesn't say anything about dealing with contention either. It's just an architectural pattern that leaves you with more options and choices to deal with things like concurrency. In "regular" architectures, such as the layered architecture, you often don't have these options at all.
If you have scaled your command processing component out on multiple machines, you can assume that they can produce more events than a single event handling component can handle. That doesn't have to be a bad thing. It may just mean that the Query models will be updated with a slightly bigger delay during peak times. If it is a problem for you, then you should consider scaling out the query models too.
The event handler components themselves will not be contending with each other. They can safely process events in parallel. However, if you design the system to make them all update the same data store, your data store could become the bottleneck. Setting up a cluster or dividing the query model over different data sources altogether could be a solution to your problem.
Be careful not to prematurely optimize, though. Don't scale out until you have the figures to prove that it will help in your specific case. CQRS based architectures allow you to make a lot of choices. All you need to do is make the right choice at the right time.
So far, in the applications I am involved with, I haven't come across situations where the Query model was a bottleneck. Some of these applications produce more than 100 million events per day.

Why not split the data access layer into two?

Everywhere I look, I notice that both Domain-Driven Design (DDD) and entity hydration approaches attempt to populate entities directly from the data layer. I disagree with such approaches. It is not that these approaches do not work, because they do. Instead, I would argue that such approaches give a low level of transparency for testing purposes. I propose that at the data access layer, data is retrieved to populate dictionaries instead of directly populating the entities themselves. There are several reasons for this:
First, there is greater flexibility. A dictionary per result set could be populated. We would decide later which entities could be populated from these result sets.
Second, less knowledge about the data layer is needed to determine where data retrieval is failing. We may still write tests to verify data retrieval without having to understand anything about its associated complex domain entity factories.
There is one so-called disadvantage: performance. Going through two layers is slower than going through one, yes, but the performance gain from going through a single data layer is negligible here. The reason I say this is that both the dictionaries and the entities these dictionaries would populate would be cached. So, if anything, there would be a memory overhead. I think this would be worthwhile to gain the two advantages stated above.
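A sketch of the proposed two-step approach in Java/JDBC terms (the Customer type and column names are invented; only the dictionary-per-row idea comes from the proposal above):

```java
import java.sql.*;
import java.util.*;

class DictionaryDataAccess {
    private final Connection connection;

    DictionaryDataAccess(Connection connection) {
        this.connection = connection;
    }

    // Step 1: retrieval is testable on its own; no entity factories involved.
    // Each row becomes a plain dictionary keyed by column label.
    List<Map<String, Object>> query(String sql) throws SQLException {
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Statement st = connection.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.put(meta.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
        }
        return rows;
    }
}

// Step 2: decide later which entities to build from which result sets.
record Customer(long id, String name) {
    static Customer fromRow(Map<String, Object> row) {
        return new Customer((Long) row.get("id"), (String) row.get("name"));
    }
}
```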
It seems like testing is the issue ("for testing purposes"), so I suggest you use repositories, just like @tschmuck pointed out.
As Ayende points out, they might give you unnecessary lasagna code (i.e. too many layers), but they will give you flexibility. You can implement fakes/test spies yourself, mock and stub 'em, as well as use an in-memory DB such as SQLite, and the dependent class is just as happy.

Are there any good patterns for handling lists of entities

In "DDD" what is the best patterns for handling different versions of your entities, e.g. Entities in a list vs the full object. I would like to avoid the overhead of getting properties I do not need when displaying the entities in a list
Would you have a separate entity type used in lists or just fill up your full entity type partially?
Would you use inheritance?
I understand your urge to create "views" of models in the domain, but would recommend against it. Personally, I use the entire entity inside of the domain, regardless of the situation. The entity is the entity, and anything less or more just does not feel clean. That does not mean that I can't use a reference to the entity to help focus my use of the items in the list, though.
The entity does not cross the domain boundary in my implementation. Instead, I return a type of DTO and have application services that can abstract a view from it. This allows, for example, a presenter to generate the correct view model from a DTO and provide it to the view. I don't know if you are talking about operations in the domain services or in the application services, but there are a couple of things you can do that could be applied to either (or both).
You can do certain things to reduce the performance penalty of working with the entire entity in the domain layers, as well. One thing to look at is implementing some sort of cache-aside implementation (see the sketch below). When an entity is requested, check to see if it is cached. If it is, return the cached version. If it isn't, pull it and then cache it before returning. When the entity is updated, evict it from the cache and do your update. I have purposely created my concrete repository implementations to be cache-aware to facilitate this. One other thing to consider with an approach like this is that it is beneficial to do as many fine-grained operations as possible. While that seems illogical at first, if entities are commonly "gotten" from your data store, it is easy to set up some logging to measure the ratio of cache hits to cache misses.
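A stripped-down sketch of that cache-aside shape (the generic store function stands in for the real database access; none of this is the answerer's actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class CacheAsideRepository<T> {
    private final Map<Long, T> cache = new ConcurrentHashMap<>();
    private final Function<Long, T> store; // loads an entity from the database

    CacheAsideRepository(Function<Long, T> store) {
        this.store = store;
    }

    // Hit: return the cached entity. Miss: load it, cache it, return it.
    T getById(long id) {
        return cache.computeIfAbsent(id, store);
    }

    // On update, evict first, then write through to the backing store.
    void update(long id, T entity) {
        cache.remove(id);
        persistToStore(id, entity);
    }

    private void persistToStore(long id, T entity) {
        // a real repository implementation would write to the database here
    }
}
```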
Coming full circle, to your question... Most lists I deal with are small, so I incur the penalty of loading up the entity in its entirety. Assuming that most use cases will involve the user drilling into one or more of the items, they are pre-cached because of the cache-aside implementation. The number of items is fluid, but I generally apply this approach to anything less than twenty five entities in a list.
For larger lists, I just use IDs. Most likely, the use case here is some sort of search result. Search results are commonly paged, for example, and this does not fit into the above pattern. Instead, I use the larger list of IDs as a sliding window over the entities I am interested in, which I then pass to a GetRangeById() method that all of my repositories have - written purposely to take a list of identifiers and load them one at a time so they are cached. In essence, this takes a larger lightweight list and zeroes in on just the area I am interested in at a given point in time.
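A sketch of that GetRangeById() idea, building on the cache-aside repository sketched above (the method and type names are mine, not the answerer's):

```java
import java.util.ArrayList;
import java.util.List;

class RangeLoader<T> {
    private final CacheAsideRepository<T> repository;

    RangeLoader(CacheAsideRepository<T> repository) {
        this.repository = repository;
    }

    // Slide a window over the lightweight ID list and load only that page,
    // one entity at a time, so each load also primes the cache.
    List<T> getRangeById(List<Long> ids, int offset, int pageSize) {
        List<T> page = new ArrayList<>();
        int end = Math.min(offset + pageSize, ids.size());
        for (int i = offset; i < end; i++) {
            page.add(repository.getById(ids.get(i)));
        }
        return page;
    }
}
```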
With an approach like this, the important thing to realize is that it is highly scalable. It might not baseline as fast as a non-cached approach with small sets of data, but it will perform better with larger sets. There is an implied performance overhead per operation at play here, but it degrades at a slower rate than a standard "load 'em up" pattern does.
You can use the CQRS pattern to separate query processing from command processing, and you can do it even on a single database. In such a case you would map your view models directly to the tables in the database (via NHibernate, for example). Commands (writes) would go through the real domain model and be persisted in the DB. Queries (like "get me a list of entities") would bypass the domain and go straight to the DB. There is no point in querying domain objects when you don't invoke any business logic in them and are just retrieving some data.
You can also extend this solution to full-featured CQRS by having separate stores for the command side and the query side. The query side would be synchronized by means of replication or pub/sub messaging.
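A rough sketch of that query-side shortcut on a single database (the view model and table are invented for illustration):

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// A view model mapped straight to a table row; it is not a domain entity
// and carries no business logic.
record EntitySummary(long id, String name) {}

class EntityListQuery {
    private final Connection db;

    EntityListQuery(Connection db) {
        this.db = db;
    }

    // Reads bypass the domain model entirely and go straight to the database.
    List<EntitySummary> listAll() throws SQLException {
        List<EntitySummary> result = new ArrayList<>();
        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM entity_summary")) {
            while (rs.next()) {
                result.add(new EntitySummary(rs.getLong("id"), rs.getString("name")));
            }
        }
        return result;
    }
}
```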
