Azure Cosmos DB Update Pattern

I have recently started using Cosmos DB for a project and I am running into a few design issues. Coming from a SQL background, I understand that related data should be nested within documents on a NoSQL DB. This does mean that documents can become quite large though.
Since partial updates are not supported, what is the best design pattern to implement when you want to update a single property on a document?
Should I be reading the entire document server side, updating the value, and writing the document back immediately in order to perform an update? This seems problematic if the documents are large, which they inevitably would be if all your data is nested.
If I take the approach of making many smaller documents and inferring relationships based on IDs, I think this would solve the read/write-immediately concern for updates, but it feels like I am going against the concept of NoSQL and in essence building a relational DB.
Thanks

Locking and latching. That's what needs to happen if partial updates become possible. It's a difficult engineering problem to keep a <15ms write latency SLA with locking.
This seems problematic if the documents are large, which they inevitably would be if all your data is nested.
Define your fear: burnt Request Units, app host memory, ingress/egress network traffic? You believe this is a problem, but you're not stating concrete results. I'm not saying you're wrong or doubting the efficiency of the partial-update approach, I'm just saying the argument is thin.
Usually you want to JOIN nothing in NoSQL, so I'm totally with you on the last paragraph.

Whenever you are trying to create a document try to consider this:
Does that part of the document need separate access? If yes, create a referenced document; if no, create an embedded document.
And if you want to know how to choose, I think you should take a look at this question. It's for MongoDB, but it will help you: Embedded vs Referenced Document

Embed or Reference is the most common problem I face while designing document structure in NoSQL world.
In an embedded relationship, child entities are embedded within the parent document. In a referenced relationship, child entities live in separate documents from their parent, so you end up with two (or more) types of documents.
No one relationship pattern fits all. The approach you should take depends on how the data being designed will be retrieved and updated:
1. Do you need to retrieve all the child entities along with the parent entity? If yes, use embedded relationships.
2. Does your use case allow entities to be retrieved individually? In that case, use the reference pattern.
In the majority of the use cases I have worked on, I used the reference pattern. For example: Social Graph (profiles with a relationship tree), Proximity Points (GeoJSON-based proximity search), Classified Listings, etc.
The reference pattern is also easier to update and maintain, as the entities are stored in individual documents.
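For illustration, a minimal sketch of the two shapes (the customer/order entities and property names are my own, not from the answer):

```typescript
// Embedded: the child entities live inside the parent document.
const customerEmbedded = {
  id: "customer-1",
  name: "Alice",
  orders: [                       // children travel with the parent
    { orderId: "o-1", total: 40 },
    { orderId: "o-2", total: 95 },
  ],
};

// Referenced: children are separate documents that point back to the parent,
// so each order can be read and updated on its own.
const customerReferenced = { id: "customer-1", name: "Alice" };
const order1 = { id: "o-1", customerId: "customer-1", total: 40 };
const order2 = { id: "o-2", customerId: "customer-1", total: 95 };
```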

Partial Updates are now supported by Cosmos DB:
Azure Cosmos DB Partial Document Update feature (also known as Patch API) provides a convenient way to modify a document in a container. Currently, to update a document the client needs to read it, execute Optimistic Concurrency Control checks (if necessary), update the document locally, and then send it over the wire as a whole-document Replace API call.
Partial document update feature improves this experience significantly. The client can send only the modified properties/fields in a document without doing a full document replace operation.
Read more here: https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update
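For example, with the JavaScript/TypeScript SDK (@azure/cosmos), patching a single property looks roughly like this (account, key, database, container, ids, and the partition key value are placeholders):

```typescript
import { CosmosClient, PatchOperation } from "@azure/cosmos";

async function setOrderStatus(): Promise<void> {
  const client = new CosmosClient({
    endpoint: "https://<account>.documents.azure.com", // placeholder
    key: "<key>",                                      // placeholder
  });
  const container = client.database("mydb").container("orders");

  // Send only the changed properties instead of replacing the whole document.
  const operations: PatchOperation[] = [
    { op: "replace", path: "/status", value: "shipped" },
    { op: "incr", path: "/revision", value: 1 },
  ];

  // "customer-42" is the item's partition key value.
  await container.item("order-123", "customer-42").patch(operations);
}
```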

Related

How to design to save a field of an aggregate root in DDD

I learned DDD recently; we used to encapsulate creation, update, and deletion into the repository to persist changes to the DB.
With ORM tools, we can ignore the details of persistence: usually the argument to the repository is an aggregate root object, and the ORM handles the conversion for persistence (for example, it will update only one field if just one field changed).
But without an ORM, if just one field of the aggregate root object changes and needs to be saved to the DB, how should the repository be designed? Should it support a method to save that single field? There is a method called update that saves all properties, but using it would cause a performance issue.
To persist only the changes, you obviously need to know what changed. There are two common ways to achieve this (see the sketch after this list):
Track changes as they occur. This strategy is easier to implement when the entity explicitly participates in the change-tracking mechanism. For instance, with Event Sourcing the Aggregate Root would record uncommitted change event(s) in a collection for all commands it processed.
Dirty checking: compare the new state to the old state. Note that the old state may be cached for performance optimizations.
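A minimal sketch of the first strategy, change tracking (the Customer aggregate, field names, and execute helper are hypothetical, purely for illustration):

```typescript
type ChangeEvent = { field: string; value: unknown };

class Customer {
  private uncommitted: ChangeEvent[] = [];

  constructor(public readonly id: string, private email: string) {}

  changeEmail(newEmail: string): void {
    this.email = newEmail;
    // Record the change so the repository can persist only this field.
    this.uncommitted.push({ field: "email", value: newEmail });
  }

  pullUncommittedChanges(): ChangeEvent[] {
    const changes = this.uncommitted;
    this.uncommitted = [];
    return changes;
  }
}

// The repository turns recorded changes into a partial UPDATE.
class CustomerRepository {
  async save(customer: Customer): Promise<void> {
    const changes = customer.pullUncommittedChanges();
    if (changes.length === 0) return;
    const sets = changes.map((c) => `${c.field} = ?`).join(", ");
    // e.g. "UPDATE customers SET email = ? WHERE id = ?"
    await execute(`UPDATE customers SET ${sets} WHERE id = ?`, [
      ...changes.map((c) => c.value),
      customer.id,
    ]);
  }
}

// Stand-in for whatever DB driver is actually in use.
declare function execute(sql: string, params: unknown[]): Promise<void>;
```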
Generally you need another Repository. How it is implemented is up to you.
You can write the code so it is able to save/update just single fields when they change.
If you want to update single fields as they change, one way to do this is to use an Observer to "observe mutations" in your objects. This approach can have two "operation modes" (sketched after the list):
Ad hoc: when a field gets updated, persist just that field's value right away.
Aggregate update: gather the information of all updated fields (just the fact that they were updated, not the data), then update them all at once when the time comes.
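One lightweight way to "observe mutations" in JavaScript/TypeScript is a Proxy; here is a sketch (the order object and dirty-set wiring are my own illustration):

```typescript
// Wrap an object so every property write is recorded in a dirty set.
function observed<T extends object>(target: T, dirty: Set<string>): T {
  return new Proxy(target, {
    set(obj, prop, value) {
      dirty.add(String(prop));   // remember which field changed
      (obj as any)[prop] = value;
      return true;
    },
  });
}

const dirtyFields = new Set<string>();
const order = observed({ id: "o-1", status: "new", total: 100 }, dirtyFields);

order.status = "paid";           // mutation is observed
// "Aggregate update" mode: flush all dirty fields at once later.
console.log([...dirtyFields]);   // ["status"]
```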
This approach can have other performance implications in a large system. You have to see if it suits you or not.
Another option would be to have your ORM recognize the changed fields at the time of the update via a comparison. This again has its own performance implications since you will have to fetch the DB object (aggregate) once more and compare it against the runtime changes.
How you actually implement any of these heavily depends on the language you're using and its utilities. Performance issues also heavily depend on the language/runtime platform/3rd party software and lots of other things.

Naming convention for design documents in large CouchDB database

I have a very large CouchDB database that I host on Cloudant. One of the early noob mistakes I made was keeping all my views under one design document. When I made a change to the design document by adding a new view, it would compile the design document again and make the database unavailable for a while.
After I talked to Cloudant, they told me it's good practice to have multiple design documents, and after doing some reading, it looks like CouchDB runs one view server per design document.
Now as in true startup fashion, we are constantly adding new features and hence new updates to the database (which is in production). Whenever I want to add a new view, I make a new design document and add the view to it.
With that background, two questions:
Is this the right approach?
What naming scheme should my design documents follow?
You can have a master design document that provides a rewrite to another design document that contains the actual view you want to execute. The master design document shouldn't have any views so you can feel free to update that as often as you need. With this approach, the naming convention is up to you as long as you reference it correctly in the main design document's rewrite rules.
It's certainly not a bad approach. Given that views within a design doc are processed together, more design documents give you greater parallelism when building views (assuming the cluster can handle it). You could also look at using Cloudant Query, which provides an abstraction layer over map/reduce so you don't need to care about your design doc names.
In general, I would advise giving your design documents meaningful names - if you do need to add new views to an existing design doc, you can use this trick.
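As a concrete illustration of the one-design-doc-per-feature idea, a first-time publish might look like this (database URL, doc ids, and view functions are my assumptions; updating an existing design document would also require its current _rev):

```typescript
// Two small per-feature design documents instead of one monolith.
// CouchDB stores view map functions as strings of JavaScript.
const designDocs = [
  {
    _id: "_design/users",
    views: {
      by_email: {
        map: "function (doc) { if (doc.type === 'user') emit(doc.email, null); }",
      },
    },
  },
  {
    _id: "_design/orders",
    views: {
      by_status: {
        map: "function (doc) { if (doc.type === 'order') emit(doc.status, null); }",
      },
    },
  },
];

async function publish(baseUrl: string): Promise<void> {
  for (const doc of designDocs) {
    // Updating _design/orders later rebuilds only the orders views;
    // the users views stay available while that happens.
    await fetch(`${baseUrl}/${doc._id}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(doc),
    });
  }
}
```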

Retrieval of child objects of aggregates in DDD

In DDD, the root of an aggregate is the only reference through which to retrieve its child objects. The repository of an aggregate root is responsible for returning the root object reference only. If I need the child objects, I need to call a getter method on the aggregate to retrieve them, which results in a DB query.
Consider a case where I am retrieving multiple aggregates from the DB. In my case this results in multiple DB queries, which leads to a very slow request. How do I avoid this in terms of DDD? For persisting, I came across a pattern called Unit of Work. Is there any pattern for the search side that resolves my problem, or any other way to do this?
First of all, 95% of problems are solved by your ORM (if you happen to use a relational database).
An aggregate root repository should (in most cases) return a fully loaded object with all child objects (entities). Lazy loading children should be an exception, not a rule.
Another thing is, you should avoid loading and persisting multiple aggregates at a time. Try repartitioning your domain so that each user interaction deals with only one aggregate.
And consider a document database solution. It really makes sense to store whole aggregates as documents in a doc database.
Okay, it seems like you have a scenario where, in a single use case, you want to read from several ARs and also save their state to the DB. Is the read operation taking too long, or is it both the read and the write that take time?
Your domain model and Aggregate Roots should be partly defined through iteration over your use cases. What I'm saying is that the model should be designed so it suits your clients' needs. This scenario seems not to be one that fits your model well.
Reports and other operations that use a large data view should bypass the domain model. Don't use DDD for reports etc.; just do fast data access.
Second, Unit of Work is one way to go if you want all aggregates to participate in a transaction.
Third, I would say: use lazy loading, but for some use cases that need a performance boost you can apply a loading strategy, which means you let the root load some child collections without firing SQL sub-selects...
Look at this article: http://weblogs.asp.net/fredriknormen/archive/2010/07/25/loading-strategy-for-entity-framework-4-0.aspx (even though it's for EF, the pattern works well for the NH ORM).
Then at last, you can always add DB indexes, caching, etc. to boost performance, but given the scenario info, it sounds like you have taken some kind of wrong design decision. I don't have all the facts, but maybe some use cases aren't suitable for the domain model.
I find DDD excellent when it comes to any kind of write operation. For querying data, instead, it only poses unnecessary restrictions.
I would strongly recommend using CQRS as general architecture pattern. This would allow you to create specific Query Models for your Views and leave DDD for input validation and Command execution.
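As a rough sketch of that split (all names and the SQL are illustrative, not from the answer): commands load the full aggregate and invoke behavior, while queries bypass the domain and return flat view models.

```typescript
// Command side: load the aggregate root with children, invoke behavior, save.
async function shipOrder(orderId: string, repo: OrderRepository): Promise<void> {
  const order = await repo.getById(orderId); // fully loaded, children included
  order.ship();                              // business rules live here
  await repo.save(order);
}

// Query side: go straight to the database and return a flat view model.
interface OrderListRow { id: string; customerName: string; status: string }

async function listOrders(db: Db): Promise<OrderListRow[]> {
  return db.query<OrderListRow>(
    `SELECT o.id, c.name AS customerName, o.status
       FROM orders o JOIN customers c ON c.id = o.customer_id`);
}

// Illustrative interfaces standing in for real infrastructure.
interface Order { ship(): void }
interface OrderRepository { getById(id: string): Promise<Order>; save(o: Order): Promise<void> }
interface Db { query<T>(sql: string): Promise<T[]> }
```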

Principles of putting views into the same design document in CouchDB?

When creating views in CouchDB, how do you guys determine which design document to use for newly created views? That is, by what principles do you decide whether two or more views should go into the same design document?
Internally, the following things happen.
When CouchDB needs to update a view with new data, it will update all views in a design document at the same time, as an optimization.
If you change anything inside the design document views space (even changing whitespace or comments in your JavaScript), CouchDB will discard the old index and rebuild the view from scratch.
Every update in a database must pass all validate_doc_update() functions from all design documents in the database.
For these reasons, it's best to consider one design document as one application.
One exception I personally use is a _design/couchdb document which has common views such as showing me all document conflicts.
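For illustration, such a shared design document might look like this (the _design/couchdb id and conflicts view follow the answer; the map function is the standard CouchDB recipe for surfacing conflicts):

```typescript
// A common design document shared across applications in the database.
const commonDesignDoc = {
  _id: "_design/couchdb",
  views: {
    conflicts: {
      // Emits one row per document that has conflicting revisions.
      map: "function (doc) { if (doc._conflicts) emit(doc._conflicts, null); }",
    },
  },
};
```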
I don't have much experience with Couch, but in general it's a good idea to map an application to a design document. So, if you have a database foo accessed by an application bar, you'd have a bar design document inside foo which will contain all the views that bar will need, each named according to what it serves.
The guide contains some information on how to put design documents in the right places.

Are there any good patterns for handling lists of entities

In "DDD" what is the best patterns for handling different versions of your entities, e.g. Entities in a list vs the full object. I would like to avoid the overhead of getting properties I do not need when displaying the entities in a list
Would you have a separate entity type used in lists or just fill up your full entity type partially?
Would you use inheritance?
I understand your urge to create "views" of models in the domain, but would recommend against it. Personally, I use the entire entity inside of the domain, regardless of the situation. The entity is the entity, and anything less or more just does not feel clean. That does not mean that I can't use a reference to the entity to help focus my use of the items in the list, though.
The entity does not cross the domain boundary in my implementation. Instead, I return a type of DTO and have application services that can abstract a view from it. This allows, for example, a presenter to generate the correct view model from a DTO and provide it to the view. I don't know if you are talking about operations in the domain services or in the application services, but there are a couple of things you can do that could be applied to either (or both).
You can do certain things to reduce the performance penalty of working with the entire entity in the domain layers, as well. One thing to look at is some sort of cache-aside implementation. When an entity is requested, check whether it is cached. If it is, return the cached version; if it isn't, pull it and then cache it before returning. When the entity is updated, evict it from the cache and do your update. I have purposely made my concrete repository implementations cache-aware to facilitate this. One other thing to consider with an approach like this is that it is beneficial to keep operations as fine-grained as possible. While that seems illogical at first, if entities are commonly "gotten" from your data store, it is easy to set up some logging to measure the ratio of cache hits to cache misses.
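A bare-bones sketch of that cache-aside repository (the generic shape and the loader/saver callbacks are my assumptions, not the answerer's actual code):

```typescript
class CachedRepository<T> {
  private cache = new Map<string, T>();

  constructor(
    private loadFromDb: (id: string) => Promise<T>,
    private saveToDb: (id: string, entity: T) => Promise<void>,
  ) {}

  async getById(id: string): Promise<T> {
    if (this.cache.has(id)) return this.cache.get(id)!; // cache hit
    const entity = await this.loadFromDb(id);            // cache miss: load...
    this.cache.set(id, entity);                          // ...cache before returning
    return entity;
  }

  async update(id: string, entity: T): Promise<void> {
    this.cache.delete(id); // evict from the cache, then do the update
    await this.saveToDb(id, entity);
  }
}
```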
Coming full circle, to your question... Most lists I deal with are small, so I incur the penalty of loading up the entity in its entirety. Assuming that most use cases will involve the user drilling into one or more of the items, they are pre-cached because of the cache-aside implementation. The number of items is fluid, but I generally apply this approach to anything less than twenty-five entities in a list.
For larger lists, I just use IDs. Most likely, the use case here is some sort of search result. Search results are commonly paged, for example, and this does not fit into the above pattern. Instead, I use the larger list of IDs as a sliding range window of entities I am interested in that I then pass to a GetRangeById() method that all of my repositories have - written to purposely take a list of identifiers and load them one at a time so they are cached. In essence, this will take a larger lightweight list and zero in just on the area I am interested in at a given point in time.
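Sketching that sliding-window idea (getRangeById mirrors the answer's GetRangeById method name; everything else here is illustrative):

```typescript
// Load one page of entities from a large, lightweight list of IDs.
async function loadPage<T>(
  allIds: string[],
  page: number,
  pageSize: number,
  repo: { getRangeById(ids: string[]): Promise<T[]> },
): Promise<T[]> {
  // Slide the window to the page currently being viewed...
  const windowIds = allIds.slice(page * pageSize, (page + 1) * pageSize);
  // ...and materialize (and cache) only those entities.
  return repo.getRangeById(windowIds);
}
```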
With an approach like this, the important thing to realize is that it is highly scalable. It might not baseline as fast as a non-cached approach with small sets of data, but it will perform better with larger sets of data. There is an implied performance overhead per operation at play here, but it degrades at a slower rate than a standard "load 'em up" pattern as well.
You can use the CQRS pattern to separate query processing from command processing, and you can do it even on a single database. In such a case you would map your view models directly to the tables in the database (via NHibernate, for example). Commands (writes) would go through the real domain model and would be persisted in the DB. Queries (like "get me a list of entities") would bypass the domain and go straight to the DB. There is no point in querying domain objects, because you don't actually invoke any business logic in them; you're just retrieving some data.
You can also extend this solution to full-featured CQRS by having separate stores for the command side and the query side. The query side would be synchronized by means of replication or pub/sub messaging.
