Sub-partitioning or composite partitioning in DocumentDB (Azure)

An MSDN article,
https://azure.microsoft.com/en-in/documentation/articles/documentdb-partition-data/,
states that "sub-partitioning" or "complex partitioning" can be done. Does this mean:
There can be sub-partitioning inside a collection?
Can a single DocumentDB database use more than one partitioning scheme? For example, if I have four collections inside a single DocumentDB database, can two of them be partitioned by hash and the other two by range?
If the answer to either question is yes, can someone point me to an example?

Answers:
There is no explicit method to sub-partition data within a collection. It's common to use a field to represent the type of document, or to have key-value pairs such as isTypeA: true on each document, but that's a convention your application adopts. However, you can create multiple databases per account (the default limit is 5, but it may be extended upon request), and each can have its own set of collections. I'm using that two-level hierarchy in (temporalize-api). The TenantID determines my top-level partitioning (database) using a lookup table plus defaults. This allows me to pull critical or high-value tenants into a less loaded database and leave everyone else in the default. I use a consistent hash on the EntityID for second-level partitioning (collection).
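The two-level scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the temporalize-api code: the tenant lookup table, the database and collection ids, and the number of virtual nodes on the hash ring are all assumptions made for the example.

```python
import hashlib
from bisect import bisect

# All names below are hypothetical and exist only for illustration.
TENANT_TO_DATABASE = {"high-value-tenant": "db-dedicated"}  # lookup table
DEFAULT_DATABASE = "db-default"                             # everyone else
COLLECTIONS = ["coll-0", "coll-1", "coll-2", "coll-3"]
VIRTUAL_NODES = 64  # points per collection on the hash ring


def stable_hash(value: str) -> int:
    """Hash that is stable across processes (unlike the built-in hash())."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Consistent-hash ring: each collection owns many points on the ring.
RING = sorted(
    (stable_hash(f"{name}#{i}"), name)
    for name in COLLECTIONS
    for i in range(VIRTUAL_NODES)
)
RING_POINTS = [point for point, _ in RING]


def resolve(tenant_id: str, entity_id: str) -> tuple:
    """Return (database id, collection id) for a document."""
    # Level 1: tenant -> database via lookup table plus default.
    database = TENANT_TO_DATABASE.get(tenant_id, DEFAULT_DATABASE)
    # Level 2: first ring point clockwise from the entity's hash, wrapping around.
    index = bisect(RING_POINTS, stable_hash(entity_id)) % len(RING)
    return database, RING[index][1]
```

A call such as resolve("high-value-tenant", "entity-42") returns the dedicated database together with whichever collection the ring assigns to that entity. The ring is used instead of a plain modulo so that adding a collection later moves only part of the entities.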
Sure, there is nothing preventing you from doing that. Pay particular attention to the excellent discussion in the last section (Developing a partitioned application) in the Aravind article you linked to. It includes a checklist of things you'll need to decide upon and implement. The partition resolvers provided for the .NET SDK do not take care of these issues for you.
I haven't yet seen open source examples of what I would consider a complete system including balancing when capacity is added, where to store the partition maps/meta-data, and query fan-out/aggregate optimization. I have a node.js one under way (temporalize-api) and actually in production. I've made decisions about how I'm going to do balancing and query fan-out and those are documented in the comments in that linked file, but I have not implemented all of them. I store the partition meta-data in the "first" collection of the "first" database.

Related

GAE datastore data model recommendation for nested "same kind" relations

I have followed the Bookshelf App tutorial (in node.js) by Google, and instead of a book catalogue I would like to model a production part catalogue,
where a part consists of "sub"-parts and tasks.
Every "sub"-part can in turn have "sub"-parts and tasks (manufacturing steps).
Current implementation: At the moment I have only two kinds, Parts and Tasks.
Relations between parts are managed via a property on the child part that stores the unique key (parentId) of its parent part. A bigger headache at the moment is, for example, that a price change of a highly nested sub-part requires recursively updating all parent parts...
Question: What would be the recommended datastore design for such an application?
It should solve or be more efficient doing:
If I change the price of a "sub-sub-sub"-part, the price of all parent parts needs to change according to the chosen calculation methodology.
It should not be limited in the depth of sub-parts (I read that the limit on datastore "nested entity values" is 20, but I probably did not understand it correctly).
It should not be limited to 1 write per second per "entity group" (a part and all its sub-parts). I've read about this limit, but I am not sure whether it also applies to so-called transactions (which I think you can do on entity groups).
One potential solution is to avoid storing aggregate prices in Datastore entirely. The "price" on each part or task should include only the cost of that thing itself, not the cost of its sub-parts.
Then calculate the price on the fly when needed by adding up the entire tree of parts/sub-parts/tasks. Store the result in memcache if you want to speed up the calculation (but make sure to delete the memcache key when updating prices).
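As a rough sketch of that idea (in Python rather than node.js, with plain dictionaries standing in for Datastore and memcache), the total can be computed recursively and cached. The part names, the prices, and the simple "sum of own price plus children" rule are assumptions for illustration; the real calculation methodology may differ.

```python
# Hypothetical in-memory stand-ins: PARTS for the Datastore kind,
# PRICE_CACHE for memcache. Each part stores only its OWN price.
PARTS = {
    "bike":  {"own_price": 10.0, "children": ["wheel", "frame"]},
    "wheel": {"own_price": 5.0,  "children": ["spoke"]},
    "spoke": {"own_price": 0.5,  "children": []},
    "frame": {"own_price": 40.0, "children": []},
}
PRICE_CACHE = {}


def total_price(part_id: str) -> float:
    """Own price plus the total price of every sub-part, at any depth."""
    if part_id in PRICE_CACHE:
        return PRICE_CACHE[part_id]
    part = PARTS[part_id]
    total = part["own_price"] + sum(total_price(c) for c in part["children"])
    PRICE_CACHE[part_id] = total
    return total


def update_own_price(part_id: str, new_price: float) -> None:
    """A single write; no parent entity has to be rewritten."""
    PARTS[part_id]["own_price"] = new_price
    # Simplest safe invalidation: drop every cached total, since any
    # ancestor's cached value may include the changed part.
    PRICE_CACHE.clear()
```

Changing the price of "spoke" touches only that one entity; the totals for "wheel" and "bike" are recomputed the next time they are requested.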

In cosmosdb, should I reference other documents using id, resource id, or self link?

I'm working on designing my CosmosDB collections and deciding what I will and won't nest in a single document, etc. There's no way around it, though - there will be scenarios where I need to reference documents from one collection within another.
I see that in CosmosDB there are several ways to identify a document - id, resource id and self link. It looks like id is enforced to be unique and can either be set by server or to whatever you want it to be. Next, it looks like resource id is always auto generated by the server and is guaranteed to be unique as well. Last, it looks like self link is built up using the id of the database, collection and document, meaning it'll also be unique. I see three different unique keys, all having their own uses and semantics.
Which one should I use internally when referencing other documents?
What about referencing documents in different collections - would resource id or self link be more "universal identifier" than just id?
DO use natural key for id values, if possible.
DO use id for cross-document references.
DO use names for collection/database references.
DO NOT use _rid or _selflink when you need a reliable long-term reference.
Why not use _rid/_selflink?
_rid - system-assigned identity in Cosmos DB inner storage. Its value is stable as long as the document does not move in storage, but it will change whenever the document is recreated in storage.
_selflink - system-assigned identity similar to _rid, but in addition to _rid it includes similar resource sub-keys for the Cosmos DB database and collection the document is in. So it is a reference to the document from the account level.
First, most likely _rid/_selflink have the potential to be slightly more performant as they are closer to actual data. Though in 99% of situations it should be negligible.
On the downside, _rid/_selflink will change when you move your documents for whatever reason. For example:
backup and restore
delete a document and recreate it with exactly the same data
rename Cosmos DB database/collection (currently achieved by creating new and moving data)
recreate collection to get some new feature not applicable to existing collections
refactor collections structure by moving document types (for business/performance or security concerns)
Should this happen, you would be in a world of pain to discover and fix all references from within your data documents. Ouch. That's fragile and cumbersome assuming you have lots of documents and non-trivial models.
Also, if you look at Microsoft API clients (e.g., the C# client), the comfortable path nowadays is to work with database/collection names and ids. Don't fight it. You'll just make your code uglier and your own life harder than intended.
Using them for temporary ad-hoc identities is ok though.
Why id?
id is user-assigned key to a document with uniqueness guarantee within a partition.
It is optimized for retrieval in API and perf-wise = faster to develop, better performance.
It can be set to a natural key - human-readable and business-wise meaningful without loading the referenced document. = fewer lookups, less confusion, fewer RU/s.
It is part of the user data and will never change when you move your documents around = predictable behavior, fewer bad surprises during disaster recovery.
The only caveat is that, as always with user-given identities, you have to plan a bit to be sure the identity range really is unique enough for your needs. Your app can always set stricter uniqueness properties (though they would not be enforced by Cosmos DB) or if you need ultimate uniqueness, then use Guids.
What about containers?
Same arguments apply to containers/databases.
The id is only unique within the document's partition. You can have any number of documents with the same id as long as they have different partition key values.
The _rid is indeed unique and it's the best form of identification for a document. You can achieve the same by using the id and also providing the partition key value if your collection is partitioned.
There are two ways to read a document directly without querying for it:
Using its self link, which looks like dbs/db_resourceid/colls/coll_resourceid/docs/doc_resourceid and uses the _rid values
Using its alternative link, which looks like dbs/db_id/colls/coll_id/docs/doc_id and uses the id values
The safest form of document identification you can use is the one that uses the _rids.
In both of your questions, you should go with the self link.
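To make the id-based option from the answers concrete, here is a small Python sketch of a reference that stores names and ids plus the partition key value, and of the name-based (alternative) link built from it. The DocumentRef type and all ids are made up for the example, and the docs path segment is assumed from the Cosmos DB resource path format. The _rid-based self link has the same shape but uses server-generated values that have to be read from the stored document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRef:
    """Hypothetical reference stored inside another document."""
    database_id: str     # user-assigned database name
    collection_id: str   # user-assigned collection name
    document_id: str     # user-assigned document id
    partition_key: str   # needed because id is unique only per partition


def name_based_link(ref: DocumentRef) -> str:
    """Build the id-based (alternative) link for a reference."""
    return (
        f"dbs/{ref.database_id}"
        f"/colls/{ref.collection_id}"
        f"/docs/{ref.document_id}"
    )


ref = DocumentRef("shop", "orders", "order-42", "tenant-1")
link = name_based_link(ref)  # "dbs/shop/colls/orders/docs/order-42"
```

Because every part of this reference is user data, it stays valid after a backup and restore or a collection rebuild, which is the argument the first answer makes for id.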

Cosmos DB with multiple partition keys

We're looking at potentially using a single Cosmos DB collection to hold multiple document types in a multi-tenanted environment, using a tenant ID as the partition key. The path to the tenant ID may differ in each document type, and I am therefore looking at various ways of exposing the partition key to Cosmos DB to enable correct partitioning / querying.
I have noticed that the Paths property of DocumentCollection.PartitionKey is a collection and was therefore wondering whether it is possible to pass multiple paths during the creation of a document collection and what the behaviour of this might be. Ideally, I would like Cosmos to scan each of these paths and use the first value or aggregate of values as the partition key but cannot find any documentation suggesting that this is indeed the behaviour.
The MSDN documentation for this property is pretty useless and none of the associated documentation seems to answer the question. Does anyone know about or previously used multiple partition key paths in a collection?
To be clear, I'm looking for links to additional documentation about and/or direct experience of the Cosmos DB's behaviour when specifying multiple partition keys in the PartitionKey.Paths collection when creating a DocumentCollection.
This question has also been posted in the Azure Community Support forums.
Thanks, Ian
The best way to do this is to define a generic partition key property like “pk”, then assign its value based on each of your object types. You can, for example, manage this during serialization by having a different property of each class serialized to “pk”.
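A minimal sketch of that approach, in Python for brevity (the original context is the .NET SDK). The Invoice and User classes and their fields are hypothetical; the only point is that each type decides which of its own properties is written to the shared pk property at serialization time.

```python
from dataclasses import dataclass, asdict


@dataclass
class Invoice:
    id: str
    tenant_id: str          # tenant id lives at the top level here
    amount: float

    def partition_key(self) -> str:
        return self.tenant_id


@dataclass
class User:
    id: str
    account: dict           # tenant id is nested: account["tenant"]

    def partition_key(self) -> str:
        return self.account["tenant"]


def to_document(entity) -> dict:
    """Serialize an entity and add the generic partition key property."""
    document = asdict(entity)
    document["pk"] = entity.partition_key()
    return document
```

The collection would then be created with /pk as its single partition key path, regardless of where each document type keeps its tenant id.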
The reason partition key is an array in DocumentCollection.PartitionKey is to allow us to introduce compound partition keys, where the combination of multiple properties like (“firstName”, “lastName”) form the partition key. This is a little different from what you need.
Further to the above, I ended up adding a partition key property to the document container as suggested by Aravind, and then used David Fowler's excellent QueryInterceptor NuGet package to apply an ExpressionVisitor that translates any equivalence expression on a specific document type's tenant id property into an equivalence expression on the partition key property. This ensures that queries are performed against only the single, correct partition. Furthermore, I was able to use the ExpressionVisitor as a safety feature: it enforces that all queries provide a filter on tenant id (as, obviously, tenants should never be able to see each other's documents), and if none has been specified then no records are returned (an invalid equivalence expression is added on the partition key property).
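The rewriting rule can be illustrated without LINQ. The following Python sketch is not the ExpressionVisitor itself; it models a query as a dictionary of equality filters and applies the same two rules: a filter on the type's tenant property becomes a filter on pk, and a query without a tenant filter gets a condition that can never match. The field names and the sentinel are assumptions.

```python
# Hypothetical mapping from document type to its tenant id property.
TENANT_FIELD = {"invoice": "tenantId", "user": "ownerTenant"}
PARTITION_KEY = "pk"
NO_MATCH = object()  # sentinel that is equal to no stored value


def rewrite_filters(doc_type: str, filters: dict) -> dict:
    """Translate a tenant-id equality filter into a partition key filter."""
    tenant_field = TENANT_FIELD[doc_type]
    rewritten = {k: v for k, v in filters.items() if k != tenant_field}
    # No tenant filter supplied: force a condition that matches nothing.
    rewritten[PARTITION_KEY] = filters.get(tenant_field, NO_MATCH)
    return rewritten


def run_query(documents: list, doc_type: str, filters: dict) -> list:
    """Evaluate the rewritten equality filters against in-memory documents."""
    effective = rewrite_filters(doc_type, filters)
    return [
        d for d in documents
        if all(d.get(k) == v for k, v in effective.items())
    ]
```

With this rule a caller that forgets the tenant filter receives an empty result rather than another tenant's documents.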
This has been tested and seems to be working well.

Cosmos DB: How to reference a document in a separate collection using DocumentDB API

I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I would also like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: the reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since a given partition key is required to be present in ALL documents in the collection, it makes sense to separate out documents where the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards, but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
    "id": "metadata1",
    ...
}
Measurement document in collection "Measurements":
{
    "id": "measurement1",
    "metadata-id": "../Metadata/metadata1",
    ...
}
Then, when I parse the data in my application/script I know what collection and document to query.
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use the value of whatever you chose originally for your Measurements documents and set something different for your Metadata documents.
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
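For illustration, resolving the path-style reference from the question could look like the sketch below. It is in Python with an in-memory dictionary standing in for the two collections; the unit field is invented, and the "../Collection/id" convention is simply the one proposed in the question. The two lookups correspond to the two round-trips mentioned above.

```python
# Hypothetical in-memory stand-in for the two collections.
STORE = {
    "Metadata": {
        "metadata1": {"id": "metadata1", "unit": "celsius"},  # made-up field
    },
    "Measurements": {
        "measurement1": {
            "id": "measurement1",
            "metadata-id": "../Metadata/metadata1",
        },
    },
}


def parse_reference(reference: str) -> tuple:
    """Split '../<collection>/<document id>' into its two parts."""
    parts = reference.split("/")
    if len(parts) != 3 or parts[0] != ".." or not all(parts[1:]):
        raise ValueError(f"unsupported reference: {reference!r}")
    return parts[1], parts[2]


def load_with_metadata(measurement_id: str) -> tuple:
    """Fetch a measurement, then follow its reference to the metadata."""
    measurement = STORE["Measurements"][measurement_id]       # first read
    collection, doc_id = parse_reference(measurement["metadata-id"])
    metadata = STORE[collection][doc_id]                      # second read
    return measurement, metadata
```

Nothing in the database verifies the reference, so the application has to handle a missing target itself; in this sketch that would surface as a KeyError.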
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so whenever you update the metadata you will have to update it in both collections, but you'll probably write less often than you read (if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
    "id": "metadata1",
    ...
}
Measurement document in collection "Measurements":
{
    "id": "measurement1",
    "metadata": {
        "id": "metadata1",
        ...
    },
    ...
}

DocumentDB - Assign collections to different regions

This doesn't seem possible via the Azure Portal, but perhaps I overlooked something...
DocumentDB supports only one "write region", but 0..N read regions (i.e. I assume this means 1 primary and N replicas are possible in DB terms). But this seems to be applied to the WHOLE Database. I wonder if it's possible to specify that I want some Collections to have different primary locations (i.e. each collection would have a different write region)?
If this was possible, I could use DocDB's application-level partitioning to direct my reads and writes to the appropriate Collection. The partitioning scheme I would use would be location-aware (e.g. an obvious scheme would involve a "/region" attribute on the document).
The current version of DocumentDB only allows choosing write region at an account level.
Having said that, instead of creating multiple collections in a single account, it is possible to achieve this scenario today using multiple accounts, each hosting one collection with the desired write region configuration.
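On the application side, the multi-account approach amounts to a routing table from region to account. A minimal Python sketch follows, assuming the documents carry a region attribute as proposed in the question; the endpoint URLs are placeholders, not real accounts.

```python
# Hypothetical account endpoints, one account per desired write region.
ACCOUNT_BY_REGION = {
    "westus": "https://example-westus.documents.azure.com:443/",
    "northeurope": "https://example-northeurope.documents.azure.com:443/",
}


def account_for(document: dict) -> str:
    """Pick the account whose write region matches the document's region."""
    region = document.get("region")
    try:
        return ACCOUNT_BY_REGION[region]
    except KeyError:
        raise ValueError(f"no account configured for region {region!r}") from None
```

The application would create one client per endpoint and send each write to the client selected by account_for.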
There are additional patterns you can employ here to achieve multi-region writes for a single logical collection without requiring any location lookup.
No, you cannot set a write region for an individual collection. The account has one as you said. That one region allows write and read, the others are read-only replicas. You can find more in this documentation article.

Resources