Cosmos DB: How to reference a document in a separate collection using DocumentDB API

I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I would also like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: the reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since the partition key is required to be present in ALL documents in the collection, it makes sense to separate out documents to which the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards, but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata-id" : "../Metadata/metadata1",
...
}
Then, when I parse the data in my application/script I know what collection and document to query.
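As a minimal sketch of the parsing step, assuming the "../<collection>/<id>" convention shown in the example documents (the names DocumentRef and parse_reference are hypothetical, not part of any SDK):

```python
from typing import NamedTuple


class DocumentRef(NamedTuple):
    """A weak reference to a document in another collection."""
    collection: str
    doc_id: str


def parse_reference(ref: str) -> DocumentRef:
    """Split a '../<collection>/<id>' style reference into its parts."""
    parts = ref.split("/")
    # Expect exactly three segments: '..', the collection name, the id.
    if len(parts) != 3 or parts[0] != ".." or not parts[1] or not parts[2]:
        raise ValueError(f"not a '../<collection>/<id>' reference: {ref!r}")
    return DocumentRef(collection=parts[1], doc_id=parts[2])


measurement = {"id": "measurement1", "metadata-id": "../Metadata/metadata1"}
ref = parse_reference(measurement["metadata-id"])
```

The application would then use ref.collection and ref.doc_id to decide which collection to query; the database itself never checks that the target exists.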
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or, is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?

You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed per collection. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use the value of whatever you chose originally for your Measurements documents and set something different for your Metadata documents.
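A rough sketch of that idea, using plain dictionaries rather than any SDK call. The property names type, sensorId and metadataId are assumptions made up for illustration; only partitionKey comes from the suggestion above:

```python
def with_partition_key(doc: dict, value: str) -> dict:
    """Return a copy of doc carrying the generic partition key property."""
    return {**doc, "partitionKey": value}


# Metadata documents get their own partition key value...
metadata = with_partition_key(
    {"id": "metadata1", "type": "metadata"},
    "metadata",
)

# ...while measurements reuse the originally chosen key (here a hypothetical
# sensorId), duplicated into the generic property.
measurement = with_partition_key(
    {
        "id": "measurement1",
        "type": "measurement",
        "sensorId": "sensor-42",
        "metadataId": "metadata1",
    },
    "sensor-42",
)
```

Both document types can then live in one collection whose partition key path points at the generic property.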
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.

What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata" : {
"id": "metadata1",
...
},
...
}
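To make the cost of the duplication concrete, here is a small in-memory sketch (no database calls; embed_metadata and propagate_update are hypothetical helper names) of embedding the metadata and of rewriting every embedded copy when the metadata changes:

```python
import copy


def embed_metadata(measurement: dict, metadata: dict) -> dict:
    """Return a new measurement document with a full copy of the metadata."""
    doc = dict(measurement)
    doc["metadata"] = copy.deepcopy(metadata)
    return doc


def propagate_update(measurements: list, new_metadata: dict) -> int:
    """Overwrite the embedded copy in every measurement that carries this
    metadata id, and return how many documents had to be rewritten."""
    changed = 0
    for doc in measurements:
        if doc.get("metadata", {}).get("id") == new_metadata["id"]:
            doc["metadata"] = copy.deepcopy(new_metadata)
            changed += 1
    return changed
```

The return value of propagate_update is the number of extra writes a single metadata change causes, which is the figure to weigh against the reads saved.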

Related

How to structure relationships in Azure Cosmos DB?

I have two sets of data in the same collection in cosmos, one are 'posts' and the other are 'users', they are linked by the posts users create.
Currently my structure is as follows;
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is its fragile nature: code has to enforce the link, and if there's a bug, data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As said by David, it's a long discussion, but it is a very common one, so since I have an hour or so of "free" time, I'm more than glad to try to answer it, hopefully once and for all.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity), which is something that is needed when you decompose a bigger object into its constituent pieces. This is also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it's able to store the entire document, and then just store the post ALONG WITH all the owner data: name, surname, or any other data you have about the user who created it. Yes, I'm saying that you may want to evaluate not having posts and users, but just posts, with user info inside them. This may actually be very correct, as you will be sure to get the EXACT data for the user as it existed at the moment of post creation. Say, for example, I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
Of course you may also want to display a biography on an author page. In this case you'll have a problem. Which one will you use? Probably the last one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want an author to be able to write their biography and be listed in your system even before writing a blog post.
In such a case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When authors update their own biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let's assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (like NoSQL databases in general) DOES NOT OFFER any kind of native support for enforcing referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check whether there are any blog posts by him/her that would otherwise be orphaned.
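The orphan check described above might look roughly like this in application code. This is an in-memory sketch only: posts_by_owner stands in for an indexed query on ownerId, and both function names are hypothetical:

```python
def posts_by_owner(posts: list, owner_id: str) -> list:
    """Stand-in for an indexed query on the ownerId property."""
    return [p for p in posts if p["ownerId"] == owner_id]


def delete_user(users: dict, posts: list, user_id: str) -> None:
    """Refuse to delete a user whose posts would be left orphaned."""
    remaining = posts_by_owner(posts, user_id)
    if remaining:
        raise ValueError(
            f"user {user_id!r} still owns {len(remaining)} post(s)"
        )
    del users[user_id]
```

Note that nothing in the database stops another client from skipping this check, which is exactly why the responsibility sits with the application.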
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
PERFORMANCE
Performance COULD be an issue, but you don't usually model for performance in the first place. You model in order to make sure your model can represent and store the information you need from the real world, and then you optimize it in order to have decent performance with the database you have chosen to use. As different databases have different constraints, the model will then be adapted to deal with those constraints. This is nothing more and nothing less than the good old "logical" vs "physical" modeling discussion.
In Cosmos DB's case, you should avoid queries that go cross-partition, as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to have clear in your mind which are the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your values will be spread across 5 partitions. Each partition will have a top limit of 2,000 RU. If all your queries use just one partition, your real maximum performance is that 2,000 and not 10,000 RUs.
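The arithmetic behind that limit, as a tiny sketch using the figures from this answer (the partition count is the answer's example, not a value guaranteed by the service):

```python
def per_partition_ru(total_ru: int, partitions: int) -> float:
    """Throughput ceiling of a single partition when RUs are split evenly."""
    return total_ru / partitions


def effective_ru(total_ru: int, partitions: int, partitions_hit: int) -> float:
    """Usable throughput when queries only ever touch some of the partitions."""
    return per_partition_ru(total_ru, partitions) * partitions_hit


single_author = effective_ru(10_000, partitions=5, partitions_hit=1)
many_authors = effective_ru(10_000, partitions=5, partitions_hit=5)
```

With one author everything lands on one partition and only a fifth of the provisioned throughput is reachable; with queries spread over all five partitions the full amount is.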
I really hope this helps you start to figure out the answer. And I really hope this helps to foster and grow a discussion (how to model for a document database) that I think is really due and mature now.

In cosmosdb, should I reference other documents using id, resource id, or self link?

I'm working on designing my CosmosDB collections and deciding what I will and won't nest in a single document, etc. There's no way around it, though - there will be scenarios where I need to reference documents from one collection within another.
I see that in CosmosDB there are several ways to identify a document - id, resource id and self link. It looks like id is enforced to be unique and can either be set by server or to whatever you want it to be. Next, it looks like resource id is always auto generated by the server and is guaranteed to be unique as well. Last, it looks like self link is built up using the id of the database, collection and document, meaning it'll also be unique. I see three different unique keys, all having their own uses and semantics.
Which one should I use internally when referencing other documents?
What about referencing documents in different collections - would resource id or self link be more "universal identifier" than just id?
DO use natural key for id values, if possible.
DO use id for cross-document references.
DO use names for collection/database references.
DO NOT use _rid or _selflink when you need a reliable long-term reference.
Why not use _rid/selflink?
_rid - system-assigned identity in Cosmos DB inner storage. Its value is stable as long as the document does not move in storage, but it will change whenever the document is recreated in storage.
_selflink - system-assigned identity similar to _rid, but in addition to _rid it includes similar resource sub-keys for the Cosmos DB database and collection the document is in. So it is a reference to the document from the account level.
First, most likely _rid/_selflink have the potential to be slightly more performant as they are closer to actual data. Though in 99% of situations it should be negligible.
On the downside, _rid/_selflink will change when you move your documents for whatever reason. E.g.,:
backup and restore
delete document and recreate this with exactly the same data
rename Cosmos DB database/collection (currently achieved by creating new and moving data)
recreate collection to get some new feature not applicable to existing collections
refactor collections structure by moving document types (for business/performance or security concerns)
Should this happen, you would be in a world of pain to discover and fix all references from within your data documents. Ouch. That's fragile and cumbersome assuming you have lots of documents and non-trivial models.
Also, if you look at Microsoft API clients (e.g., the C# client), the comfortable path nowadays is to work with database/collection names and ids. Don't fight it. You'll just make your code uglier and your own life harder than intended.
Using them for temporary ad-hoc identities is ok though.
Why id?
id is a user-assigned key to a document, with a uniqueness guarantee within a partition.
It is optimized for retrieval in API and perf-wise = faster to develop, better performance.
It can be set to a natural key - human-readable and business-wise meaningful without loading the referenced document. = fewer lookups, less confusion, fewer RU/s.
It is part of the user data and will never change when you move your documents around = predictable behavior, fewer bad surprises during disaster recovery.
The only caveat is that, as always with user-given identities, you have to plan a bit to be sure the identity range really is unique enough for your needs. Your app can always set stricter uniqueness properties (though they would not be enforced by Cosmos DB) or if you need ultimate uniqueness, then use Guids.
What about containers?
Same arguments apply to containers/databases.
The id is only unique within the document partition. You could have as many documents with the same id as you like, as long as they have different partition key values.
The _rid is indeed unique and it's the best form of identification for a document. You can achieve the same by using the id and also providing the partition key value if your collection is partitioned.
There are two different types of reading a document directly without querying for it.
Using its self link, which looks like dbs/db_resourceid/colls/coll_resourceid/docs/doc_resourceid and uses the _rid values
Using its alternative link, which looks like dbs/db_id/colls/coll_id/docs/doc_id and uses the id values
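For illustration, the id-based form can be assembled from names alone, whereas the self link cannot be built by hand because the _rid values are assigned by the service. A minimal sketch, assuming the dbs/.../colls/.../docs/... layout used by the REST API (the document segment is spelled docs there); document_link is a hypothetical helper, not an SDK function:

```python
from urllib.parse import quote


def document_link(database_id: str, collection_id: str, document_id: str) -> str:
    """Build an id-based (alternative) link to a document."""
    # Percent-encode each segment so that ids containing spaces or other
    # reserved characters still yield a valid path.
    segments = [quote(s, safe="") for s in (database_id, collection_id, document_id)]
    return "dbs/{}/colls/{}/docs/{}".format(*segments)


link = document_link("mydb", "Metadata", "metadata1")
```

Here mydb is a made-up database name. For a partitioned collection the partition key value still has to be supplied alongside the link when reading the document.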
The safest form of document identification you can use is the one that uses the _rids.
In both of your questions, you should go with the self link.

homogeneous vs heterogeneous in documentdb

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.
I was told by the people I work with that I'm doing it wrong: I should have all three of those things stored in a single collection, with a field to describe what the data type is. They also said that each collection should be related by date or geographic area, so that one part of the world has a smaller portion to search.
and to:
"Combine different types of documents into a single collection and add
a field across all to separate them in searching like a type field or
something"
I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.
There might not be many fields that overlap between the objects (example: Email and Firm objects).
I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?
Any good articles on DocumentDb schema design?
Yes. In order to leverage CosmosDb to its full potential, you need to think of a Collection as an entire database system and not as a "table" designed to hold only one type of object.
Sharding in Cosmos is exceedingly simple. You just specify a field that all of your documents will populate and select that as your partition key. If you just select a generic name such as key or partitionKey, you can easily separate the storage of your inbound emails from users, from anything else, by picking appropriate values.
class InboundEmail
{
public string Key {get; set;} = "EmailsPartition";
// other properties
}
class User
{
public string Key {get; set;} = "UsersPartition";
// other properties
}
What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.
So, in an app that ingests a lot of user data. Keeping a single user's activity together in one partition might make sense for that particular entity.
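One way to picture "more dynamic" partition key values is a function that derives the value from the document itself. This is only a sketch: the type names and the userId and receivedOn fields are invented for illustration, and the right grouping depends on your own query patterns:

```python
def partition_key_for(doc: dict) -> str:
    """Derive a partition key value from a document's type and content."""
    if doc["type"] == "userActivity":
        # Keep a single user's activity together in one partition.
        return f"user-{doc['userId']}"
    if doc["type"] == "inboundEmail":
        # Group emails by month of receipt (the YYYY-MM prefix of an ISO date).
        return f"email-{doc['receivedOn'][:7]}"
    # Fall back to one partition per document type.
    return doc["type"]
```

Queries that know the user or the month can then target a single partition instead of scanning across all of them.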
If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogeneous, as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level, so if you tried putting your entities all in different collections, none of the Graph APIs or queries would work.
EDIT:
I noticed in the comments you made this statement: "And you would have an index on every field in both objects." CosmosDb does automatically index every field of every document. They use a special proprietary path-based indexing mechanism that ensures every path of your JSON tree has indices on it. You have to specifically opt out of this auto-indexing feature.

DocumentDB data structure misunderstanding

I'm starting a new website project and I would like to use DocumentDB as the database instead of a traditional RDBMS.
I will need two kinds of documents to store:
User documents, which will hold all the user data.
Survey documents, which will hold all data about surveys.
May I put both kinds in a single collection, or should I create one collection for each?
How you do this is totally up to you - it's a fairly broad question, and there are good reasons for combining, and good reasons for separating. But objectively, you'll have some specific things to consider:
Each collection has its own cost footprint (starting around $24 per collection).
Each collection has its own performance (RU capacity) and storage limit.
Documents within a collection do not have to be homogeneous - each document can have whatever properties you want. You'll likely want some type of identification property that you can query on, to differentiate document types, should you store them all in a single collection.
Transactions are collection-scoped. So, for example, if you're building server-side stored procedures and need to modify content across your User and Survey documents, you need to keep this in mind.
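A small sketch of the type-discriminator idea from the list above. The property name type is an assumption (any name works as long as every document carries it); of_type filters in memory, and TYPE_QUERY shows the equivalent parameterized SQL text one might send to the collection:

```python
def of_type(documents: list, doc_type: str) -> list:
    """Select the documents of one kind from a mixed collection."""
    return [d for d in documents if d.get("type") == doc_type]


# Equivalent query text, with the type passed as the @type parameter.
TYPE_QUERY = "SELECT * FROM c WHERE c.type = @type"

mixed = [
    {"id": "u1", "type": "user", "name": "Ann"},
    {"id": "s1", "type": "survey", "title": "Feedback"},
    {"id": "u2", "type": "user", "name": "Bob"},
]
users_only = of_type(mixed, "user")
```

Because both kinds of document sit in the same collection, a stored procedure can touch a User and a Survey document in one transaction, subject to the scoping noted above.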

sub partitioning or composite partitioning document db

In one article of msdn,
https://azure.microsoft.com/en-in/documentation/articles/documentdb-partition-data/,
there is a line which specifies that "sub-partitioning" or "complex partitioning" can be done. Does this mean :
There can be sub-partitioning inside a collection?
In a single DocumentDb, can there be more than one partitioning logic? For example, I will have four collections inside a single DocumentDb. Can two of them be based on hash and the other two on range?
If either of those answers is YES, then can someone provide me a link that might lead me to an example of the same?
Answers:
There is no explicit method to sub-partition data within a collection. It's common to use a field to represent the type of document or to have isTypeA: true key value pairs on each document, but that's a convention that your application adopts. However, you can create multiple databases (default limit 5 but may be extended upon request) per account and each can have their own set of collections. I'm using that two-level hierarchy in (temporalize-api). TenantID determines my top-level partitioning (database) using a lookup table plus defaults. This allows me to pull critical or high value tenants into a less loaded database and leave everyone else in the default. I use a consistent hash on the EntityID for second-level partitioning (collection).
Sure, there is nothing preventing you from doing that. Pay particular attention to the excellent discussion in the last section (Developing a partitioned application) in the Aravind article you linked to. It includes a checklist of things you'll need to decide upon and implement. The partition resolvers provided for the .NET SDK do not take care of these issues for you.
I haven't yet seen open source examples of what I would consider a complete system including balancing when capacity is added, where to store the partition maps/meta-data, and query fan-out/aggregate optimization. I have a node.js one under way (temporalize-api) and actually in production. I've made decisions about how I'm going to do balancing and query fan-out and those are documented in the comments in that linked file, but I have not implemented all of them. I store the partition meta-data in the "first" collection of the "first" database.
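The two-level routing described in the first answer (a lookup table plus default for the database, a consistent hash on the entity id for the collection) can be sketched as follows. This is not the temporalize-api implementation; the tenant, database and collection names are placeholders, and the hash ring is a generic textbook version:

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, replicas: int = 64):
        # Each node is placed on the ring 'replicas' times to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        index = bisect.bisect_left(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


# First level: explicit tenant-to-database lookup with a default.
TENANT_DATABASES = {"big-tenant": "db-premium"}
DEFAULT_DATABASE = "db-default"
# Second level: consistent hash of the entity id over the collections.
COLLECTIONS = HashRing(["coll-0", "coll-1", "coll-2"])


def route(tenant_id: str, entity_id: str) -> tuple:
    """Return the (database, collection) pair that should hold an entity."""
    database = TENANT_DATABASES.get(tenant_id, DEFAULT_DATABASE)
    return database, COLLECTIONS.node_for(entity_id)
```

Rebalancing when capacity is added, storing the partition map, and query fan-out are deliberately left out, as they are the open design decisions the answer mentions.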
