DocumentDB - Assign collections to different regions

This doesn't seem possible via the Azure Portal, but perhaps I overlooked something...
DocumentDB supports only one "write region", but 0..N read regions (i.e. I assume this means 1 primary and N replicas, in DB terms). But this seems to apply to the WHOLE database account. I wonder if it's possible to specify that I want some collections to have different primary locations (i.e. each collection would have a different write region)?
If this was possible, I could use DocDB's application-level partitioning to direct my reads and writes to the appropriate Collection. The partitioning scheme I would use would be location-aware (e.g. an obvious scheme would involve a "/region" attribute on the document).

The current version of DocumentDB only allows choosing the write region at the account level.
Having said that, instead of creating multiple collections in a single account, it is possible to achieve this scenario today by using multiple accounts, each hosting one collection with the desired write region configuration.
There are additional patterns you can employ here to achieve multi-region writes for a single logical collection without requiring any location lookup.
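A minimal sketch of the multiple-accounts approach combined with the asker's location-aware routing idea, using the newer .NET SDK (Microsoft.Azure.Cosmos); the endpoints, keys, database/collection names, and the /region routing attribute are assumptions for illustration, not anything DocumentDB prescribes:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public record Item(string id, string region, string payload);

public class RegionRouter
{
    // One account per desired write region; each account hosts a single collection.
    private readonly Dictionary<string, Container> _byRegion = new()
    {
        ["westus"] = new CosmosClient("https://myapp-westus.documents.azure.com:443/", "<west-key>")
            .GetContainer("appdb", "items"),
        ["westeurope"] = new CosmosClient("https://myapp-westeu.documents.azure.com:443/", "<eu-key>")
            .GetContainer("appdb", "items"),
    };

    // Route each write to the account whose write region matches the document's /region attribute.
    public Task WriteAsync(Item doc) => _byRegion[doc.region].CreateItemAsync(doc);
}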

No, you cannot set a write region for an individual collection. The account has one, as you said. That one region allows writes and reads; the others are read-only replicas. You can find more in this documentation article.

How to use Azure Search Service with heterogeneous data sources

I have worked with the Azure Search service previously, where I created an indexer directly on a SQL DB in the Azure Portal.
Now I have a use-case where I want to ingest from multiple data sources, each having a different data schema. Assume these data sources are 3 search APIs of teams X, Y, and Z. All of them take a search term and give back results in their own schema. I want my Azure Search Service to be a proxy for these, so that I have one search API that a user can use to get results from multiple sources, ordered correctly.
How should I go about doing it? I assume that I might have to create a common schema, and whenever a user searches something, I would call these 3 APIs to get results, map them to the common schema, and then index this data into an Azure Search index. Finally, I would call this Azure Search API to give back the results to the caller.
I would appreciate any help! If I can get hold of better documentation for doing this work, that would be great as well.
Your assumption is correct. You can work with 3 different indexes and fire queries against them, or you can try to combine all of them in the same index. The benefit of the second approach is easier ordering/paging, as all the information is stored in the same index.
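A rough sketch of that combined-index approach with the Azure.Search.Documents .NET SDK; the endpoint, key, index name, and the CommonResult shape are assumptions:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Common schema that results from all three team APIs are mapped into.
public record CommonResult(string Id, string Source, string Title, string Body);

public class CombinedSearch
{
    private readonly SearchClient _client = new SearchClient(
        new Uri("https://my-search.search.windows.net"),
        "combined-results",
        new AzureKeyCredential("<admin-key>"));

    // Index mapped results from the X, Y and Z APIs into the single shared index.
    public Task IndexAsync(IEnumerable<CommonResult> mapped) =>
        _client.UploadDocumentsAsync(mapped);

    // One query returns results from all sources, ranked together.
    public async Task<SearchResults<CommonResult>> QueryAsync(string term) =>
        (await _client.SearchAsync<CommonResult>(term)).Value;
}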
It really depends on what you mean by ordered correctly. Should team X be able to see results from teams Y and Z? The only way you can get ranked results like this is to maintain a single index with a common schema containing data from all teams.
One potential pitfall with this approach is conflicts in the schema, for example if one team requires a field to be of a specific datatype or to use a specific analyzer while another team has different requirements. We do this in our indexes, but with some carefully selected common fields and then dedicated fields prefixed according to our own naming convention to avoid conflicts (sketched after this answer).
One thing to consider is the need to reset the index. If you need to add, change or remove fields you will have to delete the index and create it again with a new schema. If you have a common index and team X needs to add a new property, you would need to reset (delete and create) the common index which affects all teams.
So, creating separate indexes per team has its benefits. Each team can have their own schema without risk of conflicts and they can reset their index without affecting the other teams.
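To make the naming-convention idea above concrete, a sketch of a combined index definition with common fields plus team-prefixed dedicated fields (Azure.Search.Documents; the field names and analyzer choice are illustrative assumptions):

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var index = new SearchIndex("combined-results", new SearchField[]
{
    // Carefully selected common fields shared by all teams:
    new SimpleField("id", SearchFieldDataType.String) { IsKey = true },
    new SearchableField("title") { IsSortable = true },
    // Dedicated, prefixed fields where team requirements would conflict:
    new SearchableField("x_description") { AnalyzerName = LexicalAnalyzerName.EnMicrosoft },
    new SimpleField("y_price", SearchFieldDataType.Double) { IsFilterable = true },
    new SearchableField("z_tags", collection: true) { IsFilterable = true },
});

var indexClient = new SearchIndexClient(
    new Uri("https://my-search.search.windows.net"), new AzureKeyCredential("<admin-key>"));
await indexClient.CreateOrUpdateIndexAsync(index);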

In cosmosdb, should I reference other documents using id, resource id, or self link?

I'm working on designing my CosmosDB collections and deciding what I will and won't nest in a single document, etc. There's no way around it, though - there will be scenarios where I need to reference documents from one collection within another.
I see that in CosmosDB there are several ways to identify a document - id, resource id, and self link. It looks like id is enforced to be unique and can either be set by the server or to whatever you want it to be. Next, it looks like resource id is always auto-generated by the server and is guaranteed to be unique as well. Last, it looks like self link is built up from the ids of the database, collection, and document, meaning it'll also be unique. I see three different unique keys, each having its own uses and semantics.
Which one should I use internally when referencing other documents?
What about referencing documents in different collections - would resource id or self link be a more "universal identifier" than just id?
DO use natural keys for id values, if possible.
DO use id for cross-document references.
DO use names for collection/database references.
DO NOT use _rid or _self when you need a reliable long-term reference.
Why not use _rid/_self?
_rid - a system-assigned identity in Cosmos DB's internal storage. Its value is stable as long as the document does not move in storage, but it will change whenever the document is recreated in storage.
_self - a system-assigned identity similar to _rid, but in addition to the document's _rid it includes the resource sub-keys of the Cosmos DB database and collection the document is in. So it is a reference to the document from the account level.
On the upside, _rid/_self have the potential to be slightly more performant, as they are closer to the actual data, though in 99% of situations the difference should be negligible.
On the downside, _rid/_self will change when you move your documents for whatever reason, e.g.:
backup and restore
delete a document and recreate it with exactly the same data
rename a Cosmos DB database/collection (currently achieved by creating a new one and moving the data)
recreate a collection to get some new feature not applicable to existing collections
refactor the collection structure by moving document types (for business/performance or security concerns)
Should this happen, you would be in a world of pain trying to discover and fix all the references from within your data documents. Ouch. That's fragile and cumbersome, assuming you have lots of documents and non-trivial models.
Also, if you look at Microsoft's API clients (e.g., the C# client), the comfortable path nowadays is to work with database/collection names and ids. Don't fight it. You'll just make your code uglier and your own life harder than intended.
Using them for temporary ad-hoc identities is ok though.
Why id?
id is a user-assigned key to a document with a uniqueness guarantee within a partition.
It is optimized for retrieval in the API and perf-wise = faster to develop, better performance.
It can be set to a natural key - human-readable and meaningful business-wise without loading the referenced document = fewer lookups, less confusion, fewer RUs.
It is part of the user data and will never change when you move your documents around = predictable behavior, fewer bad surprises during disaster recovery.
The only caveat is that, as always with user-given identities, you have to plan a bit to be sure the identity range really is unique enough for your needs. Your app can always set stricter uniqueness properties (though they would not be enforced by Cosmos DB) or if you need ultimate uniqueness, then use Guids.
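A small sketch of what an id-based reference looks like in practice (newer .NET SDK, Microsoft.Azure.Cosmos; the model, names, and the choice of customerId as its own partition key are assumptions):

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// An order references a customer document by its user-assigned id.
public record Customer(string id, string name);
public record Order(string id, string customerId, decimal total);

public class CustomerLookup
{
    private readonly Container _customers;
    public CustomerLookup(CosmosClient client) =>
        _customers = client.GetContainer("shop", "customers");

    // Resolving the reference is a cheap point-read by id + partition key.
    public async Task<Customer> ResolveAsync(Order order) =>
        (await _customers.ReadItemAsync<Customer>(
            order.customerId, new PartitionKey(order.customerId))).Resource;
}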
What about containers?
Same arguments apply to containers/databases.
The id is only unique within the document's partition. You can have as many documents with the same id as you want, as long as they have different partition key values.
The _rid is indeed unique and it's the best form of identification for a document. You can achieve the same by using the id and also providing the partition key value if your collection is partitioned.
There are two different ways of reading a document directly, without querying for it.
Using its self link, which looks like dbs/db_resourceid/colls/coll_resourceid/docs/doc_resourceid and uses the _rid values
Using its alternative link, which looks like dbs/db_id/colls/coll_id/docs/doc_id and uses the ids
The safest form of document identification you can use is the one that uses the _rids.
In both of your questions, you should go with the self link.
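Both read forms, sketched with the older DocumentDB .NET SDK (Microsoft.Azure.Documents.Client), which is where self links are surfaced; the account, names, and partition key value are assumptions:

using System;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

var client = new DocumentClient(
    new Uri("https://myaccount.documents.azure.com:443/"), "<key>");
var options = new RequestOptions { PartitionKey = new PartitionKey("mypk") };

// Alternative link, built from user-assigned ids:
Document doc = await client.ReadDocumentAsync(
    UriFactory.CreateDocumentUri("mydb", "mycoll", "mydocid"), options);

// Self link, built from the system-assigned _rid values:
Document same = await client.ReadDocumentAsync(doc.SelfLink, options);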

Cosmos DB with multiple partition keys

We're looking at potentially using a single Cosmos DB collection to hold multiple document types in a multi-tenanted environment, using a tenant ID as the partition key. The path to the tenant id may change in each document type, and I am therefore looking at various ways of exposing the partition key to Cosmos DB to enable correct partitioning/querying.
I have noticed that the Paths property of DocumentCollection.PartitionKey is a collection and was therefore wondering whether it is possible to pass multiple paths during the creation of a document collection and what the behaviour of this might be. Ideally, I would like Cosmos to scan each of these paths and use the first value or aggregate of values as the partition key but cannot find any documentation suggesting that this is indeed the behaviour.
The MSDN documentation for this property is pretty useless and none of the associated documentation seems to answer the question. Does anyone know about or previously used multiple partition key paths in a collection?
To be clear, I'm looking for links to additional documentation about and/or direct experience of the Cosmos DB's behaviour when specifying multiple partition keys in the PartitionKey.Paths collection when creating a DocumentCollection.
This question has also been posted in the Azure Community Support forums.
Thanks, Ian
The best way to do this is to assign a generic partition key like “pk”, then assign this value based on each of your object types. You can, for example, manage this during serialization by having a different property on each class serialize to “pk”.
The reason partition key is an array in DocumentCollection.PartitionKey is to allow us to introduce compound partition keys, where the combination of multiple properties like (“firstName”, “lastName”) forms the partition key. This is a little different from what you need.
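For example, a sketch of the serialization trick with Newtonsoft.Json (class and property names are illustrative):

using Newtonsoft.Json;

// Each document type keeps its own natural property name in code,
// but serializes it to the shared "pk" partition key field.
public class Invoice
{
    [JsonProperty("id")] public string Id { get; set; }
    [JsonProperty("pk")] public string CustomerTenantId { get; set; }
}

public class SupportTicket
{
    [JsonProperty("id")] public string Id { get; set; }
    [JsonProperty("pk")] public string OwningTenantId { get; set; }
}

// Both types can now live in one collection created with partition key path "/pk".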
Further to the above, I ended up adding a partition key property to the document container, as suggested by Aravind, and then used David Fowler's excellent QueryInterceptor nuget package to apply an ExpressionVisitor which translated any equivalence expression on the specific document type's tenant id property into an equivalence expression on the partition key property. This ensured that queries would be performed against only the single, correct partition. Furthermore, I was able to use the ExpressionVisitor as a safety feature, in that it can enforce that all queries provide a filter on tenant id (as, obviously, tenants should never be able to see each other's documents); if none has been specified, then no records are returned (an invalid equivalence expression is added to the partition key property).
This has been tested and seems to be working well.
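The visitor part might look roughly like this (a simplified illustration, not the tested code from the answer above; TenantId and Pk are hypothetical property names of the same type):

using System.Linq.Expressions;

// Redirects comparisons on a document type's TenantId property to the
// generic Pk (partition key) property, so the query stays single-partition.
class TenantIdToPkVisitor : ExpressionVisitor
{
    protected override Expression VisitMember(MemberExpression node)
    {
        if (node.Member.Name == "TenantId" && node.Expression != null)
            return Expression.Property(node.Expression, "Pk");
        return base.VisitMember(node);
    }
}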

sub partitioning or composite partitioning document db

In an MSDN article,
https://azure.microsoft.com/en-in/documentation/articles/documentdb-partition-data/,
there is a line which specifies that "sub-partitioning" or "complex partitioning" can be done. Does this mean:
Can there be sub-partitioning inside a collection?
In a single DocumentDB database, can there be more than one partitioning logic? For example, I will have four collections inside a single DocumentDB database. Can two of them be based on hash and the other two on range?
If either of those answers is YES, then can someone provide me a link that might lead me to an example of the same?
Answers:
There is no explicit method to sub-partition data within a collection. It's common to use a field to represent the type of document, or to have isTypeA: true key-value pairs on each document, but that's a convention that your application adopts. However, you can create multiple databases (default limit 5, but this may be extended upon request) per account, and each can have its own set of collections. I'm using that two-level hierarchy in (temporalize-api). TenantID determines my top-level partitioning (database) using a lookup table plus defaults. This allows me to pull critical or high-value tenants into a less loaded database and leave everyone else in the default. I use a consistent hash on the EntityID for second-level partitioning (collection); see the sketch after these answers.
Sure, there is nothing preventing you from doing that. Pay particular attention to the excellent discussion in the last section (Developing a partitioned application) in the Aravind article you linked to. It includes a checklist of things you'll need to decide upon and implement. The partition resolvers provided for the .NET SDK do not take care of these issues for you.
I haven't yet seen open source examples of what I would consider a complete system including balancing when capacity is added, where to store the partition maps/meta-data, and query fan-out/aggregate optimization. I have a node.js one under way (temporalize-api) and actually in production. I've made decisions about how I'm going to do balancing and query fan-out and those are documented in the comments in that linked file, but I have not implemented all of them. I store the partition meta-data in the "first" collection of the "first" database.
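A sketch of the two-level lookup described in the first answer (database chosen by a tenant lookup table, collection by a stable hash of the EntityID standing in for the consistent hash mentioned; the names, override table, and fixed collection count are assumptions):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class PartitionMap
{
    // Critical/high-value tenants are pulled into a less loaded database.
    private static readonly Dictionary<string, string> Overrides =
        new() { ["big-tenant"] = "premiumdb" };

    public static string DatabaseFor(string tenantId) =>
        Overrides.TryGetValue(tenantId, out var db) ? db : "defaultdb";

    // A stable hash of the EntityID picks one of N collections.
    public static string CollectionFor(string entityId, int collections = 4)
    {
        byte[] hash = MD5.HashData(Encoding.UTF8.GetBytes(entityId));
        uint bucket = BitConverter.ToUInt32(hash, 0) % (uint)collections;
        return $"coll{bucket}";
    }
}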

How to selectively replicate private and shared portions of a CouchDB database?

We're looking into using CouchDB/CouchCocoa to replicate data to our mobile app.
Our system has a large number of users. Part of the database is private to each user -- for example their tasks. These I've been able to replicate without problem using filtered replication.
Here's the catch... The database also includes shared information, only some of which pertains to a given user. How do I selectively replicate that shared information? For example, a user's task might reference specific shared documents. Is there a way to make sure those documents are included in the replication without including all the shared documents?
From the documentation it seems that adding doc_ids to the replication (or adding another replication with those doc ids) might be one solution. Has anyone tried this? Are there other solutions?
EDIT: Given the number of users, it seems impractical to tag each shared document with all the users sharing it, but perhaps that's the only way to do this?
The final solution mostly depends on your document structure, but currently I see two use cases:
As you keep everything within a single database, you probably have some fields that indicate whether a document is shared or private, right? Example:
owner: "Mike"
participants: [] // if nobody is mentioned, the document looks private(?)
So you just need a filter that selects only private documents or only shared ones: by tags, number of participants, references, or similar.
Also, if you need to replicate some documents only for a specific user (e.g. only for Mike), then you need a special view to collect all these documents and, yes, replication by document ids, but this wouldn't be an atomic request: you need some service script to handle these steps. If shared documents are defined by references to them, then the only solution is the same: some service script, a view that generates the document reference tree, and replication by doc._ids.
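Sketched against CouchDB's HTTP _replicate endpoint (the host, database names, design-document filter, and ids are all illustrative):

using System;
using System.Net.Http;
using System.Text;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:5984/") };

// 1) Per-user filtered replication; the filter lives in a design document on
//    the source DB, e.g. function(doc, req) { return doc.owner === req.query.user; }
var filtered = "{ \"source\": \"shared\", \"target\": \"mike_replica\"," +
               "  \"filter\": \"app/by_user\", \"query_params\": { \"user\": \"Mike\" } }";
await http.PostAsync("_replicate",
    new StringContent(filtered, Encoding.UTF8, "application/json"));

// 2) Follow-up replication of just the shared documents a user's tasks reference:
var byIds = "{ \"source\": \"shared\", \"target\": \"mike_replica\"," +
            "  \"doc_ids\": [\"shared-doc-1\", \"shared-doc-42\"] }";
await http.PostAsync("_replicate",
    new StringContent(byIds, Encoding.UTF8, "application/json"));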
Review your architecture. Having a per-user database is a normal use case for CouchDB and follows its model of data partitioning and isolation. So you may create a per-user database that is private to that user only. For shared documents you may create additional databases, playing with the members section of the database security options. Each "shared" database will admit only a certain set of participants, by name or by group, so there couldn't be any data leaks unless there is a CouchDB bug(:
This approach may look weird at first sight, but all you need is a management script that handles database creation and publication; replication becomes as easy as possible, and user data is kept safe.
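For illustration, the corresponding CouchDB HTTP calls (names and credentials are made up; a real management script would also handle users and replications):

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:5984/") };
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
    "Basic", Convert.ToBase64String(Encoding.ASCII.GetBytes("admin:secret")));

// Private per-user database:
await http.PutAsync("userdb-mike", null);

// A shared database restricted to a set of members via its _security object:
await http.PutAsync("shared-project-x", null);
var security = "{ \"members\": { \"names\": [\"mike\", \"anna\"], \"roles\": [\"project-x\"] } }";
await http.PutAsync("shared-project-x/_security",
    new StringContent(security, Encoding.UTF8, "application/json"));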
P.S. I've supposed that the "sharing" operation makes a document visible not to everyone, but to some set of users. If I was wrong and the "shared" state means "public", then option 2 becomes even simpler: N user databases + 1 public one.
