CosmosDB: Enforce unique constraints

Is it possible to enforce unique constraints on Azure CosmosDB's graph model? If I'm registering new users and need to ensure only unique email addresses/usernames/etc. are used, how can this be accomplished there?

Depending on your partitioning strategy you can use these values to enforce uniqueness either by using them in the partition key directly or by using them as an id inside of a known partition for your users.

We recently launched unique key support for Cosmos DB. This should work seamlessly for graph collections as well.
https://learn.microsoft.com/en-us/azure/cosmos-db/unique-keys
Just create a graph with the desired unique key path. After that, adding a vertex whose unique key value matches that of an existing vertex should fail.
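As a rough sketch, the container definition for such a graph could look like the one built below. The uniqueKeyPolicy / uniqueKeys / paths shape follows the unique-keys documentation linked above. The container name, the partition key path and the property paths are assumptions made for illustration only.

```python
import json


def container_definition(container_id, partition_key_path, unique_paths):
    """Build a container definition with one unique key per given path."""
    return {
        "id": container_id,
        "partitionKey": {"paths": [partition_key_path], "kind": "Hash"},
        "uniqueKeyPolicy": {
            # Each entry is a separate unique key; values must be unique
            # within a logical partition.
            "uniqueKeys": [{"paths": [path]} for path in unique_paths]
        },
    }


# Hypothetical names: a "users" graph partitioned on /pk,
# with unique email and username values per partition.
definition = container_definition("users", "/pk", ["/email", "/username"])
print(json.dumps(definition, indent=2))
```

The unique key policy has to be part of the definition at creation time, so it is worth deciding on the paths before the graph is created.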

Related

In cosmosdb, should I reference other documents using id, resource id, or self link?

I'm working on designing my CosmosDB collections and deciding what I will and won't nest in a single document, etc. There's no way around it, though - there will be scenarios where I need to reference documents from one collection within another.
I see that in CosmosDB there are several ways to identify a document - id, resource id and self link. It looks like id is enforced to be unique and can either be set by server or to whatever you want it to be. Next, it looks like resource id is always auto generated by the server and is guaranteed to be unique as well. Last, it looks like self link is built up using the id of the database, collection and document, meaning it'll also be unique. I see three different unique keys, all having their own uses and semantics.
Which one should I use internally when referencing other documents?
What about referencing documents in different collections - would resource id or self link be more "universal identifier" than just id?
DO use natural key for id values, if possible.
DO use id for cross-document references.
DO use names for collection/database references.
DO NOT use _rid or _selflink when you need a reliable long-term reference.
Why not use _rid/selflink?
_rid - system-assigned identity in Cosmos DB's inner storage. Its value is stable as long as the document does not move in storage, but it will change whenever the document is recreated in storage.
_selflink - system-assigned identity similar to _rid, but in addition to _rid it includes similar resource sub-keys for the Cosmos DB database and collection the document is in. So it is a reference to the document from the account level.
First, most likely _rid/_selflink have the potential to be slightly more performant as they are closer to actual data. Though in 99% of situations it should be negligible.
On the downside, _rid/_selflink will change when you move your documents for whatever reason, for example:
backup and restore
delete a document and recreate it with exactly the same data
rename Cosmos DB database/collection (currently achieved by creating new and moving data)
recreate collection to get some new feature not applicable to existing collections
refactor collections structure by moving document types (for business/performance or security concerns)
Should this happen, you would be in a world of pain to discover and fix all references from within your data documents. Ouch. That's fragile and cumbersome assuming you have lots of documents and non-trivial models.
Also, if you look at Microsoft API clients (e.g., the C# client), the comfortable path nowadays is to work with database/collection names and ids. Don't fight it. You'll just make your code uglier and your own life harder than intended.
Using them for temporary ad-hoc identities is ok though.
Why id?
id is a user-assigned key to a document with a uniqueness guarantee within a partition.
It is optimized for retrieval in API and perf-wise = faster to develop, better performance.
It can be set to a natural key - human-readable and business-wise meaningful without loading the referenced document. = fewer lookups, less confusion, fewer RU/s.
It is part of the user data and will never change when you move your documents around = predictable behavior, fewer bad surprises during disaster recovery.
The only caveat is that, as always with user-given identities, you have to plan a bit to be sure the identity range really is unique enough for your needs. Your app can always set stricter uniqueness properties (though they would not be enforced by Cosmos DB) or if you need ultimate uniqueness, then use Guids.
What about containers?
Same arguments apply to containers/databases.
The id is only unique within the document's partition. You could have as many documents with the same id as you like, as long as they have different partition key values.
The _rid is indeed unique and it's the best form of identification for a document. You can achieve the same by using the id and also providing the partition key value if your collection is partitioned.
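A minimal sketch of what an id-based reference could look like in practice: the referencing document stores the collection name, the partition key value and the id, and the pair (partition key, id) is what identifies the target. DocumentRef, the in-memory collections dictionary and all sample values are hypothetical and only stand in for real reads against Cosmos DB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRef:
    """Reference to a document by names and ids rather than by _rid."""
    collection: str     # collection name, not its _rid
    partition_key: str  # partition key value of the target document
    doc_id: str         # user-assigned id, unique within that partition

    def to_json(self):
        return {"collection": self.collection,
                "pk": self.partition_key,
                "id": self.doc_id}


# Toy stand-in for stored data: the same id exists in two partitions.
collections = {
    "customers": {
        ("tenant-42", "customer-7"): {"id": "customer-7", "name": "Alice"},
        ("tenant-99", "customer-7"): {"id": "customer-7", "name": "Bob"},
    }
}


def resolve(ref):
    """Look up the referenced document by (partition key, id)."""
    return collections[ref.collection][(ref.partition_key, ref.doc_id)]


ref = DocumentRef("customers", "tenant-42", "customer-7")
order = {"id": "order-1001", "pk": "tenant-42", "customer": ref.to_json()}
print(resolve(ref)["name"])
```

Because the reference carries only user data, it survives a backup and restore or a move to a recreated collection, which is the point made in the first answer.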
There are two ways to read a document directly without querying for it:
Using its self link, which looks like dbs/db_resourceid/colls/coll_resourceid/docs/doc_resourceid and uses the _rid values.
Using its alternative link, which looks like dbs/db_id/colls/coll_id/docs/doc_id and uses the id values.
The safest form of document identification you can use is the one that uses the _rids.
In both of your questions, you should go with the self link.

Define Graph Schema in AWS Neptune to prevent data duplication

When using TinkerPop/JanusGraph I am able to define VertexLabels and Property Keys, which I can then use to create composite indexes. I read somewhere in the Neptune documentation that indexes are not necessary (or supported).
My question is then how do I prevent duplication when loading data into the database? The only examples I found in the AWS documentation involve loading data where a unique ID is already provided for each record, which to me seems like I would need to first extract data from an RDBMS in order to have all the IDs and their relationships before I can load it.
Am I understanding this correctly, if not how could I solve this?
Yes, your understanding is correct. The uniqueness constraint for vertices and edges applies to their ~id property, i.e. IDs are unique.
There are two ways to insert data into Neptune. You can either use the loader interface (recommended) or insert via Gremlin.
Case#1: Insert via bulk loader (recommended)
Inserting via loader only supports CSV format for now and as you observed, it does necessarily require user defined IDs for Vertices and Edges.
Case#2: Insert via Gremlin
For insertion via Gremlin providing IDs is optional. If you do not provide an ID, then Neptune will automatically assign a unique ID to the vertex or the edge.
e.g. g.addV() adds a vertex and assigns a unique identifier to it.
Further regarding case#2, you can add the two vertices and the relationship in the same query. This does not require knowledge of the ID auto-assigned to the vertex by the database.
g.addV().as("node1").property("name","Simba").addV().as("node2").property("name","Mufasa").addE("knows").from("node1").to("node2")
Alternatively, use a unique property identifier to query for nodes from the DB:
g.addV().property("name","Simba");
g.addV().property("name","Mufasa");
g.V().has("name","Simba").as("node1").V().has("name","Mufasa").as("node2").addE("knows").from("node1").to("node2");

Azure Cosmos db Unique Key on collection

I am trying to create a unique key for a whole collection in Cosmos DB.
So not unique per _pk.
I read this article but here it only writes about Unique key per partition: https://learn.microsoft.com/en-us/azure/cosmos-db/unique-keys.
I Googled a lot but I can't find any result about a uk on collection. Is this even possible? And if it is, is there any documentation about it?
I think the official documentation on Cosmos DB unique keys states this clearly.
"I am trying to create a unique key for a whole collection in Cosmos DB."
Unique keys must be defined when the container is created, and the unique key is scoped to the partition key.
"In the same collection it must be possible to store different objects without a username."
Sparse unique keys are not supported. If values for some unique paths are missing, they are treated as a special null value, which takes part in the uniqueness constraint.
If you do want to make the username field unique across the whole collection, across partitions, and still permit null values, I think you need to check uniqueness yourself before inserting documents into Cosmos DB. I suggest using pre-triggers to do the check.
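To make the two documented rules concrete, here is a toy in-memory model of them, not the Cosmos DB API: uniqueness is checked per partition key value, and a missing value counts as null and still takes part in the constraint. ToyContainer, UniqueKeyViolation and the sample documents are made up for the illustration.

```python
class UniqueKeyViolation(Exception):
    """Raised when an insert would break the toy unique key constraint."""


class ToyContainer:
    def __init__(self, partition_key, unique_key):
        self.partition_key = partition_key
        self.unique_key = unique_key
        self._seen = set()
        self.items = []

    def insert(self, doc):
        # A missing unique key value is treated as null (None) and
        # still takes part in the constraint.
        key = (doc[self.partition_key], doc.get(self.unique_key))
        if key in self._seen:
            raise UniqueKeyViolation(key)
        self._seen.add(key)
        self.items.append(doc)


container = ToyContainer(partition_key="pk", unique_key="username")
container.insert({"pk": "A", "username": "ian"})
container.insert({"pk": "B", "username": "ian"})  # allowed: other partition
container.insert({"pk": "A"})                     # allowed: first null in A

rejected = []
for duplicate in ({"pk": "A", "username": "ian"}, {"pk": "A"}):
    try:
        container.insert(duplicate)
    except UniqueKeyViolation:
        rejected.append(duplicate)

print(len(container.items), "stored,", len(rejected), "rejected")
```

The second insert shows why a per-partition unique key cannot give you collection-wide uniqueness, and the last rejected insert shows why only one document per partition can omit the username.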
Hope it helps you.

Cosmos DB with multiple partition keys

We're looking at potentially using a single Cosmos DB collection to hold multiple document types in a multi-tenanted environment using a tenant ID as the partition key. The path to the tenant ID may change in each document type and I am therefore looking at various ways of exposing the partition key to Cosmos DB to enable correct partitioning / querying.
I have noticed that the Paths property of DocumentCollection.PartitionKey is a collection and was therefore wondering whether it is possible to pass multiple paths during the creation of a document collection and what the behaviour of this might be. Ideally, I would like Cosmos to scan each of these paths and use the first value or aggregate of values as the partition key but cannot find any documentation suggesting that this is indeed the behaviour.
The MSDN documentation for this property is pretty useless and none of the associated documentation seems to answer the question. Does anyone know about or previously used multiple partition key paths in a collection?
To be clear, I'm looking for links to additional documentation about and/or direct experience of the Cosmos DB's behaviour when specifying multiple partition keys in the PartitionKey.Paths collection when creating a DocumentCollection.
This question has also been posted in the Azure Community Support forums.
Thanks, Ian
The best way to do this is to assign a generic partition key like “pk”, then assign this value based on each of your object types. You can, for example, manage this during serialization by having a different property of each class serialized to “pk”.
The reason partition key is an array in DocumentCollection.PartitionKey is to allow us to introduce compound partition keys, where the combination of multiple properties like (“firstName”, “lastName”) form the partition key. This is a little different from what you need.
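A minimal sketch of the generic "pk" idea, assuming two hypothetical document types whose tenant id lives under different property names. The class names, field names and the TENANT_FIELD mapping are all invented for the example; the only point is that every serialized document ends up with the same "pk" property.

```python
from dataclasses import dataclass, asdict


@dataclass
class Invoice:
    id: str
    tenant_id: str
    total: float


@dataclass
class UserProfile:
    id: str
    owner_tenant: str
    display_name: str


# Which attribute holds the tenant id for each document type.
TENANT_FIELD = {Invoice: "tenant_id", UserProfile: "owner_tenant"}


def to_document(obj):
    """Serialize an object and copy its tenant id into the generic 'pk'."""
    doc = asdict(obj)
    doc["type"] = type(obj).__name__
    doc["pk"] = doc[TENANT_FIELD[type(obj)]]
    return doc


invoice_doc = to_document(Invoice("inv-1", "tenant-a", 9.5))
profile_doc = to_document(UserProfile("user-1", "tenant-a", "Ian"))
print(invoice_doc["pk"], profile_doc["pk"])
```

With this in place the collection can be created with a single partition key path of /pk, regardless of how many document types it holds.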
Further to the above, I ended up adding a partition key property to the document container as suggested by Aravind and then used David Fowler's excellent QueryInterceptor NuGet package to apply an ExpressionVisitor which translated any equivalence expression relating to the specific document type's tenant id property into an equivalence expression on the partition key property. This ensured that queries would be performed against only the single, correct partition. Furthermore, I was able to use the ExpressionVisitor as a safety feature in that it is able to enforce that all queries provide a filter on tenant id (as, obviously, tenants should never be able to see each other's documents) and if none has been specified then no records are returned (an invalid equivalence expression is added to the partition key property).
This has been tested and seems to be working well.

Primary key in an Azure SQL database

I'm working on a distributed system that uses CQRS and DDD principles. Based on that I decided that the primary keys of my entities should be guids, which are generated by my domain (and not by the database).
I have been reading about guids as primary keys. However, it seems that some of the best practices are not valid anymore if applied to Azure SQL database.
Sequential guids are nice if you use an on-premises SQL Server machine - the sequential guids that are generated will always be unique. However, on Azure, this is not the case anymore. As discussed in this thread, it's not even supported anymore; and generating them is also a bad idea as it becomes a single point of failure and they will not be guaranteed unique anymore across servers. I guess sequential guids don't make sense on Azure, so I should stick to regular guids. Is this correct?
Columns of type Guid are bad candidates for clustering. But this article states that this is not the case on Azure, and this one suggests the opposite! Which one should I believe? Should I just make my primary key a guid and leave it clustered (as it is the default for primary keys); or should I not make it clustered and choose another column for clustering?
Thanks for any insight!
"the sequential guids that are generated will always be unique. However, on Azure, this is not the case anymore."
Have a look at the bottom of this post here - http://blogs.msdn.com/b/sqlazure/archive/2010/05/05/10007304.aspx
The issue with GUIDs (which rely on NEWID()) is that they are randomly distributed, which causes performance issues when a clustered index is applied to them.
What I'd suggest is that you use a GUID for your primary key. Then remove the default clustered index from that column. Apply the clustered index to some other field on your table (e.g. the created date) so that the records will be sequentially/contiguously indexed as they are created. And then apply a non-clustered index to your PK GUID column.
Chances are, that will be fine from a "SELECT * FROM TABLE WHERE Id = ..." point of view for returning single instances.
Similarly, if you're returning lists or ranges of records for display in a list, and you specify the default order by CreatedDate, your clustered index will work for that.
Considering the following:
SQL Azure requires a clustered index to perform replication. Note, the index does not have to be unique. http://blogs.msdn.com/b/sqlazure/archive/2010/05/12/10011257.aspx
The advantage of a clustered index is that range queries on the index are performed optimally with minimum seeks.
The disadvantage of a clustered index is that, if data is added out of sequence, page splits may occur and inserts may be relatively slower.
Referencing the above, I suggest the following:
If you have a real key range you need to query upon, for example a date or a sequential number:
create a (unique/non-unique) clustered index for that key.
create an additional unique index with domain-generated GUIDs.
If no real key range exists, just create the clustered unique index with domain-generated GUIDs. (The overhead of adding a fake, unneeded clustered index would be more of a hindrance than a help.)
