This is how my document looks. I wanted the partition key to be /department/city, which combines two different attributes: department from the employee and city from address (an object embedded in the employee). I tried giving /department/city as the partition key, but Data Explorer doesn't list the partition key; I'm assuming it considers city to be an attribute of department.
{
  "eid": "",
  "entryType": "",
  "address": {
    "PIN": "",
    "city": "",
    "street": ""
  },
  "name": "",
  "id": "",
  "department": "",
  "age": ""
}
Can you please help me understand what I'm doing wrong, and how to design a composite partition key and distribute/store/arrange employee data based on department and city?
If I am not mistaken, composite partition keys are currently not supported; in a collection you must define the partition key using just one attribute.
However, if you look at the REST API, the partition key is defined as an array (albeit one that currently contains only a single element). This suggests that Azure may support composite partition keys in the future.
So, for now, pick one attribute (either department or city) to partition the data, and define an index on the other attribute for faster searching.
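Another option worth sketching is a synthetic partition key: the client concatenates the two attributes into a single property before writing. This is a minimal Python sketch, assuming the employee document shape from the question; the property name "partitionKey" and helper name are my own choices, not part of any SDK.

```python
def make_partition_key(doc):
    """Build a synthetic partition key from department and city.

    Assumes the employee document shape from the question, with
    city nested inside the embedded address object.
    """
    department = doc["department"]
    city = doc["address"]["city"]
    # Use a separator that cannot appear in either value.
    return f"{department}|{city}"

employee = {
    "eid": "e1",
    "department": "Sales",
    "address": {"PIN": "", "city": "Chennai", "street": ""},
}
# Store the combined value in its own property, then create the
# container with /partitionKey as its partition key path.
employee["partitionKey"] = make_partition_key(employee)
```

At query time you must build the same combined value to address a single partition, which is the main cost of this pattern.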
In my CosmosDb multi-partitioned collections I generally specify that the partitionKey should be generic and just use a property that is literally called "partitionKey". The advantage of this is that you have much more fine grained control over how your data is stored by simply specifying a value that makes sense for the particular POCO you are inserting.
There's a high likelihood that somewhere down the line you're going to want to insert a document into this collection that doesn't conform to the structure of the document you showed us here. At this point you'll either have to split your docs across multiple collections or redesign your whole partition key strategy.
The trade-off here is that at query time you need some knowledge of how these values are assigned, but that can easily be enforced in code through a lightweight ORM layer.
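As an illustration of that enforcement layer, here is a minimal sketch, with assumed document shapes and a hypothetical helper name; the point is only that each document type gets its own rule for populating the literal "partitionKey" property.

```python
def with_partition_key(doc, doc_type):
    """Assign a generic partitionKey value based on document type.

    A hypothetical helper illustrating the pattern: every document
    carries a literal "partitionKey" property, and the value rule
    is chosen per document type in one place in the code.
    """
    strategies = {
        # Employees partitioned by department and city together.
        "employee": lambda d: f"{d['department']}|{d['address']['city']}",
        # Departments partitioned by their own name.
        "department": lambda d: d["name"],
    }
    doc = dict(doc)  # don't mutate the caller's copy
    doc["partitionKey"] = strategies[doc_type](doc)
    return doc
```

Centralizing the rules like this keeps the "knowledge of how these values are applied" in one module rather than scattered through the application.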
In Azure Cosmos DB, there is support for unique keys. Each unique key is defined as a set of paths representing values in the stored documents. An example of such a path would be /contact/firstName. It's not clear from the official docs (in fact it's not mentioned at all) how those paths apply down through embedded arrays within the document, or how unique key semantics apply when paths navigate into nested documents with a cardinality of more than one.
For example, let's say I have a document like this to store a user group and a set of member users:
{
  "id": "ABCD1234",
  "name": "Administrators",
  "members": [
    {
      "userId": 1,
      "enabled": true
    },
    {
      "userId": 2,
      "enabled": true
    }
  ]
}
I want the group name to be unique across the logical partition, so I add a unique key with the path /name.
I also want to ensure that the members are unique, i.e. that the same userId value does not occur more than once within a given group. Naïvely, I might try creating a unique key with the paths /name and /members/userId. But that doesn't work; the unique key seems to have no effect.
I've tried a few different variations of this, but none of them had the effect I was expecting.
So my questions:
Is it possible to create unique keys that "traverse" into arrays of embedded objects?
If so, what is the correct path syntax for that?
Given that unique keys mean "unique across the whole logical partition" rather than "unique within the document", what would happen if I actually did manage to define a unique key involving properties of the embedded members objects, and then tried to save two different groups that both have zero members? Would those keys evaluate as null or undefined for both groups, thereby preventing me from saving one of them?
Thankful for any insights to help clear this up!
Unique keys do not traverse into arrays within documents, which is why no such path syntax is documented.
For details on what a logical partition is, please see our docs on partitions.
If you want the kind of uniqueness you are describing, store the members as separate documents within the same logical partition.
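To make that answer concrete, here is one way the per-member documents could be shaped. This is a sketch under my own assumptions (the "type", "groupId" and id-format conventions are illustrative, not from the answer): each membership becomes its own document in the group's logical partition, so a unique key on /userId, or simply a deterministic id, can enforce one-membership-per-user.

```python
def membership_docs(group_id, user_ids):
    """Model each group membership as its own document.

    With groupId as the partition key, all of a group's memberships
    share one logical partition, so a unique key on /userId (or the
    deterministic id below) prevents the same user appearing twice.
    """
    return [
        {
            # Deterministic id: writing the same membership twice
            # becomes a conflict instead of a silent duplicate.
            "id": f"{group_id}:{user_id}",
            "type": "membership",
            "groupId": group_id,  # partition key value
            "userId": user_id,
            "enabled": True,
        }
        for user_id in user_ids
    ]
```

The group document itself then keeps only its own properties (id, name), and membership queries filter on type = "membership" within the partition.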
I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I would also like to separate the related documents into different collections (a decision related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose separate collections: the reasoning mainly comes down to partition keys and read/write priorities. Since the chosen partition key must be present in ALL documents in a collection, it makes sense to separate out documents to which that partition key doesn't belong. After much weighing of options, the partition key I settled on was one that would optimize write speeds and evenly distribute my data across shards - but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one-to-gazillion relationship between metadata and measurements, I chose to reference the metadata from the measurements instead of embedding it. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
  "id": "metadata1",
  ...
}
Measurement document in collection "Measurements":
{
  "id": "measurement1",
  "metadata-id": "../Metadata/metadata1",
  ...
}
Then, when I parse the data in my application/script I know what collection and document to query.
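For what it's worth, parsing such a reference back into its parts is trivial. A minimal sketch, assuming the "../Collection/documentId" convention proposed above (the helper name is hypothetical, not from any SDK):

```python
def parse_reference(ref):
    """Split a '../<Collection>/<DocumentId>' reference into its
    collection name and document id.

    Assumes the path convention proposed in the question; raises if
    the string does not match that shape.
    """
    parts = ref.lstrip("./").split("/")
    if len(parts) != 2:
        raise ValueError(f"not a collection/id reference: {ref!r}")
    collection, doc_id = parts
    return collection, doc_id
```

Keeping the parsing (and its inverse, reference construction) in one helper means the convention can later change, e.g. to underscores or a $-prefix, without touching the rest of the application.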
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores instead of slashes; a symbol to represent a collection, like $Metadata; etc.). Or is my use of relations spanning collections a code smell?
Thank you!
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The trade-off here is that you'll need to ensure in your client application that this property is populated on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use whatever value you chose originally for your Measurements documents and set something different for your Metadata documents.
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind, though, that this type of "relation" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (probably, if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
  "id": "metadata1",
  ...
}
Measurement document in collection "Measurements":
{
  "id": "measurement1",
  "metadata": {
    "id": "metadata1",
    ...
  },
  ...
}
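The duplication step itself is a one-liner in the client. A minimal sketch, assuming the document shapes shown above (the helper name is my own):

```python
def embed_metadata(measurement, metadata):
    """Denormalize: copy the metadata document into the measurement
    before writing it, so a single read returns both.

    Field names follow the example documents; remember that any
    later change to the metadata must also be re-applied to every
    measurement that embeds it.
    """
    doc = dict(measurement)        # don't mutate the caller's copy
    doc["metadata"] = dict(metadata)  # duplicated on purpose
    return doc
```

This is the read-optimized variant: one GET instead of two, paid for with extra writes when metadata changes.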
I am very new to Cosmos DB (DocumentDB API). While going through the documentation I keep reading that DocumentDB is schema-free, but to me a collection seems analogous to a schema: both are logical views.
Wikipedia defines a schema as "the organization of data as a blueprint of how the database is constructed". I believe a collection is the same: it is the organization of documents, stored procedures, triggers and UDFs.
So my question is: how is a schema different from a collection?
Collections really have nothing to do with schema. They are just an organizational construct for documents. With Cosmos DB, they serve as:
a transaction boundary. Within a collection, you can perform multiple queries / updates within a transaction, utilizing stored procedures. These updates are constrained to a single collection (more specifically, to a single partition within a collection).
a billing/performance boundary. Cosmos DB lets you specify the number of Request Units (RU) / second to allocate to a collection. Every collection can have a different RU setting. Every collection has a minimum cost (due to minimum amount of RU that must be allocated), regardless of how much storage you consume.
a server-side code boundary. Stored procedures, triggers, etc. are uploaded to a specific collection.
Whether you choose to create a single collection per object type, or store multiple object types within a single collection, is entirely up to you. And unrelated to the shape of your data.
The schema of a relational database differs somewhat from the schema of a document database. In simple terms, a relational schema is stricter: records in an RDBMS table must strictly adhere to the schema, whereas we have some amount of flexibility when storing a document in a document collection.
Conventionally a collection is a set of documents that follow the same schema, but document DBs don't stop you from storing documents with different schemas in a single collection. That is the flexibility they give their users.
Let us take an example. Let us assume we are storing some customer information.
In relational DB, we might have some structure like
Customer ID INT
Name VARCHAR(50)
Phone VARCHAR(15)
Email VARCHAR(255)
Depending on customer having an email or phone number, they will be recorded as proper values or null values.
ID, Name, Phone, Email
1, John, 83453452, -
2, Victor, -, -
3, Smith, 34535345, smith#jjjj
In document databases, however, attributes that have no value do not need to appear in the document at all.
[
  {
    "id": "123",
    "name": "John",
    "phone": "2572525"
  },
  {
    "id": "456",
    "name": "Stephen"
  },
  {
    "id": "789",
    "name": "King",
    "phone": "2572525",
    "email": "king#asfaf"
  }
]
That said, even though document DBs give you the flexibility to store schema-less documents in a collection, it is advisable to stick to a consistent schema for maintainability.
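Since the database itself will accept any shape, that consistency has to be enforced in application code. A minimal sketch, assuming the customer fields from the example above as a soft schema (the field sets and function name are illustrative):

```python
REQUIRED_FIELDS = {"id", "name"}            # assumed soft schema
OPTIONAL_FIELDS = {"phone", "email"}

def validate_customer(doc):
    """Check a customer document against an application-level schema.

    The database won't reject a malformed document, so the client
    does: required fields must be present, unknown fields rejected.
    """
    missing = REQUIRED_FIELDS - doc.keys()
    unknown = doc.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return True
```

Calling this before every insert keeps the collection's documents to the known shapes while still allowing optional attributes to be omitted.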
We are planning to move from MySql to Cloudant NoSql. I want to understand what would be the best approach to do that.
We have 5 different tables: Product (ProductId primary key), Issues (IssueId primary key, ProductId foreign key), Tags (TagId primary key, ProductId foreign key), Location (LocationId primary key, referenced by a location foreign key in the Product table) and Policy (PolicyId primary key, IssueId foreign key).
Now we have thought of two approaches for modelling the documents in Cloudant.
Keep a separate document for each row, with one document type per table (e.g. document types "product", "issues", "tag", "location", "policy").
Keep one document per product with all relations embedded (all documents of type "product" only, each maintaining its tags, issues [with policies] and location).
Which approach is better?
The answer really depends on the size and rate of growth of your data. In previous SQL->NoSQL migrations, I've used your second approach (I don't know your exact schema, so I'll guess):
{
  "_id": "prod1",
  "name": "My product",
  "tags": [
    "red", "sport", "new"
  ],
  "locations": [
    {
      "location_id": "55",
      "name": "London",
      "latitude": 51.3,
      "longitude": 0.1
    }
  ],
  "issues": [
    {
      "issue_id": "466",
      "policy_id": "88",
      "name": "issue name"
    }
  ]
}
This approach allows you to get almost everything about a product in a single Cloudant API call (GET /products/prod1). Such a call will give you all of your primary product data and what would have been joins in a SQL world - in this case arrays of things or arrays of objects.
You may still want another database of locations or policies because you may want to store extra information about those objects in separate collections, but you can store a sub-set of that data (e.g. the location's name and geo position) in the product document. This does mean duplicating some data from your reference "locations" collection inside each product, but leads to greater efficiency at query time (at the expense of making updates to the data more complicated).
It all depends on how you access the data. For speed and efficiency you want to be able to retrieve the data you need to render a page in as few API calls as possible. If you keep everything in its own database, then you need to do the joins yourself, because Cloudant doesn't have joins. This is inefficient because you need an extra API call for each "join".
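To show what "doing the joins yourself" costs, here is a minimal sketch. The second database is simulated by an in-memory dict; in Cloudant each lookup would be a separate HTTP request, and the document shapes and field names are assumptions for illustration:

```python
def client_side_join(products, locations_by_id):
    """Resolve each product's location references in the client.

    locations_by_id stands in for a second database; in practice
    every lookup here is an extra API call per referenced document.
    """
    joined = []
    for product in products:
        locations = [locations_by_id[lid] for lid in product["location_ids"]]
        joined.append({**product, "locations": locations})
    return joined
```

Compare this with the embedded design above, where GET /products/prod1 already returns the location sub-set and no client-side join is needed.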
There is another way to manage "joins" in Cloudant, which may be suitable if your secondary collections are large, e.g. if the number of locations/tags/issues would make the product document too large.
We have a simple case where we want to take JSON documents directly from an API provider (Github) and store them in a DocumentDB Collection. Unfortunately, the documents happen to have an "id" field which is numeric, and thus causes an error when trying to create the document.
This must be a common scenario, and I found a post which seems to indicate "the worst". However, I'm looking for confirmation. I'm holding out a little hope that I don't have to write custom handling for the ID field, which modifies all documents upon every storage and retrieval operation just to make them compatible with DocumentDB.
https://social.msdn.microsoft.com/Forums/vstudio/en-US/26386227-4aa2-48d5-9cc4-547caef18fb5/id-field-work-around-help-needed?forum=AzureDocumentDB
Like I mentioned in comments, I'm not exactly sure what the primary issue is, but DocumentDB's id property is a string. You'd need to convert your GitHub content's numeric id property to a string before saving in DocumentDB. Alternatively, you may create your own numeric property (other than id) to maintain the numeric data type, for future querying.
You cannot change the data type of id from string to numeric, within the collection itself.
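The string conversion the answer describes is small enough to centralize in one helper. A minimal sketch; the property name "githubId" is an illustrative choice, not an API:

```python
def prepare_for_documentdb(github_doc):
    """Coerce GitHub's numeric id into the string id DocumentDB
    requires, keeping the original number in a separate property
    so numeric/range queries remain possible.
    """
    doc = dict(github_doc)
    doc["githubId"] = doc["id"]   # keep the numeric value for querying
    doc["id"] = str(doc["id"])    # DocumentDB's id must be a string
    return doc
```

Run every inbound GitHub document through this before saving, and reverse the mapping on the way out if the consumer expects the original shape.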
Thanks for the feedback. I think the suggestion in the article I provided is much more attractive than implementing logic to mutate the type of any field that happens to be named "id" and happens not to be a string, just to work with DocumentDB. I'm new to NoSQL in general, but I can see some other benefits to encapsulating documents in DocumentDB in outer objects based on their type name as well. So, if I'm working with a repository, the JSON would be shaped like so:
{
  "repository": {
    "id": "Id from github",
    "foo": "bar"
  },
  "id": "document-db-id"
}
Using this wrapping approach as a convention throughout my application makes sense to me. I simply need to add a "type" parameter throughout my application for DocumentDB operations.
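The wrapping convention can be captured in a pair of helpers. A minimal sketch, assuming the envelope shape shown above (function names are my own):

```python
def wrap(doc_type, doc, db_id):
    """Wrap a source document in a typed envelope.

    db_id is the string id DocumentDB will use; the inner document
    keeps its own id (numeric or otherwise) untouched.
    """
    return {doc_type: doc, "id": db_id}

def unwrap(envelope, doc_type):
    """Recover the original document from its typed envelope."""
    return envelope[doc_type]
```

Every DocumentDB operation then takes the type name as a parameter, and the rest of the application only ever sees the unwrapped documents.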
I can see some downsides as well. For example, it might require a layer between any Entity Frameworks for DocumentDB and SDKs which provide a data model for an API.
If anyone has any experience or advice about wrapping documents in "type" containers for NOSQL of any kind, it would be welcome.