I'm wondering the best way to design tables in QLDB and whether it's best to perform joins or perhaps have nested documents.
For example, say I have the tables transaction and payment, where a payment must be associated with a transaction. Which of the following options is best?
Nested Document Option (One table)
{
'payment_reference': 'abc123',
'transaction': {
'id': 123,
'name': 'John Doe',
'amount': '$10'
},
'fees': '$2',
'amount_paid': '$12'
}
Two Table Option
Payment Document
{
'payment_reference': 'abc123',
'transaction_id': 123,
'fees': '$2',
'amount_paid': '$12'
}
Transaction Document
{
'id': 123,
'amount': '$10',
'name': 'John Doe',
}
I think Aurgho has answered your question, but I am going to add my general thoughts based on what Aurgho said, which might help others coming to this post with a similar question.
There are multiple factors that can influence your design decision, along with the quotas and limits QLDB imposes. Here are a few pointers that might help you think forward:
Query Pattern: At this point, Amazon QLDB allows indexes to be created only on top-level fields. In the nested document design (Option #1), queries on any field of the nested document won't use an index and will perform scans, which can hurt performance. With Option #2, you can have indexes on both tables and use those indexed fields in your join criteria.
Access pattern: Are you going to have significantly more writes than reads? If your reads are sparse and not extremely sensitive to slightly elevated latency, Option #1 might be better from a data modeling perspective, since all the payment-related information is captured in a single document. On the other hand, if you have a lot more reads and the reads are latency sensitive, you should evaluate your options from the previous point's perspective.
Quotas and Limits: Amazon QLDB has a quota on the document size (currently 128 KB): https://docs.aws.amazon.com/qldb/latest/developerguide/limits.html#limits.fixed. If you plan to add more fields as you go, the per-document size can keep increasing with the nested fields and you might eventually run into the document size limit. There are other quotas too, which can affect your decision depending on your use case.
Generally speaking, if you are not going to query on a field in the nested document and/or your writes >>> reads and/or your reads are not super sensitive to latency and/or your document size will stay within the currently imposed limits, you could go with Option #1. Having all your data in one document can make things easier at the application layer, both when you push the data into QLDB (just one insert) and when you process the documents in your code, but you will have to choose your trade-offs carefully.
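To make the document size pointer concrete, here is a rough sketch of a pre-insert size check. It assumes the 128 KB figure quoted above and uses the compact JSON byte length as a stand-in for the stored size. QLDB stores documents as Amazon Ion, so the real size will differ, and the function names are made up for illustration.

```python
import json

# Assumption: the 128 KB limit quoted above. JSON byte length is only a rough
# proxy for the size QLDB measures, since documents are stored as Amazon Ion.
QLDB_DOC_LIMIT_BYTES = 128 * 1024


def approx_doc_size(doc: dict) -> int:
    """Return the UTF-8 byte length of the document as compact JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def headroom(doc: dict, limit: int = QLDB_DOC_LIMIT_BYTES) -> int:
    """Bytes left before the approximate limit; a negative value means over."""
    return limit - approx_doc_size(doc)


# The nested payment document from the question (Option #1).
payment = {
    "payment_reference": "abc123",
    "transaction": {"id": 123, "name": "John Doe", "amount": "$10"},
    "fees": "$2",
    "amount_paid": "$12",
}

print(approx_doc_size(payment), headroom(payment))
```

Tracking this number as nested fields are added gives an early warning before inserts start failing, though the threshold should be treated as approximate.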
These are just general pointers to help you think forward. You could have other use cases where either of the design options becomes more convincing than the other and you can trade-off certain advantages/disadvantages between the two.
Also, QLDB has some recommendations to optimize your query performance, which can further help you with your decision https://docs.aws.amazon.com/qldb/latest/developerguide/working.optimize.html
If, as in the nested document option, transaction documents are nested inside payment documents, please keep in mind that the document size limit is 128 KB, as mentioned in the QLDB limits documentation. If you can foresee the payment document growing large enough to hit this limit after nesting, this option could be risky.
If you foresee having to index on some of the fields in the transaction documents, you can create two separate tables and perform a join instead. (As noted in the create index reference, QLDB does not allow indexing on nested values of a document and, as mentioned in our limits documentation, Amazon QLDB allows a maximum of 5 indexes per table.)
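As an illustration only, the two-table option can be mimicked in memory: an index on a top-level field lets the join find the matching transaction directly instead of scanning every row. This is plain Python, not QLDB or PartiQL. The lists and helper names are hypothetical stand-ins for the two tables in the question.

```python
# Hypothetical in-memory stand-ins for the two tables in the question.
payments = [
    {"payment_reference": "abc123", "transaction_id": 123,
     "fees": "$2", "amount_paid": "$12"},
]
transactions = [
    {"id": 123, "amount": "$10", "name": "John Doe"},
    {"id": 124, "amount": "$7", "name": "Jane Roe"},
]


def build_index(rows, field):
    """Map each value of a top-level field to the rows that carry it."""
    index = {}
    for row in rows:
        index.setdefault(row[field], []).append(row)
    return index


def join_payments(payment_rows, transactions_by_id):
    """Pair each payment with its transaction through the index, not a scan."""
    joined = []
    for p in payment_rows:
        for t in transactions_by_id.get(p["transaction_id"], []):
            joined.append({**p, "transaction": t})
    return joined


transactions_by_id = build_index(transactions, "id")
result = join_payments(payments, transactions_by_id)
print(result[0]["payment_reference"], result[0]["transaction"]["name"])
```

The same lookup is not possible for a field buried inside a nested document when only top-level fields can be indexed, which is the trade-off described above.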
The above recommendations are only based on the information provided in the post and we are unaware of the current access patterns in this use-case and will require further understanding to be able to answer better.
You can reach out to the team at qldb-outbound AT amazon.com for further consultation regarding your use-case.
Thanks
Related
I have two sets of data in the same collection in cosmos, one are 'posts' and the other are 'users', they are linked by the posts users create.
Currently my structure is as follows;
// user document
{
id: 123,
postIds: ['id1','id2']
}
// post document
{
id: 'id1',
ownerId: 123
}
{
id: 'id2',
ownerId: 123
}
My main issue with this setup is the fungible nature of it, code has to enforce the link and if there's a bug data will very easily be lost with no clear way to recover it.
I'm also concerned about performance: if a user has 10,000 posts, that's 10,000 lookups I'll have to do to resolve all the posts.
Is this the correct method for modelling entity relationships?
As David said, it's a long discussion, but it is a very common one so, since I have an hour or so of "free" time, I'm more than glad to try to answer it, once and for all, hopefully.
WHY NORMALIZE?
First thing I notice in your post: you are looking for some level of referential integrity (https://en.wikipedia.org/wiki/Referential_integrity) which is something that is needed when you decompose a bigger object into its constituent pieces. Also called normalization.
While this is normally done in a relational database, it is now also becoming popular in non-relational databases, since it helps a lot to avoid data duplication, which usually creates more problems than it solves.
https://docs.mongodb.com/manual/core/data-model-design/#normalized-data-models
But do you really need it? Since you have chosen to use a JSON document database, you should leverage the fact that it's able to store the entire document, and just store the document ALONG WITH all the owner data: name, surname, and any other data you have about the user who created the document. Yes, I’m saying that you may want to consider not having posts and users, but just posts, with the user info inside them. This may actually be very correct, as you will be sure to get the EXACT data for the user as it existed at the moment of post creation. Say, for example, I create a post and I have biography "X". I then update my biography to "Y" and create a new post. The two posts will have different author biographies, and this is just right, as they have exactly captured reality.
Of course, you may also want to display a biography on an author page. In this case you'll have a problem. Which one will you use? Probably the last one.
If all authors, in order to exist in your system, MUST have a blog post published, that may well be enough. But maybe you want an author to be able to write a biography and be listed in your system even before writing a blog post.
In that case you need to NORMALIZE the model and create a new document type, just for authors. If this is your case, then you also need to figure out how to handle the situation described before. When authors update their biography, will you just update the author document, or create a new one? If you create a new one, so that you can keep track of all changes, will you also update all the previous posts so that they reference the new document, or not?
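A minimal sketch of that embedding idea, using plain Python dictionaries rather than any Cosmos DB SDK. The field names `title`, `name` and `biography` and the helper `create_post` are assumptions made for illustration.

```python
import copy


def create_post(post_id, title, author):
    """Build a post that embeds a copy of the author as they are right now."""
    return {"id": post_id, "title": title, "author": copy.deepcopy(author)}


author = {"id": 123, "name": "Jane", "biography": "X"}
first = create_post("id1", "First post", author)

author["biography"] = "Y"  # the author edits the biography later
second = create_post("id2", "Second post", author)

print(first["author"]["biography"], second["author"]["biography"])
```

Because each post carries its own copy, the later edit does not touch the first post, which is the "captured reality" behaviour described above.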
As you can see the answer is complex, and REALLY depends on what kind of information you want to capture from the real world.
So, first of all, figure out if you really need to keep posts and users separated.
CONSISTENCY
Let’s assume that you really want to have posts and users kept in separate documents, and thus you normalize your model. In this case, keep in mind that Cosmos DB (and NoSQL databases in general) DOES NOT OFFER any kind of native support to enforce referential integrity, so you are pretty much on your own. Indexes can help, of course, so you may want to index the ownerId property, so that before deleting an author, for example, you can efficiently check whether there are any blog posts written by him/her that would otherwise be orphaned.
Another option is to manually create and keep updated ANOTHER document that, for each author, keeps track of the blog posts he/she has written. With this approach you can just look at this document to understand which blog posts belong to an author. You can try to keep this document automatically updated using triggers, or do it in your application. Just keep in mind that when you normalize in a NoSQL database, keeping data consistent is YOUR responsibility. This is exactly the opposite of a relational database, where your responsibility is to keep data consistent when you de-normalize it.
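As a sketch of the application-side check described above, the following uses in-memory Python structures in place of real Cosmos DB queries. The grouping by `ownerId` stands in for an indexed lookup, and `delete_author` is a hypothetical helper, not an SDK call.

```python
from collections import defaultdict


def index_by_owner(posts):
    """Group post ids by ownerId, standing in for an index on that property."""
    by_owner = defaultdict(list)
    for post in posts:
        by_owner[post["ownerId"]].append(post["id"])
    return by_owner


def delete_author(author_id, authors, posts):
    """Delete the author only if no post would be left orphaned."""
    remaining = index_by_owner(posts).get(author_id, [])
    if remaining:
        raise ValueError(f"author {author_id} still owns posts: {remaining}")
    authors.pop(author_id, None)


authors = {123: {"id": 123}, 456: {"id": 456}}
posts = [{"id": "id1", "ownerId": 123}, {"id": "id2", "ownerId": 123}]

delete_author(456, authors, posts)  # fine: author 456 owns no posts
print(sorted(authors))
```

Calling `delete_author(123, authors, posts)` raises instead of deleting, which is the kind of rule the database will not enforce for you.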
PERFORMANCE
Performance COULD be an issue, but you don't usually model for performance in the first place. You model to make sure your model can represent and store the information you need from the real world, and then you optimize it to get decent performance with the database you have chosen to use. As different databases have different constraints, the model is then adapted to deal with those constraints. This is nothing more and nothing less than the good old “logical” vs “physical” modeling discussion.
In Cosmos DB case, you should not have queries that go cross-partition as they are more expensive.
Unfortunately, partitioning is something you choose once and for all, so you really need to be clear about the most common use cases you want to support best. If the majority of your queries are done on a per-author basis, I would partition per author.
Now, while this may seem a clever choice, it will be only if you have A LOT of authors. If you have only one, for example, all data and queries will go into just one partition, limiting your performance A LOT. Remember, in fact, that Cosmos DB RUs are split among all the available partitions: with 10,000 RU, for example, you usually get 5 partitions, which means that all your data will be spread across 5 partitions. Each partition will have a top limit of 2,000 RU (10,000 / 5). If all your queries use just one partition, your real maximum performance is that 2,000 RU and not 10,000.
I really hope this helps you start to figure out the answer. And I really hope it helps to foster and grow a discussion (how to model for a document database) that I think is really due and mature now.
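The arithmetic in that example can be written down as a small sketch. It assumes, as the paragraph does, that provisioned RUs are split evenly across partitions. The function names are made up.

```python
def per_partition_ru(total_ru: int, partitions: int) -> float:
    """Throughput ceiling of one partition when RUs are split evenly."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return total_ru / partitions


def effective_ru(total_ru: int, partitions: int, partitions_hit: int) -> float:
    """Ceiling actually reachable when queries touch only some partitions."""
    hit = min(partitions_hit, partitions)
    return per_partition_ru(total_ru, partitions) * hit


print(per_partition_ru(10_000, 5))  # 2000.0
print(effective_ru(10_000, 5, 1))   # 2000.0
```

With a single author, `partitions_hit` stays at 1 no matter how many RUs are provisioned.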
I have to design a schema in such a way that I can store user id and their order which can be multiple products like bread, butter plus in addition to that I want to store the quantity of product ordered, please guide.
It is difficult to provide you with a real solution to your problem, as designing a NoSQL DB structure depends on how you want to access your data. You can keep orders as nested/embedded documents in the User model or store them in a separate collection. In the first case, you will get all the data in one request, but you will not be able to query for only the orders that match certain criteria: you will get all orders, including those that do not match, and then you would need to filter them out. Or you could use aggregation to get exactly what you need.
However, there is a limitation to keep in mind. A MongoDB document has a size limit of 16 megabytes. Since users may have very many orders, you may well reach the document size limit for some users. Aggregation also has a limitation: pipeline stages have a limit of 100 megabytes of RAM, but you can override it.
Having orders in a separate collection would require you to separately load them for users. While it is one more request, it will give you more flexibility in terms of how you query them.
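To make the two shapes concrete, here is a sketch using plain Python dictionaries rather than a MongoDB driver. All field names (`orders`, `items`, `product`, `quantity`, `user_id`) are assumptions chosen to cover the user id, products and quantities from the question.

```python
# Option 1: orders embedded in the user document (one read returns everything).
user_embedded = {
    "_id": "user1",
    "orders": [
        {"order_id": "o1", "items": [{"product": "bread", "quantity": 2},
                                     {"product": "butter", "quantity": 1}]},
        {"order_id": "o2", "items": [{"product": "bread", "quantity": 1}]},
    ],
}

# Option 2: each order is its own document that references the user.
orders_collection = [
    {"_id": "o1", "user_id": "user1",
     "items": [{"product": "bread", "quantity": 2},
               {"product": "butter", "quantity": 1}]},
    {"_id": "o2", "user_id": "user1",
     "items": [{"product": "bread", "quantity": 1}]},
]


def orders_containing(orders, product):
    """Client-side filter: keep only the orders that include the product."""
    return [o for o in orders
            if any(item["product"] == product for item in o["items"])]


# With embedding, every order is loaded first and filtered afterwards.
butter_orders = orders_containing(user_embedded["orders"], "butter")
print([o["order_id"] for o in butter_orders])
```

With the separate collection, the same criteria could be expressed in the query itself, so only matching order documents would be returned.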
Then, of course, create/update operations are also done differently for both cases.
My advice would be that you carefully design your application first - what data you need and where you will show it, how you create/update it. It will give you a better idea and chances are that relational DB will be a better choice for what you need (though absolutely not necessary).
I am new to Azure Cosmos DB using the DocumentDB API. I plan to model my data so that one document references another document. This is pretty straightforward, as described in Modeling document data. However, I would also like to separate the related documents into different collections (this decision is related to how the data are partitioned).
Edit 7/24/2017: In response to a comment wondering why I chose to use separate collections: The reasoning for separate collections mainly comes down to partition keys and read/write priorities. Since a certain partition key is required to be present in ALL documents in the collection, it makes sense to separate out documents to which the chosen partition key doesn't belong. After much weighing of options, the partition key that I settled on was one that would optimize write speeds and evenly distribute my data across shards, but unfortunately it did not logically belong in my "Metadata" documents. Since there is a one to gazillion relationship between metadata and measurements, I chose to use a reference to the metadata in the measurements instead of embedding. And because metadata would rarely (or never) be appended to each measurement, I considered the expense of an additional round-trip to the DB a very low concern.
Since the reference is a "weak link" that is not verified by the database, is it possible and wise to store additional information, such as the collection name? That is, instead of having just a string id, we may use a kind of path?
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata-id" : "../Metadata/metadata1",
...
}
Then, when I parse the data in my application/script I know what collection and document to query.
Finally, I assume there are other/better ways to go about this and I welcome your suggestions (e.g. underscores, not slashes; use a symbol to represent a collection, like $Metadata; etc). Or, is my use of relations spanning collections a code smell?
Thank you!
Edit: To the downvoter, can you please explain your reasoning? Is my question uninformed, unclear, or not useful? Why?
You're thinking about this the wrong way and incurring significantly more cost for an "optimization" that isn't necessary, since you're billed at the per-collection level. What you should be doing is picking a more generic partition key, something like key or partitionKey. The tradeoff here is that you'll need to ensure in your client application that you populate this property on all of your documents (it may lead to a duplicated value, but ultimately that's okay). Then you can continue to use the value of whatever you chose originally for your Measurements documents and set something different for your Metadata documents.
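A minimal sketch of populating such a generic property in the client application. The `type` and `deviceId` fields and the `with_partition_key` helper are hypothetical; only the idea of copying a per-type value into one shared `partitionKey` property comes from the text above.

```python
def with_partition_key(doc: dict, source_field: str) -> dict:
    """Return a copy of doc with a generic 'partitionKey' taken from source_field."""
    if source_field not in doc:
        raise KeyError(f"{source_field!r} missing from document {doc.get('id')!r}")
    return {**doc, "partitionKey": str(doc[source_field])}


# Measurements keep using the originally chosen value (assumed here: deviceId).
measurement = with_partition_key(
    {"id": "measurement1", "type": "measurement", "deviceId": "dev-42"},
    "deviceId",
)

# Metadata documents use something different (assumed here: their own id).
metadata = with_partition_key({"id": "metadata1", "type": "metadata"}, "id")

print(measurement["partitionKey"], metadata["partitionKey"])
```

The value is duplicated inside each document, which is the accepted cost of keeping heterogeneous documents in one collection.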
I've written about this extensively in some other answers here and I believe it's one of the biggest misunderstandings about using Cosmos effectively and at scale. It doesn't help that in many Cosmos examples they talk about picking a partitionKey like deviceId or postal code which implies that you're dealing with homogeneous documents.
Please refer to this question that I answered regarding homogeneous vs heterogeneous in documentdb. The biggest argument for this pattern is the new addition of Graph APIs in Cosmos which necessitate having many different types of entities in a single collection and supports exactly the use case you're describing minus the extra collections. Obviously when dealing with heterogeneous types there isn't going to be a single property present on all documents that is appropriate for a partition key which is why you need to go generic.
What you're trying to do is feasible. The convention you use is not particularly important, as long as you can figure out the reference. Keep in mind though, that using this type of "relations" will be rather slow, because you need to fetch all documents from one collection and then fetch the related documents in a separate query. It can have a serious impact on your application.
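For instance, the path-style reference from the question could be resolved with a small parser like the sketch below. The `DocRef` type and `parse_ref` function are made up for illustration, and the format is assumed to be exactly `../<collection>/<id>`.

```python
from typing import NamedTuple


class DocRef(NamedTuple):
    collection: str
    doc_id: str


def parse_ref(ref: str) -> DocRef:
    """Split '../<collection>/<id>' into its collection and document id."""
    parts = ref.split("/")
    if len(parts) != 3 or parts[0] != ".." or not parts[1] or not parts[2]:
        raise ValueError(f"unsupported reference format: {ref!r}")
    return DocRef(collection=parts[1], doc_id=parts[2])


ref = parse_ref("../Metadata/metadata1")
print(ref.collection, ref.doc_id)
```

The second query against the referenced collection still has to be issued separately, which is where the extra latency mentioned above comes from.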
Another possibility is to optimise your data for reading: you can embed the metadata document inside the other document. Your data will be duplicated, so if you update those documents, you will have to update them in both collections, but you'll probably write less often than you read (probably, if that's not the case, this setup would be worse).
Your documents would look like this:
Metadata document in collection "Metadata":
{
"id": "metadata1",
...
}
Measurement document in collection "Measurements":
{
"id": "measurement1",
"metadata" : {
"id": "metadata1",
...
},
...
}
How does DocumentDb handle the case when a document update results in exceeding the collection size (10 GB)? Say I have 50K documents in one of my collections and then I update all of the documents to include an additional JSON section that could exceed the collection size.
What are the best practices to handle this case and is there built in support to handle this scenario (e.g. Move that document to another collection).
There's no specific best practice, but you have specific things built into DocumentDB to help you make proper decisions:
x-ms-resource-usage is a header returned on your queries. Among other things, collectionSize will report total consumption within your collection, including overhead from indexes, etc. You can compare that to collectionSize in the x-ms-resource-quota header returned (which should equate to 10GB), to know how much headroom you have remaining. There's a bit more detail in this answer.
The various language-level drivers provide partitioning support. When you realize you need to span multiple partitions, you can implement a partition resolver, to allow content to be written across multiple partitions. There are several answers covering partitioning thoughts, such as this one posted by Larry Maccherone. And the DocumentDB team published an article on partitioning, here.
You're probably aware already, but: you can check for HTTP 403, which is returned when trying to insert documents and exceeding collection size. All error codes are documented here.
Regarding your question about moving documents to different collections: That's ultimately going to be your call whether to do this within your code or by taking advantage of partition resolvers.
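As a rough sketch of the header comparison mentioned above: the code assumes both headers are semicolon-separated `key=value` pairs with integer values in the same unit. The sample strings are invented for illustration, so check them against what your responses actually contain.

```python
def parse_usage_header(value: str) -> dict:
    """Parse 'key=value;key=value' into a dict of integers."""
    result = {}
    for pair in value.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, _, raw = pair.partition("=")
        result[key.strip()] = int(raw)
    return result


def remaining_collection_space(usage_header: str, quota_header: str) -> int:
    """Quota minus usage for collectionSize, in whatever unit the headers use."""
    usage = parse_usage_header(usage_header)
    quota = parse_usage_header(quota_header)
    return quota["collectionSize"] - usage["collectionSize"]


# Made-up sample values, not captured from a real response.
usage_sample = "documentsSize=2048;collectionSize=4096;"
quota_sample = "collectionSize=10485760;"

print(remaining_collection_space(usage_sample, quota_sample))
```

A check like this before a bulk update gives a chance to repartition before hitting the 403 described above.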
In one article of msdn,
https://azure.microsoft.com/en-in/documentation/articles/documentdb-partition-data/,
there is a line which specifies that "sub-partitioning" or "complex partitioning" can be done. Does this mean :
There can be sub-partitioning inside a collection?
In a single DocumentDb, can there be more than one partitioning logic? For example, I will have four collections inside a single DocumentDb. Can two of them be based on hash and the other two on range?
If either of those answers is YES, then can someone provide me a link that might lead me to an example of the same?
Answers:
There is no explicit method to sub-partition data within a collection. It's common to use a field to represent the type of document or to have isTypeA: true key value pairs on each document, but that's a convention that your application adopts. However, you can create multiple databases (default limit 5 but may be extended upon request) per account and each can have their own set of collections. I'm using that two-level hierarchy in (temporalize-api). TenantID determines my top-level partitioning (database) using a lookup table plus defaults. This allows me to pull critical or high value tenants into a less loaded database and leave everyone else in the default. I use a consistent hash on the EntityID for second-level partitioning (collection).
Sure, there is nothing preventing you from doing that. Pay particular attention to the excellent discussion in the last section (Developing a partitioned application) in the Aravind article you linked to. It includes a checklist of things you'll need to decide upon and implement. The partition resolvers provided for the .NET SDK do not take care of these issues for you.
I haven't yet seen open source examples of what I would consider a complete system including balancing when capacity is added, where to store the partition maps/meta-data, and query fan-out/aggregate optimization. I have a node.js one under way (temporalize-api) and actually in production. I've made decisions about how I'm going to do balancing and query fan-out and those are documented in the comments in that linked file, but I have not implemented all of them. I store the partition meta-data in the "first" collection of the "first" database.
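The two-level routing described in the first answer could look roughly like the sketch below. Everything in it is hypothetical: the database and collection names, the lookup table contents, and the hashing choice. It uses a stable hash modulo the number of collections, which is simpler than a true consistent hash ring and would move most entities if the collection count changed.

```python
import hashlib

# Hypothetical lookup table plus default, mirroring the description above.
TENANT_DATABASE = {"big-tenant": "db-premium"}
DEFAULT_DATABASE = "db-default"
COLLECTIONS = ["coll-0", "coll-1", "coll-2", "coll-3"]


def database_for(tenant_id: str) -> str:
    """First level: tenant -> database via lookup table, else the default."""
    return TENANT_DATABASE.get(tenant_id, DEFAULT_DATABASE)


def collection_for(entity_id: str) -> str:
    """Second level: stable hash of the entity id picks a collection.

    Note: plain hash-modulo, not a consistent hash ring.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(COLLECTIONS)
    return COLLECTIONS[bucket]


def route(tenant_id: str, entity_id: str) -> tuple:
    """Return the (database, collection) pair for a document."""
    return database_for(tenant_id), collection_for(entity_id)


print(route("big-tenant", "entity-1"))
print(route("someone-else", "entity-1"))
```

Balancing when capacity is added and query fan-out, which the answer calls out as the hard parts, are deliberately left out of this sketch.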