Document update resulting in collection greater than 10GB - azure

How does DocumentDB handle the case when a document update results in exceeding the collection size (10 GB)? Say I have 50K documents in one of my collections and I then update all of the documents to include an additional JSON section that could exceed the collection size.
What are the best practices for handling this case, and is there built-in support for this scenario (e.g. moving that document to another collection)?

There's no single best practice, but there are specific things built into DocumentDB to help you make the right decisions:
x-ms-resource-usage is a header returned on your queries. Among other things, its collectionSize value reports total consumption within your collection, including overhead from indexes, etc. You can compare that to collectionSize in the returned x-ms-resource-quota header (which should equate to 10 GB) to know how much headroom you have remaining (see the sketch below). There's a bit more detail in this answer.
The various language-level drivers provide partitioning support. When you realize you need to span multiple partitions, you can implement a partition resolver, to allow content to be written across multiple partitions. There are several answers covering partitioning thoughts, such as this one posted by Larry Maccherone. And the DocumentDB team published an article on partitioning, here.
You're probably aware already, but: you can check for HTTP 403, which is returned when an insert would exceed the collection size. All error codes are documented here.
Regarding your question about moving documents to different collections: That's ultimately going to be your call whether to do this within your code or by taking advantage of partition resolvers.
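Below is a minimal, hedged sketch of the first and third points: parsing the usage/quota headers to estimate remaining headroom and reacting to a quota-related 403. The header format assumed here (a semicolon-delimited list of key=value pairs, sizes in KB) and the illustrative values are assumptions to verify against your own SDK/REST responses.
```python
# Sketch: estimate collection headroom from DocumentDB response headers.
# Assumption: both headers look like "documentsSize=...;collectionSize=...;..."
# with sizes reported in KB; confirm against the responses your SDK exposes.

def parse_resource_header(header_value):
    """Turn 'a=1;b=2' into {'a': 1.0, 'b': 2.0}."""
    pairs = (item.split("=", 1) for item in header_value.split(";") if "=" in item)
    return {key: float(value) for key, value in pairs}

def remaining_collection_kb(usage_header, quota_header):
    usage = parse_resource_header(usage_header)
    quota = parse_resource_header(quota_header)
    return quota.get("collectionSize", 0.0) - usage.get("collectionSize", 0.0)

# After a query/insert, read the two headers from the response object your SDK
# exposes and decide whether to start routing new writes to another collection.
headroom_kb = remaining_collection_kb(
    "documentsSize=9800000;collectionSize=9900000",     # illustrative usage
    "documentsSize=10485760;collectionSize=10485760",   # illustrative 10 GB quota
)
if headroom_kb < 100 * 1024:   # e.g. less than ~100 MB left
    print("Collection nearly full; route new writes elsewhere or archive.")

# Likewise, if a write comes back with HTTP 403, treat it as "collection is
# full" and fall back to another collection/partition.
```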

Related

Amazon QLDB Design Principles

I'm wondering about the best way to design tables in QLDB, and whether it's best to perform joins or perhaps use nested documents.
For example, say I have the tables transaction and payment, where a payment must be associated with a transaction. Which of the following options is best?
Nested Document Option (One table)
{
  'payment_reference': 'abc123',
  'transaction': {
    'id': 123,
    'name': 'John Doe',
    'amount': '$10'
  },
  'fees': '$2',
  'amount_paid': '$12'
}
Two Table Option
Payment Document
{
  'payment_reference': 'abc123',
  'transaction_id': 123,
  'fees': '$2',
  'amount_paid': '$12'
}
Transaction Document
{
  'id': 123,
  'amount': '$10',
  'name': 'John Doe'
}
I think @Aurgho has answered your question, but I am going to add my general thoughts based on what Aurgho said, which might help others coming to this post with a similar question.
There are multiple factors that can influence your design decision, along with the quotas and limits QLDB imposes. Here are a few pointers that might help you think it through:
Query pattern: At this point, Amazon QLDB allows creation of indexes only on top-level fields. In the nested document design (Option #1), if your queries touch any of the fields of the nested document, those queries won't use an index and will perform scans, which can hurt performance. With Option #2, you can have indexes on both tables and use those indexed fields in your join criteria (see the join sketch below).
Access pattern: Are you going to have significantly more writes than reads? If your reads are sparse and not extremely sensitive to slightly elevated latency, Option #1 might be better from a data modeling perspective, since all the payment-related information is captured in a single document. On the other hand, if you have far more reads and the reads are latency sensitive, evaluate your options from the previous point's perspective.
Quotas and limits: Amazon QLDB has a quota on document size (currently 128 KB): https://docs.aws.amazon.com/qldb/latest/developerguide/limits.html#limits.fixed. If you plan to add more fields as you go, the per-document size can keep increasing with the nested fields, and you might eventually run into the document size limit. There are other quotas too which can affect your decision, depending on your use case.
Generally speaking, if you are not going to query on a field in the nested document, and/or your writes >>> reads, and/or your reads are not super sensitive to latency, and/or your document size will stay within the currently imposed limits, you could go with Option #1. Having all your data in one document simplifies the application layer, both when you are pushing the data into QLDB (just one insert) and when you have to process the documents in your code, but you will have to choose your trade-offs carefully.
These are just general pointers to help you think forward. You could have other use cases where one of the design options becomes more convincing than the other, and you can trade off certain advantages and disadvantages between the two.
Also, QLDB has some recommendations for optimizing your query performance, which can further help you with your decision: https://docs.aws.amazon.com/qldb/latest/developerguide/working.optimize.html
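As a rough illustration of Option #2, here is a hedged sketch using the Python pyqldb driver. The ledger name is hypothetical, the table and field names follow the question's schema, and the indexes assume the fields are top-level (as QLDB requires); if Transaction clashes with a reserved word in your environment, quote it accordingly.
```python
# Sketch: the two-table option queried with a PartiQL join (pyqldb driver).
# Ledger name "payments-ledger" is hypothetical; Payment/Transaction follow the
# question's schema. Indexes can only target top-level fields in QLDB.
from pyqldb.driver.qldb_driver import QldbDriver

driver = QldbDriver(ledger_name="payments-ledger")

# One-time setup: index the join/lookup fields so the join doesn't scan.
driver.execute_lambda(lambda txn: txn.execute_statement(
    "CREATE INDEX ON Payment (transaction_id)"))
driver.execute_lambda(lambda txn: txn.execute_statement(
    "CREATE INDEX ON Transaction (id)"))

def payment_with_transaction(txn, payment_reference):
    # Implicit join in PartiQL: pair each Payment with its Transaction.
    cursor = txn.execute_statement(
        "SELECT p.payment_reference, p.amount_paid, t.name, t.amount "
        "FROM Payment AS p, Transaction AS t "
        "WHERE p.transaction_id = t.id AND p.payment_reference = ?",
        payment_reference,
    )
    return list(cursor)

rows = driver.execute_lambda(lambda txn: payment_with_transaction(txn, "abc123"))
```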
If, as in the nested document option, transaction documents are nested inside payment documents, please keep in mind that the document size limit is 128 KB, as mentioned in the QLDB limits documentation. If the payment document can be foreseen to grow large enough to hit this limit after nesting, this option could be risky.
If you foresee having to index on some of the fields in the transaction documents, you can create two separate tables and perform a join instead. (As noted in the create index reference, QLDB does not allow indexing on nested values of a document, and as mentioned in the limits documentation, Amazon QLDB allows a maximum of 5 indexes per table.)
The above recommendations are based only on the information provided in the post; we are unaware of the actual access patterns in this use case and would need to understand them better to give a more specific answer.
You can reach out to the team at qldb-outbound AT amazon.com for further consultation regarding your use-case.
Thanks

MongoDB schema design

I'm planning to implement this schema in MongoDB. I have been doing some reading about schema design, and the notion was that whenever you structure your data like a relational database you must be doing something wrong.
My questions:
What should I do when a collection grows larger than the 16 MB limit?
The app_log data in the server_log collection might in some cases grow larger than 16 MB, depending on how busy the server is.
I'm aware of capped collections, but the requirement is to store all logs for 90 days.
Do you see any potential issues with my design?
Is it good practice to have the application check the collection size and create a new collection per day / hour etc. to accommodate log growth?
Thanks
Your collection size is not restricted to 16 MB. As one of the comments pointed out, you can check in the MongoDB manual that 16 MB is the maximum document size. So there is no need to split the same class of data across different collections; in fact it would be a major headache for you to do so :) One collection for users, one for your servers, and one for your server_logs. You can then create references from one collection to the next by using the id field.
Whether this is a good design or not will depend on your queries. In general, you want to avoid using joins in Mongo (they're still possible, but if you're doing a bunch of joins, you're using it wrong, and really should use a relational DB :-)
For example, if most of your queries are on the server_log collection and only use the fields in that collection, then you'll be fine. OTOH, if your server_log queries always need to pull in data from the server collection as well (say for example the name and userId fields), then it might be worth selectively denormalizing that data. That's a fancy way of saying, you may wish to copy the name and userId fields into your server_log documents, so that your queries can avoid having to join with the server collection. Of course, every time you denormalize, you add complexity to your application which must now ensure that the data is consistent across multiple collections (e.g., when you change the server name, you have to make sure you change it in the server_logs, too).
You may wish to make a list of the queries you expect to perform, and see if they can be done with a minimum of joins with your current schema. If not, see if a little denormalization will help. If you're getting to the point where either you need to do a bunch of joins or a lot of manual management of denormalized data in order to satisfy your queries, then you may need to rethink your schema or even your choice of DB.
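To make the referencing-plus-selective-denormalization idea concrete, here is a hedged pymongo sketch. The connection string, database name, and field names are illustrative, and the TTL index is a standard MongoDB feature suggested here (not by the answer above) as one way to meet the 90-day retention requirement.
```python
# Sketch: server_log documents reference the server by _id and carry a
# denormalized copy of the server name so common log queries need no join.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # illustrative connection
db = client["monitoring"]                            # illustrative db name

server = db.servers.find_one({"name": "web-01"})

db.server_log.insert_one({
    "server_id": server["_id"],       # reference back to the servers collection
    "server_name": server["name"],    # denormalized copy; keep in sync on rename
    "level": "INFO",
    "message": "request served",
    "ts": datetime.now(timezone.utc),
})

# Index the reference field so per-server queries stay fast; a TTL index on the
# timestamp can expire documents after 90 days instead of manual cleanup.
db.server_log.create_index("server_id")
db.server_log.create_index("ts", expireAfterSeconds=90 * 24 * 3600)

# A typical query that never needs to touch the servers collection:
recent = db.server_log.find({"server_name": "web-01"}).sort("ts", -1).limit(50)
```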
What should I do when a collection grows larger than the 16 MB limit?
In MongoDB there is no limit on collection size. The limit applies to each document: a single document must not exceed 16 MB.
Do you see any potential issues with my design?
No, I don't see any issues with the above design.

Sub-partitioning or composite partitioning in DocumentDB

In an MSDN article,
https://azure.microsoft.com/en-in/documentation/articles/documentdb-partition-data/,
there is a line which specifies that "sub-partitioning" or "complex partitioning" can be done. Does this mean:
There can be sub-partitioning inside a collection?
In a single DocumentDB database, there can be more than one partitioning logic? For example, I will have four collections inside a single DocumentDB database. Can two of them be based on hash and the other two on range?
If either of those answers is YES, then can someone provide me a link that might lead me to an example of the same?
Answers:
There is no explicit method to sub-partition data within a collection. It's common to use a field to represent the type of document or to have isTypeA: true key value pairs on each document, but that's a convention that your application adopts. However, you can create multiple databases (default limit 5 but may be extended upon request) per account and each can have their own set of collections. I'm using that two-level hierarchy in (temporalize-api). TenantID determines my top-level partitioning (database) using a lookup table plus defaults. This allows me to pull critical or high value tenants into a less loaded database and leave everyone else in the default. I use a consistent hash on the EntityID for second-level partitioning (collection).
Sure, there is nothing preventing you from doing that. Pay particular attention to the excellent discussion in the last section (Developing a partitioned application) in the Aravind article you linked to. It includes a checklist of things you'll need to decide upon and implement. The partition resolvers provided for the .NET SDK do not take care of these issues for you.
I haven't yet seen open source examples of what I would consider a complete system including balancing when capacity is added, where to store the partition maps/meta-data, and query fan-out/aggregate optimization. I have a node.js one under way (temporalize-api) and actually in production. I've made decisions about how I'm going to do balancing and query fan-out and those are documented in the comments in that linked file, but I have not implemented all of them. I store the partition meta-data in the "first" collection of the "first" database.
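For illustration, here is a hedged sketch of the two-level scheme described above: a tenant lookup table with a default at the database level, and a hash on the entity id at the collection level. All names are hypothetical, the hash-mod-N shortcut stands in for a real consistent-hash ring, and balancing plus persisted partition meta-data are left out.
```python
# Sketch: two-level partition resolution -- lookup table -> database,
# stable hash on the entity id -> collection. Names are hypothetical.
import hashlib

TENANT_DB_OVERRIDES = {"big-tenant": "db_premium"}  # critical/high-value tenants
DEFAULT_DB = "db_default"
COLLECTIONS_PER_DB = 4                               # coll_0 .. coll_3

def resolve_partition(tenant_id: str, entity_id: str) -> tuple[str, str]:
    database = TENANT_DB_OVERRIDES.get(tenant_id, DEFAULT_DB)
    # hash-mod-N keeps the sketch short; a true consistent-hash ring would
    # reduce data movement when collections are added later.
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    collection = f"coll_{int(digest, 16) % COLLECTIONS_PER_DB}"
    return database, collection

# resolve_partition("big-tenant", "order-42") -> ("db_premium", one of coll_0..coll_3)
```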

Real time multi threaded max-heap for top-N geohash

There is a requirement to keep a list of top-10 localities in a city from where demand for our food service is emanating at any given instant. The city could have tens of thousands of localities.
If one has to build a near-real-time (lag no more than 5 minutes) in-memory datastore that would
- keep count of incoming demand by locality (geohash)
- serve reads by hundreds of our suppliers every minute (the AJAX refresh is every minute)
I was thinking of a multi-threaded, synchronized max-heap. This would be a complex solution, as tree locking is itself complex to implement.
Any recommendations for the best in-memory (replicable master-slave) data structure that can be read and updated in a multi-threaded environment?
We expect 10K QPS and 100K updates per second. When we scale to other cities and regions, we will need per city implementation of top-10.
Are there any off the shelf solutions available?
Persistence is not a need, so no MySQL-based solutions. If you recommend a Redis or MongoDB solution, please realize that the queries are not point queries by key but top-N queries instead.
Thanks in advance.
If you're looking for exactly what you're describing, there are a few approaches that might work nicely. There are several papers describing concurrent data structures that could work as priority queues; here is one option that I'm not super familiar with but which looks promising. You might also want to check out concurrent skip lists, which should also match your requirements.
If I'm interpreting your problem statement correctly, you're hoping to maintain a top-10 list of locations based on the number of hits you receive. If that's the case, I would suspect that while the number of updates would be huge, the number of times that two locations would switch positions would not actually be all that large. In other words, most updates wouldn't actually require the data structure to change shape. Consequently, you could consider using a standard binary heap where each element uses an atomic-compare-and-set integer key and where you have some kind of locking system that's used only in the case where you need to add, move, or delete an element from the heap.
Given the scale that you're working at, you may also want to consider approximate solutions to your problem. The count-min sketch data structure, for example, was specifically designed to estimate frequent elements in a data stream and does so extremely quickly. It can easily be distributed and linked up with a priority queue in a manner similar to what I described above. There are lots of good implementations out there, and if I remember correctly this data structure is actually deployed in situations like the one you're describing.
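As a rough, hedged sketch of that last idea: a count-min sketch estimates per-geohash counts, and a tiny top-N selection runs over the tracked estimates. It is single-threaded for clarity; a real deployment would shard per city and guard updates with locks or a concurrent structure.
```python
# Sketch: count-min sketch for per-geohash demand counts plus a top-10 view.
import heapq
import random

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.seeds = [random.randrange(1 << 31) for _ in range(depth)]

    def add(self, key, count=1):
        """Increment the key and return its current (over-)estimate."""
        estimates = []
        for row, seed in enumerate(self.seeds):
            idx = hash((seed, key)) % self.width
            self.tables[row][idx] += count
            estimates.append(self.tables[row][idx])
        return min(estimates)

def top_n(estimates, n=10):
    """Top-n (geohash, count) pairs from the tracked estimates."""
    return heapq.nlargest(n, estimates.items(), key=lambda kv: kv[1])

sketch = CountMinSketch()
running = {}                                           # geohash -> latest estimate
for geohash in ["tdr1y", "tdr1v", "tdr1y", "tdr3n"]:   # incoming demand events
    running[geohash] = sketch.add(geohash)

print(top_n(running))   # e.g. [('tdr1y', 2), ('tdr1v', 1), ('tdr3n', 1)]
```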
Hope this helps!

What is the best practice for mongoDB to handle 1-n n-n relationships?

In a relational database, 1-n and n-n relationships mean two or more tables.
But in MongoDB, since it is possible to directly store those things in one model like this:
Article{
  content: String,
  uid: String,
  comments: [Comment]
}
I am getting confused about how to manage those relations. For example, in the article-comments model, should I directly store all the comments in the article model and then read out the entire article object as JSON every time? But what if the comments grow really large? Say there are 1,000 comments in an article object: will such a strategy make every GET request very slow?
I am by no means an expert on this; however, I've worked through similar situations before.
From the few demos I've seen, yes, you should store all the comments directly inline. This is going to give you the best performance (unless you're expecting some ridiculous number of comments). This way you have everything in your document.
In the future, if things start going great and you do notice things slowing down, you could do a few things. You could look at storing the latest (insert arbitrary number here) comments with a reference to where the other comments are stored, then map-reduce old comments out into a "bucket" to keep loading times quick.
However initially I'd store it in one document.
So you would have a model that looked maybe something like this:
Article{
  content: String,
  uid: String,
  comments: [
    {"comment": "hi", "user": "jack"},
    {"comment": "hi", "user": "jack"}
  ],
  "oldCommentsIdentifier": 12345
}
Then only have oldCommentsIdentifier populated if you did move comments out of your comments array; however, I really wouldn't do this for fewer than 1,000 comments, and maybe even more. It would take a bit of testing here to see what the "sweet spot" would be.
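As a hedged illustration of that bucketing idea, here is a pymongo sketch; the threshold, collection names, and the single oldCommentsIdentifier field are simplifications (a real version might keep a list of bucket ids or append to the latest bucket).
```python
# Sketch: keep only the most recent comments embedded; overflow goes into a
# bucket document referenced by oldCommentsIdentifier. Names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]
KEEP_EMBEDDED = 1000   # the "sweet spot" would need testing, as noted above

def add_comment(article_id, comment):
    db.articles.update_one({"_id": article_id},
                           {"$push": {"comments": comment}})
    article = db.articles.find_one({"_id": article_id}, {"comments": 1})
    comments = article.get("comments", [])
    if len(comments) > KEEP_EMBEDDED:
        # Move the oldest comments into a bucket and keep the newest embedded.
        bucket_id = db.old_comments.insert_one({
            "article_id": article_id,
            "comments": comments[:-KEEP_EMBEDDED],
        }).inserted_id
        db.articles.update_one(
            {"_id": article_id},
            {"$set": {"comments": comments[-KEEP_EMBEDDED:],
                      "oldCommentsIdentifier": bucket_id}},
        )
```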
I think a large part of the answer depends on how many comments you are expecting. Having a document that contains an array that could grow to an arbitrarily large size is a bad idea, for a couple reasons. First, the $push operator tends to be slow because it often increases the size of the document, forcing it to be moved. Second, there is a maximum BSON size of 16MB, so eventually you will not be able to grow the array any more.
If you expect each article to have a large number of comments, you could create a separate "comments" collection, where each document has an "article_id" field that contains the _id of the article that it is tied to (or the uid, or some other field unique to the article). This would make retrieving all comments for a specific article easy, by querying the "comments" collection for any documents whose "article_id" field matches the article's _id. Indexing this field would make the query very fast.
The link that limelights posted as a comment on your question is also a great reference for general tips about schema design.
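Here is a hedged pymongo sketch of that separate-collection approach; the database and collection names are illustrative, and the article_id index is what keeps the per-article lookup fast.
```python
# Sketch: comments live in their own collection and reference the article's _id.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

article_id = db.articles.insert_one({"content": "...", "uid": "user-1"}).inserted_id

db.comments.create_index("article_id")        # makes the per-article query fast
db.comments.insert_one({
    "article_id": article_id,
    "user": "jack",
    "comment": "hi",
})

# All comments for one article, paged, without touching a huge embedded array:
first_page = list(db.comments.find({"article_id": article_id})
                             .sort("_id", 1)
                             .limit(20))
```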
But if I solve this problem by linking articles and comments with _id, won't it kind of go back to relational database design, and somehow lose the essence of being NoSQL?
Not really; NoSQL isn't all about embedding models. In fact, embedding should be considered carefully for your scenario.
It is true that the aggregation framework solves quite a few of the problems you can get from embedding objects that you need to use as documents themselves. I define subdocuments that need to be used as documents as:
Documents that need to be paged in the interface
Documents that might exist across multiple root documents
Documents that require advanced sorting within their group
Documents that, when in a group, will exceed the root document's 16 MB limit
As I said, the aggregation framework does solve this a little; however, you're still looking at performing a query that, in real time or close to it, would be much like performing the same query in SQL on the same number of documents.
This effect is not always desirable.
You can achieve paging (sort of) of subdocuments with normal querying using the $slice operator, but then this has pretty much the same problems as using skip() and limit() over large result sets, which again is undesirable since you cannot fix it so easily with a range query (the aggregation framework would be required again). Even with 1,000 subdocuments I have seen speed problems, and not just me but other people too.
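For reference, a hedged pymongo sketch of the $slice-based paging mentioned above (field names follow the Article model; the page size is arbitrary):
```python
# Sketch: page embedded comments with $slice -- skip 20, return the next 10.
# This walks the array server-side, so deep pages get progressively slower,
# much like skip()/limit() over a large result set.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["blog"]

article = db.articles.find_one(
    {"uid": "user-1"},
    {"content": 1, "comments": {"$slice": [20, 10]}},
)
```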
So let's get back to the original question: how to manage the schema.
Now the answer, which you're not going to like, is: it all depends.
Do your comments satisfy the criteria above that suggest they should be separate? If so, then that is probably a good bet.
There is no single best way to do this. In MongoDB you should design your collections according to the application that is going to use them.
If your application needs to display comments together with the article, then I would say it is better to embed the comments in the article collection. Otherwise, you will end up with several round trips to your database.
There is one scenario where embedding does not work. As far as I know, document size is limited to 16 MB in MongoDB, which is actually quite large. However, if you think your document size could exceed this limit, it is better to have a separate collection.

Resources