What is the limit on the number of collections a single DocumentDB database can have? I keep landing on this link for general DocumentDB limits, but nothing there goes into that detail:
https://learn.microsoft.com/en-us/azure/documentdb/documentdb-limits
I may need up to 200 collections running in one DocumentDB database at a given time, in order to partition customer data by collection. If this is not possible, I'll have to partition across multiple databases, but I can't find the information I need to figure out the proper partitioning strategy!
Also, do I get charged for empty databases, or not until the first collection is created?
There is no limit to the number of collections (no practical limit, anyway), which is why it isn't listed on the limits page. To provision, for example, 200 collections or more, you have to contact billing support.
Empty databases are not charged in DocumentDB.
I have one user bulk deleting some 50K documents from one container using a stored procedure.
Meanwhile, another user is trying to log in to the web app (connected to the same Cosmos DB), but the request fails due to the rate limit being exceeded.
What is the best practice in this case to avoid service disruptions like the one described?
a) Should I provision RUs by collection?
b) Can I cap the RUs consumed by bulk operations from code when making a request?
c) Is there any other approach?
More details on my current (naive/newbie) implementation:
Two collections: RawDataCollection and TransformedDataCollection
The partition key value is the customer account number
RUs are set at the database level (the current dev deployment has the minimum of 400 RU/s)
Bulk insert/delete actions are needed in both collections
User profile data (for login purposes, etc.) is stored in RawDataCollection
Bulk actions are low priority in terms of service level, meaning they could be put on hold if a higher-priority task comes in.
Normally, when a user logs in, they retrieve small amounts of information. This is high priority in terms of service level.
It is recommended not to use stored procedures for bulk delete operations. Stored procedures only operate on the primary replica, meaning they can leverage only 1/4 of the total RU/s provisioned. You will get better throughput usage and more efficiency by doing bulk operations with the SDK client in bulk mode.
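For illustration, here is a minimal sketch of the same idea with the JavaScript/TypeScript SDK (@azure/cosmos) and its bulk API; the endpoint, key, container name, and partition key scheme are placeholders, and production code would want sturdier retry handling:

```typescript
import { CosmosClient, BulkOperationType, OperationInput } from "@azure/cosmos";

// Placeholder endpoint/key/names; substitute your own.
const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!,
  key: process.env.COSMOS_KEY!,
});
const container = client.database("mydb").container("RawDataCollection");

// Deletes the given document ids for one customer account, 100 at a time
// (the bulk API accepts at most 100 operations per call).
async function bulkDelete(ids: string[], accountNumber: string): Promise<void> {
  for (let i = 0; i < ids.length; i += 100) {
    const batch: OperationInput[] = ids.slice(i, i + 100).map((id) => ({
      operationType: BulkOperationType.Delete,
      id,
      partitionKey: accountNumber, // partition key *value*, i.e. the customer account number
    }));
    const responses = await container.items.bulk(batch, { continueOnError: true });
    // Throttling (429) is reported per operation; retry just those once after a pause.
    const retries = batch.filter((_, idx) => responses[idx].statusCode === 429);
    if (retries.length > 0) {
      await new Promise((resolve) => setTimeout(resolve, 1000)); // crude backoff
      await container.items.bulk(retries, { continueOnError: true });
    }
  }
}
```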
Whether you provision throughput at the database level or the container level depends on a couple of things. If you have a large number of containers that get roughly the same number of requests and volume of storage, database-level throughput is fine. If requests and storage are asymmetric, give the containers that diverge greatly from the others their own dedicated throughput. Learn more about the differences.
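As a sketch of the two models with the JavaScript SDK (names and RU figures are invented for illustration), shared throughput goes on the database, while a container that diverges gets its own:

```typescript
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!,
  key: process.env.COSMOS_KEY!,
});

async function provision(): Promise<void> {
  // Database-level (shared) throughput: every container without its own
  // offer draws from this 400 RU/s pool.
  const { database } = await client.databases.createIfNotExists({
    id: "appdb",
    throughput: 400,
  });

  // Shares the database pool.
  await database.containers.createIfNotExists({
    id: "TransformedDataCollection",
    partitionKey: "/accountNumber",
  });

  // Diverges greatly from the others, so it gets dedicated throughput.
  await database.containers.createIfNotExists({
    id: "RawDataCollection",
    partitionKey: "/accountNumber",
    throughput: 1000,
  });
}
```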
You cannot throttle requests on a container directly. You will need to implement Queue-based load leveling in your application.
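A minimal sketch of that pattern with Azure Storage queues (the queue name and message shape are assumptions): the web tier records the job and returns, and a background worker drains the queue at a pace the provisioned RU/s can absorb:

```typescript
import { QueueClient } from "@azure/storage-queue";

// Placeholder connection string and queue name.
const queue = new QueueClient(process.env.STORAGE_CONNECTION_STRING!, "bulk-delete-jobs");

// Producer: the web app records the request and returns immediately.
async function requestBulkDelete(accountNumber: string, ids: string[]): Promise<void> {
  await queue.createIfNotExists();
  const payload = JSON.stringify({ accountNumber, ids });
  await queue.sendMessage(Buffer.from(payload).toString("base64"));
}

// Consumer: a background worker processes one job at a time, so interactive
// logins are never starved by bulk work.
async function processNextJob(): Promise<void> {
  const { receivedMessageItems } = await queue.receiveMessages({ numberOfMessages: 1 });
  for (const msg of receivedMessageItems) {
    const job = JSON.parse(Buffer.from(msg.messageText, "base64").toString());
    console.log(`processing delete job for account ${job.accountNumber}`);
    // ... delete job.ids in small batches here (see the bulk sketch above) ...
    await queue.deleteMessage(msg.messageId, msg.popReceipt);
  }
}
```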
Overall, if you've provisioned 400 RU/s and are trying to bulk delete 50K records, you are under-provisioned and need to increase throughput. In addition, if your workload is highly variable, with long periods of little to no requests followed by short bursts of high volume, you may want to consider serverless or autoscale throughput.
We are new to Cosmos DB and migrating multiple applications to the cloud. What would be the pros and cons of having one database per Cosmos DB account versus all applications' databases in a single account, and which would be cost-effective? If Microsoft bills on RU/s used and storage, and not on how many Cosmos DB accounts are running, what difference does it make to have a single database in each account?
Example-
Approach A:
COSMOS Resource1 > Database1 > Container1
COSMOS Resource1 > Database2 > Container2
Approach B:
COSMOS Resource1 > Database1 > Container1
COSMOS Resource2 > Database2 > Container2
Which approach is better?
Pros
A database can have multiple containers. Each container can have its own RU quota, or you can have them share RUs by placing the quota at the database level. This could save you money by sharing RUs across your whole suite of containers without the hassle of managing each container's cost.
You get the ease of connection management, as your endpoint and key are the same for all of your containers, since they live under one resource.
Adding more RUs benefits all containers, not just one.
Cons
If you have a really read/write-intensive application that takes up a lot of RUs, combining containers under one provisioned quota could leave your other applications receiving errors because there are no RUs left for them to perform their operations.
If someone obtains your key and endpoint, all of your containers are exposed, since they sit under the same resource. This can expose your company's full data inventory to an attacker.
You can't control cost to a fine point. If you have a container that doesn't need many RUs, it could run on a 400 RU provision and cost you only $20 or so, while you put the bulk of your budget toward the RU-hungry app. Separation gives you pinpoint control over RU distribution and cost.
https://learn.microsoft.com/en-us/azure/cosmos-db/set-throughput
Additional tidbits:
The change feed allows you to connect a function, etc. to events within Cosmos DB, and it lets you sync data to an outside database like SQL Server, Elasticsearch, and/or Redis. I've seen a lot of people/companies use that serverless power to sync Elasticsearch with very little code.
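A change feed consumer can be as small as an Azure Function with a Cosmos DB trigger. A sketch in the Node.js v3 programming model, assuming the cosmosDBTrigger binding (monitored container, lease container, connection setting) is configured in function.json:

```typescript
import { AzureFunction, Context } from "@azure/functions";

// Fires for each batch of inserts/updates in the monitored container.
// The cosmosDBTrigger binding itself lives in function.json and is assumed here.
const syncToElasticsearch: AzureFunction = async (
  context: Context,
  documents: Array<{ id: string }>
): Promise<void> => {
  for (const doc of documents) {
    // Push the changed document to Elasticsearch / SQL Server / Redis here.
    context.log(`Document changed: ${doc.id}`);
  }
};

export default syncToElasticsearch;
```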
Make sure you choose your partition key carefully, and never run an operation without it. Sometimes the difference can be over 100 RUs on a single query.
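To see the difference yourself, compare the request charge of a partition-scoped query against a fan-out query; a sketch with invented names:

```typescript
import { CosmosClient } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!,
  key: process.env.COSMOS_KEY!,
});
const container = client.database("mydb").container("orders");

async function compareCharges(accountNumber: string): Promise<void> {
  // Scoped to a single partition: the cheap, fast path.
  const scoped = await container.items
    .query("SELECT * FROM c WHERE c.status = 'open'", { partitionKey: accountNumber })
    .fetchAll();
  console.log(`single-partition query: ${scoped.requestCharge} RUs`);

  // No partition key: fans out to every physical partition and costs far more.
  const fanOut = await container.items
    .query("SELECT * FROM c WHERE c.status = 'open'")
    .fetchAll();
  console.log(`cross-partition query: ${fanOut.requestCharge} RUs`);
}
```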
We are developing a mobile app that should scale to thousands of users, and we are using Azure Search as our main storage. According to the Azure pricing model, the query limits are set to 15 queries per second per unit for the standard plan. With these limits, a system that should scale to thousands of concurrent users would hit them pretty quickly.
In our situation is Azure Search not the right option when scaling for thousands of concurrent users?
Would DocumentDB be a better option?
Thanks!
Interesting that you're using Azure Search as your primary storage, as it's not built to be a database engine. Its storage is specifically for search content (the typical pattern is to use Azure Search in conjunction with a database engine such as SQL Database or DocumentDB), using results to point back to the "system of record" content in your database.
The scale for Search is specifically for the full-text search queries your users will generate. Azure Search scales per unit, with each unit offering 15 searches/second, so you can scale far beyond 15/sec if you buy more search units.
However, don't confuse this with database engine queries. You asked about DocumentDB, so using that as an example: you can query far beyond 15/second with that database engine, and it scales independently. The same goes for any VM-based database solution, SQL Database, etc.; they can all scale.
This really comes down to whether you need full-text-search at high volume. If so, great - just scale Azure Search to the number of units you need, to handle your request traffic. If you can do more database-specific searches, without driving your request through Azure Search, then you don't need to scale out as much, and can take advantage of the native database query capabilities.
One thing to add to David's excellent answer - if your scenario is primarily search driven and you don't need to store data for purposes other than search and are OK with eventual consistency, then using Azure Search as the primary store may be fine.
Also, 15 requests per second query throughput of Azure Search is just a ballpark - it's neither a hard limit nor a promise. Depending on your data and query complexity, the actual throughput can be significantly (many times) higher or lower.
For several years now, Microsoft has offered a "NoSQL" key/value store called Table Storage (http://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-tables/).
Table Storage offers high performance, scalability (via partitioning), and relatively low cost. The primary drawback of Tables is that only the partition and row keys are indexed, so queries on other values are very inefficient.
Recently, Microsoft announced a new "NoSQL" service called "DocumentDB" (http://azure.microsoft.com/en-us/documentation/services/documentdb/).
Instead of storing a list of properties (like Tables do), DocumentDB stores JSON objects. The whole object is indexed, so efficient queries can be written against every property and any nested property of stored objects.
Microsoft says that DocumentDB provides high performance and scalability as well.
If that's so, why would anyone use Table Storage over DocumentDB? It sounds like DocumentDB provides the same functionality as Tables, but with additional capabilities such as the ability to index everything.
I would be glad if someone could compare DocumentDB and Table Storage, highlighting the pros and cons of each.
Both are NoSQL technologies, but they are massively different. Azure Tables is a simple key/value store and does not support complex functionality: no complex queries (most of them will require a full partition/table scan anyway, which will kill your performance and your cost savings), no custom indexing (indexing is based on PartitionKey and RowKey only; you currently can't index any other entity property, and searching for anything other than a PartitionKey/RowKey combination will require a partition/table scan), and no stored procedures. You also can't batch read requests for multiple entities (though batch write requests are supported if all the entities belong to the same partition). For a real-life application of Azure Tables, see HERE.
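To make that concrete, here is a sketch with the @azure/data-tables SDK (account, table, and property names are invented): the PartitionKey/RowKey lookup is the indexed, cheap path, while filtering on any other property forces a scan:

```typescript
import { TableClient, AzureNamedKeyCredential, odata } from "@azure/data-tables";

// Placeholder account, key, and table.
const credential = new AzureNamedKeyCredential("myaccount", process.env.TABLES_KEY!);
const table = new TableClient("https://myaccount.table.core.windows.net", "Books", credential);

async function demo(): Promise<void> {
  // Point lookup on PartitionKey + RowKey: the only indexed access path.
  const book = await table.getEntity("fiction", "book-42");
  console.log(book.rowKey);

  // Filter on a non-key property: the service has to scan, which grows
  // slower and more expensive as the table grows.
  const scan = table.listEntities({
    queryOptions: { filter: odata`Author eq ${"Jane Doe"}` },
  });
  for await (const entity of scan) {
    console.log(entity.rowKey);
  }
}
```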
If your data needs (particularly around querying) are simple, like in the example above, then Azure Tables provides what you need, and you might end up preferring it to DocDB due to pricing, performance, and storage capacity. For example, Azure Tables' performance target is 20,000 operations per second; trying to get that same level of performance on DocDB will carry a significantly higher service cost. Also, Azure Tables is limited by the capacity of your Azure storage account (500 TB), whereas DocDB storage is limited by the capacity units you buy.
Table Services is mainly a key/value-type NoSQL store, and DocumentDB is (as the name suggests) a document-type NoSQL store. What you are asking is essentially the difference between these two NoSQL approaches; if you shape your research around that, you should be able to get a better understanding.
Just to keep things simple, I suggest you consider the differences in how DocumentDB and Table Services are priced. Not only do the costs of these services vary a lot from each other, but the fact that DocumentDB works on a "provision first" model while Table Services is offered on purely consumption-based pricing should give you some clues for your compare/contrast.
Let me ask you this: why would I use DocumentDB if the features in Table Services serve my needs well? ;) I suggest you take a look at how the current Azure Diagnostics tooling uses Azure Storage, and at how Storage Metrics uses Azure Storage itself, to get a sense of how useful Table Services can be and how much overkill DocumentDB might be in some situations.
Hope this helps.
I think the comparison is all about trading price for performance. Table Services are just Storage services, which seem to cap out at 20,000 ops/second, but paying for that kind of throughput all the time (because Storage gives it to us all the time) is $1,200/month. Crazy money.
Table Services have simple indexes, so queries are very limited; they're good for anything written and read via IDs. DocumentDB indexes the entire document, so a query can be run against any property.
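For instance, a DocumentDB query can filter on a nested property with no index setup at all; a sketch with placeholder names:

```typescript
import { CosmosClient, SqlQuerySpec } from "@azure/cosmos";

const client = new CosmosClient({
  endpoint: process.env.COSMOS_ENDPOINT!,
  key: process.env.COSMOS_KEY!,
});
const container = client.database("mydb").container("people");

// Every property, including nested ones, is indexed by default, so this
// query needs no index to be declared up front.
const query: SqlQuerySpec = {
  query: "SELECT c.id, c.name FROM c WHERE c.address.city = @city AND c.age > @age",
  parameters: [
    { name: "@city", value: "Seattle" },
    { name: "@age", value: 30 },
  ],
};

async function run(): Promise<void> {
  const { resources } = await container.items.query(query).fetchAll();
  console.log(resources);
}
```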
And lastly, Table Services are bound by the storage constraint of the storage account they're on (which could get quite high if you negotiate with Microsoft directly), whereas DocumentDB storage seems unlimited.
So it's a balance. Do you have a LOT of data (hundreds of gigs, or terabytes) that you need in one place? DocumentDB. Do you need to support complex queries? DocumentDB. Do you have data that needs to come and go fast, based on a one- or two-property lookup? Table Services. Would you trade having to code around a simple index to avoid paying through the nose for throughput? Table Services.
And Redis, someone mentioned that... man, I dunno. Even the existence of persistence in a caching framework (which Redis offers) doesn't make it the tech of choice here. There is a huge difference between a persistent store that holds data that is "often used, but may be missing or time-retired," as a cache would, and a persistent store that guarantees your data will be there.
A real-life example:
I have to store some tokens, retrieve them, and delete them. The only query ever done will be based on the user ID.
So I use Table Storage, as it fulfills my requirement perfectly: I save the token against the user ID.
DocumentDB seemed to be overkill for this.
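That use case really is a handful of lines; a sketch with the @azure/data-tables SDK, keying tokens under the user ID (account and table names are placeholders):

```typescript
import { TableClient, AzureNamedKeyCredential } from "@azure/data-tables";

// Placeholder account and table; the user id is the PartitionKey, so every
// lookup is an indexed point read.
const credential = new AzureNamedKeyCredential("myaccount", process.env.TABLES_KEY!);
const tokens = new TableClient("https://myaccount.table.core.windows.net", "UserTokens", credential);

async function saveToken(userId: string, tokenId: string, value: string): Promise<void> {
  await tokens.createEntity({ partitionKey: userId, rowKey: tokenId, value });
}

async function getToken(userId: string, tokenId: string): Promise<string> {
  const entity = await tokens.getEntity<{ value: string }>(userId, tokenId);
  return entity.value;
}

async function deleteToken(userId: string, tokenId: string): Promise<void> {
  await tokens.deleteEntity(userId, tokenId);
}
```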
Here is the answer from Microsoft's official docs.
Common attributes of Cosmos DB, Azure Table Storage, and Azure SQL Database:
99.99% availability SLA
Fully managed database services
ISO 27001, HIPAA, and EU Model Clauses compliant
The following table shows the differing attributes of Azure Cosmos DB and Azure Table Storage.
I am designing a multi-tenant web-based SaaS application that will be hosted on Windows Azure and use Table Storage.
The only limits I have found so far are:
5 storage accounts per subscription
100 TB maximum per storage account
1 MB per entity
I am deciding how to best partition my storage for multiple customers:
Option 1: Give each customer their own storage account. Not likely, considering the 5-account default limit.
Option 2: Give each customer their own set of tables. Prefix the table names with customer identifiers, such as a Books table split as "CustA_Books", "CustB_Books", etc.
Option 3: Have one set of tables, but prefix the partition keys to split the customers. So one "Books" table with partition keys of "CustA_Fiction", "CustA_NonFiction", "CustB_Fiction", "CustB_NonFiction", etc.
What are the pros and cons for options 2 and 3? Is there a limit to the number of tables in a single account that might affect option 2?
There are no limits to the number of tables you can create in Windows Azure; your only limits are the ones you have already listed. Well... I guess there are other limits if you consider that an entity attribute is always 64 KB or less, or if you count batch operations (100 entities or 4 MB, whichever is less).
Anyhow, the thing to keep in mind here is that your PartitionKey is going to be the most important thing you design. If you create a PK with the customer name in it, you get some good partitioning benefits. The downside is that if you mix customer data in the same table, you make it harder on yourself to delete data (if you ever need to delete a customer). So you can use the table as another level of partitioning: the PK you create is scoped to the table you create it under.
What I would consider here is whether you will ever need to delete data in bulk, or ever need to query data across customers (tenants). For bulk deletes, it makes a ton of sense to use separate tables per customer, so that a delete is one operation versus, at best, one per 100 entities. However, if you need to query across tenants, it is harder to join the data when you have multiple tables (that would require multiple queries).
All things being equal, I would use tables as another level of partitioning if there is no overlap in tenant functionality, making my life easier should I ever want to delete a tenant. So I guess that is option 2.
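A sketch of option 2 with the @azure/data-tables SDK (the account and naming scheme are placeholders; note that table names allow only letters and digits, so the underscore in names like "CustA_Books" would actually have to be dropped):

```typescript
import { TableServiceClient, TableClient, AzureNamedKeyCredential } from "@azure/data-tables";

const endpoint = "https://myaccount.table.core.windows.net";
const credential = new AzureNamedKeyCredential("myaccount", process.env.TABLES_KEY!);
const service = new TableServiceClient(endpoint, credential);

// One "Books" table per customer; table names allow only letters and digits.
const booksTable = (customerId: string) => `${customerId}Books`;

async function onboardCustomer(customerId: string): Promise<void> {
  await service.createTable(booksTable(customerId));
}

function customerBooks(customerId: string): TableClient {
  return new TableClient(endpoint, booksTable(customerId), credential);
}

async function offboardCustomer(customerId: string): Promise<void> {
  // Deleting the tenant's table is a single operation, versus batched entity deletes.
  await service.deleteTable(booksTable(customerId));
}
```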
HTH
I highly suggest Option 2
We are also going this route because it adds a nice level of federation for the customer data. As the accepted answer mentions, it is easier to manage adding/deleting customers. Another benefit we have noticed is the 'copy-ability' of a customer's data: this approach makes it much easier to move customer-specific data to other storage accounts, or to development environments for testing, without affecting the entire lot.
In the SaaS world it also enables customers to get a copy of their own data with little effort, which is also a concern of many SaaS users.
Another alternative:
Imagine you have N storage accounts (the limit is 100 storage accounts per subscription), and each storage account has a table per customer.
For table operations that carry a partition key (insert, update, delete, or a point query), you calculate a hash of the customer name + partition key, take it modulo N (the total number of storage accounts), use the result as the index of a storage account, and forward the request to that storage account and table.
For read requests with no partition key, such as a range query, you would need to broadcast the request to all storage accounts and merge the results.
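A sketch of that routing logic (account names, keys, and the choice of hash are placeholders):

```typescript
import { createHash } from "crypto";
import { TableClient, AzureNamedKeyCredential } from "@azure/data-tables";

// Hypothetical fixed fleet of N storage accounts and their keys.
const accounts = ["acctfoo", "acctbar", "acctbaz"]; // N = 3
const keys: Record<string, string> = {
  /* account name -> access key, e.g. loaded from configuration */
};

// Deterministically route a (customer, partition key) pair to one account.
function routeToTable(customer: string, partitionKey: string): TableClient {
  const digest = createHash("md5").update(customer + partitionKey).digest();
  const index = digest.readUInt32BE(0) % accounts.length; // hash mod N
  const name = accounts[index];
  const credential = new AzureNamedKeyCredential(name, keys[name]);
  // One table per customer inside each account.
  return new TableClient(`https://${name}.table.core.windows.net`, customer, credential);
}
```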
One other thing to keep in mind is the naming of multiple storage accounts. Avoid naming the accounts lexicographically; that can cause them to be served from the same partition server on the Azure backend, which goes against the recommended scalability best practices. If you have N storage accounts, prefix each storage account name with a 3-digit hash so they are distributed evenly.