What happens if an Azure Cosmos DB container runs out of RUs? [duplicate] - azure

This question already has answers here:
How is cosmosDB RU throughput enforced
(2 answers)
Closed 2 years ago.
Example: with 1,000 RUs/hr, if all 1,000 RUs are consumed within the hour, what happens to the 1,001st request?

Cosmos DB request units are per second, not per hour.
So with 1000 RU/s you can run 1000 queries per second that each take 1 RU, or 100 queries per second that each take 10 RUs.
You might be confused because the billing is per hour based on the maximum amount of RUs you defined in that hour.
So if you set 1000 RUs on a collection and then change to 400 RUs, you still get billed for 1000 RUs for the current hour.
If you exceed that number, you get a 429 error back from Cosmos DB like the other answer states.
If you use a Cosmos DB SDK, you usually don't need to worry about this, as the SDK will automatically retry the query after some time when it gets a 429.
This retry policy is configurable, so you can decide how many retries should be attempted (see the sketch below).
Getting some 429s is usually expected; otherwise you might be over-provisioning throughput in Cosmos DB.
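For example, with the .NET SDK v3 (Microsoft.Azure.Cosmos) you can tune both the retry count and the maximum wait time when constructing the client. A minimal sketch, assuming placeholder endpoint and key values:

using System;
using Microsoft.Azure.Cosmos;

// Tune the SDK's automatic 429 retry policy (.NET SDK v3 sketch).
CosmosClientOptions options = new CosmosClientOptions
{
    // How many 429s to absorb per request before surfacing a CosmosException (default: 9).
    MaxRetryAttemptsOnRateLimitedRequests = 5,
    // Upper bound on the total time spent waiting across those retries (default: 30 s).
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(15)
};
CosmosClient client = new CosmosClient("<endpoint>", "<key>", options);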

That is a really interesting question. Some information suggests that if you hit the provisioned RU/s rate limit for any operation or query, the Cosmos DB service won't execute the operation and the API will throw a DocumentClientException with the HttpStatusCode property set to 429. This HTTP status code means that the request made to Azure Cosmos DB has exceeded the provisioned throughput and couldn't be executed.
429 Too many requests
The collection has exceeded the provisioned throughput limit. Retry the request after the server-specified retry-after duration.
Cosmos DB throttles your requests intelligently
When you're exceeding your RU quota, Cosmos DB doesn't reject your additional requests by just screaming ERROR! Not only does it explicitly flag these throttled requests with the HTTP status code 429, but the response also provides a very useful header: x-ms-retry-after-ms. As its name implies, this header tells you how much time you should wait before retrying.
Although this hint has its own limits (it may not be very reliable if multiple clients overload your RU quota at the same time), it's still very useful information for defining the cool-off period one should wait, and for avoiding a retry policy that would be too aggressive.
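If you handle throttling yourself instead of relying on the SDK's built-in retries, you can read that hint straight off the exception. A sketch, assuming the .NET SDK v3 and a hypothetical container, item id, and partition key:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hand-rolled retry loop that honors the server's cool-off hint.
static async Task<ItemResponse<dynamic>> ReadWithBackoffAsync(
    Container container, string id, string pk, int maxAttempts = 5)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await container.ReadItemAsync<dynamic>(id, new PartitionKey(pk));
        }
        catch (CosmosException ex) when ((int)ex.StatusCode == 429 && attempt < maxAttempts)
        {
            // RetryAfter is populated from the x-ms-retry-after-ms response header.
            await Task.Delay(ex.RetryAfter ?? TimeSpan.FromMilliseconds(100));
        }
    }
}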
Ref
https://learn.microsoft.com/en-us/rest/api/cosmos-db/http-status-codes-for-cosmosdb
https://medium.com/@thomasweiss_io/how-i-learned-to-stop-worrying-and-love-cosmos-dbs-request-units-92c68c62c938

Related

Cosmos DB metrics report 100x more requests than expected

I'm comparing the service side metrics of my app with the metrics emitted by Cosmos DB and I can see a 100x difference in request counts.
Is my container misconfigured? Am I querying the wrong way? Is Cosmos performing multiple requests internally for each query I'm running against it?
The metric I'm looking at in Cosmos is TotalRequests/Count/5min.
The container has indexes on all attributes + a few composite indexes.
The query I'm running is:
SELECT *
FROM x
WHERE x.partitionKey = 0
  AND x.index1 = 1
  AND x.index2 = 2
The container is suffering from a VERY hot partition.
Each request consumes about 5 RUs.
The consistency level is BOUNDED_STALENESS.
I tried changing the consistency level to EVENTUAL which brought the consumed RUs down, but I'm still seeing a huge amount of requests that aren't accounted for.
The Total Requests metric includes every request between the SDK and the service. The SDK makes frequent calls to the service when an SDK instance is first created, then makes regular calls for metadata and other information. If you want to see just the requests made by users, apply a filter for OperationType and select the operations you want to monitor.
It's not clear why you were using Bounded Staleness. Reads using Strong and Bounded Staleness consume twice the RUs because they read from 2 replicas, rather than the 1 replica used by the weaker consistency models. In addition to differences in cost, there are of course differences in whether you may read stale data or not. Consistency levels also play a big role in your RTO and RPO in multi-region scenarios.
A hot partition does not have an impact on throughput consumption, and 5 RUs for a query is actually very good. If you only need weaker consistency for specific reads, you can also override it per request, as sketched below.
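A minimal sketch of a per-request consistency override with the .NET SDK v3; the account, database, and container names are placeholders, and note that a request can only weaken consistency relative to the account default, never strengthen it:

using System;
using Microsoft.Azure.Cosmos;

// Run the query at Eventual consistency for this request only,
// so reads hit one replica instead of two.
Container container = new CosmosClient("<endpoint>", "<key>")
    .GetContainer("<database>", "<container>");

FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(
    "SELECT * FROM x WHERE x.partitionKey = 0 AND x.index1 = 1 AND x.index2 = 2",
    requestOptions: new QueryRequestOptions { ConsistencyLevel = ConsistencyLevel.Eventual });

while (iterator.HasMoreResults)
{
    FeedResponse<dynamic> page = await iterator.ReadNextAsync();
    Console.WriteLine($"Page cost: {page.RequestCharge} RUs"); // actual RU charge per page
}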

Cosmos Write Returning 429 Error With Bulk Execution

We have a solution utilizing a micro-service approach. One of our micro-services is responsible for pushing data to Cosmos. Our Cosmos database uses serverless provisioning, which has a 5,000 RU/s limit.
The data we are inserting into Cosmos looks like the table below. There are 10 columns, and we are pushing a batch containing 5,807 rows of this data.
| Id | CompKey | Primary Id | Secondary Id | Type | DateTime | Item | Volume | Price | Fee  |
|----|---------|------------|--------------|------|----------|------|--------|-------|------|
| 1  | Veg_Buy | csd2354csd | dfg564dsfg55 | Buy  | 30/08/21 | Leek | 10     | 0.75  | 5.00 |
| 2  | Veg_Buy | sdf15s1dfd | sdf31sdf654v | Buy  | 30/08/21 | Corn | 5      | 0.48  | 3.00 |
We retrieve data from multiple sources, normalize it, and send the data out as one bulk execution to Cosmos. The retrieval process happens every hour, so we spike the Cosmos database once per hour with the retrieved data and then send nothing until the next retrieval cycle. If this high peak is the problem, what remedies exist for such a scenario?
Can anyone shed some light on what we should do to overcome this issue? Perhaps we are missing a setting when creating the Cosmos database, or possibly this has something to do with partitioning?
You can mostly determine these things by looking at the metrics published in the Azure portal. This doc is a good place to start: Monitor and debug with insights in Azure Cosmos DB.
In particular, I would look at the section titled "Determine the throughput consumption by a partition key range".
If you are not dealing with a hot partition key, you may want to look at options to throttle your writes. This may include reducing your batch size and putting the write operations in a loop with a one-second delay, so that the RUs consumed each second stay within the 5,000 RU/s limit (a sketch follows below). You could also look at queue-based load leveling: put the writes on a queue in front of Cosmos and stream them in at a controlled rate.
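Here is a minimal sketch of the throttled-writes idea with the .NET SDK v3 in bulk mode. The Row shape, the LoadNormalizedBatch helper, the assumption that CompKey is the partition key, and the chunk size and delay are all placeholders to tune against your own RU metrics:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Enable bulk mode, then write the hourly batch in fixed-size chunks
// with a pause between chunks so the spike stays under 5,000 RU/s.
CosmosClient client = new CosmosClient("<endpoint>", "<key>",
    new CosmosClientOptions { AllowBulkExecution = true });
Container container = client.GetContainer("<database>", "<container>");

List<Row> items = LoadNormalizedBatch(); // your ~5,807 normalized rows

foreach (Row[] chunk in items.Chunk(500))
{
    // Dispatch the whole chunk concurrently; bulk mode packs the point writes together.
    await Task.WhenAll(chunk.Select(r =>
        container.CreateItemAsync(r, new PartitionKey(r.CompKey))));
    await Task.Delay(TimeSpan.FromSeconds(1)); // spread the hourly spike over time
}

static List<Row> LoadNormalizedBatch() => new(); // stand-in for your retrieval step

// Hypothetical shape of one normalized row from the table above.
record Row(string id, string CompKey, string Type, string Item, int Volume, decimal Price, decimal Fee);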

Request Timeout in Azure Cosmos DB in sdk v3

I am inserting data into Azure Cosmos DB. After some time it throws an error (Request Timeout: 408). I have increased the request timeout to 10 minutes.
Also, I iterate over each item from the API and call the CreateItemAsync() method instead of using the bulk executor.
Data To Insert = 430 K Items
Microsoft.Azure.Cosmos SDK used = v3
Container Throughput = 400
Can anyone help me fix this issue?
Just increase your throughput, but be aware that it's going to cost you a lot of money if you leave it increased. 400 RU/s isn't going to cut it unless you batch your operations to the point where inserting 430K items takes a very long time.
If this is a one-time deal, increase your RU/s to 2000+, then start slowly inserting items. I would say, depending on the size of your documents, maybe do 50 at a time, then wait 250 milliseconds, then do 50 more until you are done (see the sketch below). You will have to experiment with these numbers, though.
Once you are done, move your RU/s back down to 400.
Cosmos DB can be ridiculously expensive, so be careful.
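A minimal sketch of that pacing with the .NET SDK v3; the account and container names, the Doc shape, and the LoadItems helper are placeholders:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Insert 50 items, pause 250 ms, repeat until done.
Container container = new CosmosClient("<endpoint>", "<key>")
    .GetContainer("<database>", "<container>");

int sent = 0;
foreach (Doc doc in LoadItems()) // the 430K documents
{
    await container.CreateItemAsync(doc, new PartitionKey(doc.pk));
    if (++sent % 50 == 0)
        await Task.Delay(250); // cool-off after every 50 inserts; tune both numbers
}

static IEnumerable<Doc> LoadItems() => new List<Doc>(); // stand-in for your data source
record Doc(string id, string pk);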
ETA:
This is from some documentation:
Increase throughput: The duration of your data migration depends on the amount of throughput you set up for an individual collection or a set of collections. Be sure to increase the throughput for larger data migrations. After you've completed the migration, decrease the throughput to save costs. For more information about increasing throughput in the Azure portal, see performance levels and pricing tiers in Azure Cosmos DB.
The documentation page for 408 timeouts lists a number of possible causes to investigate.
Aside from addressing the root cause with the SDK client app or increasing throughput, you might also consider leveraging Azure Data Factory to ingest the data as in this example. This assumes your data load is an initialization process and your data can be made available as a blob file.

What should be done when the provisioned throughput is exceeded?

I'm using the AWS SDK for JavaScript (Node.js) to read data from a DynamoDB table. The auto scaling feature does a great job most of the time, and the consumed Read Capacity Units (RCU) are really low for most of the day. However, there's a scheduled job that executes around midnight and consumes about 10x the provisioned RCU, and since auto scaling takes some time to adjust the capacity, there are a lot of throttled read requests. Furthermore, I suspect my requests are not being completed (though I can't find any exceptions in my error log).
In order to handle this situation, I've considered increasing the provisioned RCU using the AWS API (updateTable), but calculating the number of RCU my application needs may not be straightforward.
So my second guess was to retry failed requests and simply wait for auto scaling to increase the provisioned RCU. As pointed out by the AWS docs and some Stack Overflow answers (particularly about ProvisionedThroughputExceededException):
The AWS SDKs for Amazon DynamoDB automatically retry requests that receive this exception. So, your request is eventually successful, unless the request is too large or your retry queue is too large to finish.
I've read similar questions (this one, this one and this one) but I'm still confused: is this exception raised only when the request is too large or the retry queue is too large to finish (that is, after the automatic retries), or can it also be raised before the retries?
Most important: is that the exception I should be expecting in my context? (so I can catch it and retry until auto scale increases the RCU?)
Yes.
Every time your application sends a request that exceeds your capacity, you get a ProvisionedThroughputExceededException from DynamoDB. However, your SDK handles this for you and retries. The default retry delay starts at 50 ms, the default number of retries is 10, and the backoff is exponential by default (the delay before retry n is 50 ms × 2^(n−1)).
This means you get retries at:
50ms
100ms
200ms
400ms
800ms
1.6s
3.2s
6.4s
12.8s
25.6s
If after the 10th retry your request has still not succeeded, the SDK passes the ProvisionedThroughputExceededException back to your application and you can handle it how you like.
You could handle it by increasing the provisioned throughput, but another option is to change the default retry behavior when you create the DynamoDB connection. For example:
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB({maxRetries: 13, retryDelayOptions: {base: 200}});
This means you retry 13 times, with an initial delay of 200 ms. Since the backoff doubles each time, the final retry interval grows to 819.2 s rather than 25.6 s, giving your request far longer to eventually succeed.

Azure DocumentDB Throttled Requests

I have a DocumentDB database on Azure. I have a particularly heavy query that happens when I archive a user record and all of their data.
I was on the S1 plan and would get an exception indicating I was hitting the RU/s limit. The S1 plan has 250 RU/s.
I decided to switch to the Standard plan that lets you set the RU/s and pay for it.
I set it to 500 RU/s.
I did the same query and went back and looked at the monitoring chart.
At the time I did this latest query test it said I did 226 requests and 10 were throttled.
Why is that? I set it to 500 RU/s. The query had failed, by the way.
Firstly, Requests != Request Units, so your 226 requests will at some point have caused more than 500 Request Units to be needed within one second.
The DocumentDB API will tell you how many RUs each request costs, so you can examine that client-side to find out which request is causing the problem. From my experience, even a simple by-id request often costs at least a few RUs.
How you see that cost depends on which client-side SDK you use. In my code, I have added something to automatically log all requests that cost more than 10 RUs, just so I know and can take action.
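For illustration, a sketch of that logging idea with the current .NET SDK v3 (the DocumentDB-era SDK exposed the same number via ResourceResponse); the account, container, id, and partition key are placeholders:

using System;
using Microsoft.Azure.Cosmos;

// Every response reports the RUs the request actually consumed.
Container container = new CosmosClient("<endpoint>", "<key>")
    .GetContainer("<database>", "<container>");

ItemResponse<dynamic> response =
    await container.ReadItemAsync<dynamic>("<id>", new PartitionKey("<pk>"));

if (response.RequestCharge > 10) // log anything costing more than 10 RUs
    Console.WriteLine($"Expensive request: {response.RequestCharge} RUs");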
It's also the case that the monitoring tools in the portal are quite inadequate and I know the team are working on that; you can only see the total RUs for every five minute interval, but you may try to use 600 RUs in one second and you can't really see that in the portal.
In your case, you may have a single big query that simply costs more than 500 RUs; the logging will tell you. In that case, look at the generated SQL to see why, and maybe even post it here.
Alternatively, it may be the cumulative effect of lots of small requests being fired off in a small time window. If you are doing 226 requests in response to one user action (and I don't know if you are) then you probably want to reconsider your design :)
Finally, you can retry failed requests. I'm not sure about other SDKs, but the .NET SDK retries a request automatically 9 times before giving up (that might be another explanation for the 226 requests hitting the server).
If your chosen SDK doesn't retry, you can easily do it yourself; the server will return a specific status code (I think 429 but can't quite remember) along with an instruction on how long to wait before retrying.
Please examine the queries and update your question so we can help further.
