Bulk delete in Azure CosmosDB leads to 429 error - azure

I've implemented bulk deletion as recommended with newer SDK. Created a list of tasks to delete each item and then awaited them all. And my CosmosClient was configured with BulkOperations = true. As I understand, it's implied that under the hood new SDK does its magic and performs bulk operation.
Unfortunatelly, I've encountered 429 response status. Meaning my multiple requests hit request rate limit (it is low, development only tier, but nontheless). I wonder, how single bulk operation might cause 429 error. And how to implement bulk deletion in not "per item" fashion.
UPDATE: I use Azure Cosmos DB .NET SDK v3 for SQL API with bulk operations support as described in this article https://devblogs.microsoft.com/cosmosdb/introducing-bulk-support-in-the-net-sdk/

You need to handle 429s for deletes the way you'd handle for any operation by creating an exception block, trapping for the status code, then checking the retry-after value in the header, then sleeping and retrying after that amount of time.
PS if you're trying to delete all the data in the container, it can be more efficient to delete then recreate the container.

Related

Maintain a distributed incremental counter in Azure cosmos DB

I am fairly new to cosmos DB and was trying to understand the increment operation that azure cosmos DB SDK provides for Java for patching a document.
I have a requirement to maintain an incremental counter in one of the Documents in the container. The document looks like this-
{"counter": 1}
Now from my application I want to increment this counter by a value of 1 every time an action happens. For this I am using CosmosPatchOperations. I add an increment here like this cosmosPatch.increment("/counter", 1) which works fine.
Now this application can have multiple instances running, all of them talking to same document in the cosmos container. So App1 and App2 both could trigger an increment at the same time. The SDK method returns the updated document and I need to use that updated value.
My question here would be that does cosmos DB here employ some locking mechanism to make sure both the patches happen one after another and also in this case what would be the updated value that I would get in App1 and App2 (The SDK method returns the updated document). Will it be 2 in one of them and 3 in the other one?
Couchbase supports such a counter at cluster level as explained here and it has been working perfectly for me without any concurrency issues. I am now migrating to cosmos Db and have been struggling to find how can this be achieved.
Update 1:
I decided to test this. I set up the cosmos emulator in my local mac and created a DB and container with automatically increasing RUs starting from 1 to 10K. Then in this container I added a document like this -
{
"id": "randomId",
"counter": 0
}
Post this I created a simple API whose responsibility is just to increment the counter by 1 every-time it is invoked. Then I used locust to invoke this API multiple times to mimic a small load-like scenario.
Initially the test ran fine with each invocation receiving a counter like it is supposed to (in an incremental manner). On increasing the load I saw some errors namely RequestTimeOutException with status code 408. Other requests were still working fine with them getting the correct counter value. I do not understand what caused RequestTimeOut exceptions here. The stack trace hints something to do with concurrency but I am not able to get my head around it. Here's the stack trace-
Update 2:
The test run in Update 1 was done on my local machine and I realised I might have resource issues on my local leading to those errors. Decided to test this in a Pre-Prod environment with actual cosmos DB and not emulator.
Test configuration-
Cosmos DB container with RUs to automatically scale from 400 to 4000
2 instances of application sharing the load.
Locust script to ingest load on the application
Findings-
Up until ~170 TPS, everything was running smoothly. Beyond that I noticed errors belonging to 2 different buckets-
"exception": "["Request rate is large. More Request Units may be needed, so no changes were made. Please retry this request later. Learn more: http://aka.ms/cosmosdb-error-429"]".
I am not sure how 170 odd patch operations would have exhausted 4000 RUs but that's a different discussion altogether.
"exception": "["Conflicting request to resource has been attempted. Retry to avoid conflicts."]", with status code 449.
This error clearly indicates that cosmos DB doesn't handle concurrent requests. I want to understand if they maintain a queue internally to handle some requests or they don't handle any concurrent writes at all.
PATCH is not different from other operations, Fundamentally CosmosDB implements Optimistic Concurrency Control unlike the relational databases which have these mechanisms. Optimistic Concurrency Control (OCC) allows you to prevent lost updates and to keep your data correct. OCC can be implemented by using the etag of a document. T Each document within Azure Cosmos DB has an E_TAG property.
In your scenario, yes it will return 2 in one of them and 3 in other one given both get succeeded, because SDK has the retry mechanism and it's explained here. Also have a look at this sample.
If your Azure Cosmos DB account is configured with multiple write
regions, conflicts and conflict resolution policies are applicable at
the document level, with Last Write Wins (LWW) being the default
conflict resolution policy

Dealing with Azure Cosmos DB cross-partition queries in REST API

I'm talking to Cosmos DB via the (SQL) REST API, so existing questions that refer to various SDKs are of limited use.
When I run a simple query on a partitioned container, like
select value count(1) from foo
I run into a HTTP 400 error:
The provided cross partition query can not be directly served by the gateway. This is a first chance (internal) exception that all newer clients will know how to handle gracefully. This exception is traced, but unless you see it bubble up as an exception (which only
happens on older SDK clients), then you can safely ignore this message.
How can I get rid of this error? Is it a matter of running separate queries by partition key? If so, would I have to keep track of what the existing key values are?

How can we reprocess cosmos lease document using Azure Function?

I am new to Azure Function, recently we tried to use CosmosDBTriggered Function that needs to create the lease document, we noticed that when there is something changed in the Cosmos Container, there will be a new entry added into the lease document, but we don't understand what these items mean and how could we use it in other scenario instead just log it. In addition, sometimes we would have an exception in the CosmosDBTriggered Function, while exception happened our function just stops itself and we're losing all changed documents in that instance, so we're thinking if there is anyway to recapture our changed items in last triggered event by using the lease document, but not sure what the lease document could tell us, could someone explain if that is approachable?
From the official documentation at https://learn.microsoft.com/azure/cosmos-db/change-feed-functions
The lease container: The lease container maintains state across multiple and dynamic serverless Azure Function instances and enables dynamic scaling. This lease container can be manually or automatically created by the Azure Functions trigger for Cosmos DB. To automatically create the lease container, set the CreateLeaseCollectionIfNotExists flag in the configuration. Partitioned lease containers are required to have a /id partition key definition.
Going to your second question, error handling. The reference document is: https://learn.microsoft.com/azure/cosmos-db/troubleshoot-changefeed-functions
The Azure Functions trigger for Cosmos DB, by default, won't retry a batch of changes if there was an unhandled exception during your code execution.
If your code throws an unhandled exception, the current batch of changes that was being processed is lost because the Function will exit and record an Error, and continue with the next batch.
In this scenario, the best course of action is to add try/catch blocks in your code and inside the loops that might be processing the changes, to detect any failure for a particular subset of items and handle them accordingly (send them to another storage for further analysis or retry).
So, make sure you have try/catch blocks in your foreach/for statements, detect any Exception, deadletter that failed document, and continue with the next in the batch.
This approach is common to all event-based Function triggers, like Event Hub. For reference: https://hackernoon.com/reliable-event-processing-in-azure-functions-37054dc2d0fc
If you want to reset a Cosmos DB Trigger to go back and replay the documents from the start, after already having the Trigger working for some time, you need to:
Stop your Azure function if it is currently running.
Delete the documents in the lease collection (or delete and re-create the lease collection so it is empty)
Set the StartFromBeginning CosmosDBTrigger attribute in your function to true.
Restart the Azure function. It will now read and process all changes from the beginning.

How to retry failed update requests(412 error code) in azure table storage operations?

I am looking for a solution to handle the update failures in azure table operations due to error code 412.My application will make concurrent update requests to the table and most of the time it fails with code 412.At that time I need to retry the request and make it right.
The type of updates are like, each request will read the data and union it with new data and place it back.The challenge is, my application need to handle large amount of requests like this in fraction of seconds
From what you describe using azure table storage I do not think there is any other way than what you are doing already. So either update the ETag and resend the request or else unconditionally overwrite what is there which will cause you lose updates.
In your case I would also experiment with Azure Document Db so you could in theory push that union logic to the server side as a stored procedure which could transparently do the retry on ETag failure. That should be much faster because you would not need to do any I/O request for retries from client side and assume whatever you get back from the db is the latest and the most up to date Boundary entity.

Scaling an Azure Elastic Pool with .NET Fluent API

I'm using the Azure Fluent API, Azure Management Libraries for .NET, to scale the DTU's within an Azure Elastic Pool and would like to know if it's possible to trigger an update without having to wait for the processing to complete.
Currently the following block of code will wait until the Elastic Pool has finished scaling before it continues execution. With a large premium Elastic Pool this could mean that the this line will take up to 90 minutes to complete.
ElasticPool
.Update()
.WithDtu(1000)
.Apply();
There's also a ApplyAsync() method which i could deliberately not await to allow the program to continue execution, if i take this approach the program will end execution shortly after calling this line and i am unsure if this library has been designed to work in this fashion.
Does anyone know of a better solution to trigger an update without having to wait on a response? Or if it is safe to fire the async method without waiting for a response?
There is currently no way to make a fire and forget calls in the Fluent SDK for update scenarios but we are looking to the ways of enabling a manual status polling in the future. One option would be to create a thread that will wait on the completion. The other one is to use the Inner getter and make a low level BeginCreateOrUpdateAsync/BeginUpdateAsync method calls and then do manual polls.
On the side note if you need to make multiple calls and then wait for completion of all of them you can use Task.WaitAll(...) and provide the list of the ApplyAsync tasks.
Please log an issue in the repo if you will hit any errors because that way you will be able to track the progress of the fix.
edit: FYI the call is blocking not because SDK is waiting for the response from Azure but that SDK waits until the call is completed, operation of update is finished and the resource is ready to be used for further operations. Just firing an update and then trying to use resource will cause error responses if in your case Elastic Pool is still in the middle of the update.

Resources