Cosmos DB Query Intermittent latency - azure

I have a Cosmos DB client registered as a singleton with default options. I'm using a .NET 6.0 WebAPI project, running in an Azure App Service with "Always On" enabled. The App Service and the Cosmos account are in the same region, UE2. The API queries a Cosmos container and returns the result.
I've noticed that the first query is always slow (4-6 seconds). Subsequent queries are much faster (~100 ms) but sometimes also show random high latency. This is not a cold start scenario: the client has already been initialized by the DI pipeline. I'm also not being rate limited.
Here is my singleton client:
public CosmosDbService(IConfiguration configuration)
{
    var account = configuration.GetSection("CosmosDb")["Account"];
    var key = configuration.GetSection("CosmosDb")["Key"];
    var databaseName = configuration.GetSection("CosmosDb")["DatabaseName"];
    var containerName = configuration.GetSection("CosmosDb")["Container"];
    CosmosClient client = new (account, key);
    _myContainer = client.GetContainer(databaseName, containerName);
}
Here is the meat of the query, where a LINQ query is passed in:
public class RetrieveCarRepository : IRetrieveCarRepository
{
    public async Task<List<CarModel>> RetrieveCars(IQueryable<CarModel> querydef)
    {
        var query = querydef.ToFeedIterator();
        List<CarModel> cars = new ();
        while (query.HasMoreResults)
        {
            var response = await query.ReadNextAsync();
            foreach (var car in response) cars.Add(car);
        }
        return cars;
    }
}
I've been through several Cosmos training videos and courses but still haven't been able to work out what is happening.

From the comments.
For query performance using the .NET SDK please see:
Query Plan generation can affect latency and can be avoided if:
The query is reworked to be on a single partition (instead of cross-partition).
The workload runs on Windows, compiled as x64 and with the NuGet DLLs co-located, which in turn leverages local query plan generation through ServiceInterop.dll.
In both cases the Query Plan request is removed and latency improves.
As a general rule, latency should be investigated on the P99 across 1h to understand how it is impacted. A couple of higher latency requests can always happen.
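To make the P99 advice concrete, here is a minimal sketch that computes a percentile over a window of recorded latencies using the nearest-rank method. The class name and the sample values are hypothetical and not part of the Cosmos DB SDK; in practice the samples would come from your own request timings collected over the hour.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any SDK: nearest-rank percentile over latency samples.
class LatencyPercentile {
    // Returns the p-th percentile (0 < p <= 100) of the samples, in the same unit as the input.
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Nearest-rank: the smallest value such that at least p% of samples are <= it.
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With 100 samples of which 99 are 100 ms and one is 5000 ms, the P99 is still 100 ms, which is why a single slow request is usually not a signal on its own. A second slow request in the same window moves the P99 to 5000 ms.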
Keep also in mind that query latency will vary based on the type of query, volume of data to transfer, and number of pages. You can capture the Diagnostics and use:


How to have multiple instances of changefeed listeners get the same message: Java

We are using Cosmos change feed listeners to update the edge cache in ephemeral Java services. That means every instance, of which there can be an arbitrary number, should receive every change feed event. We used a UUID as the "hostname", but not all instances are getting the change feed. I read somewhere that there is a leasePrefix. Will that work? If so, how do I do that on the Java side?
Yes, a lease prefix will help in this case. Consider a scenario where you want to do multiple things whenever there is a new event in a particular Azure Cosmos container. If the actions you want to trigger are independent of one another, the ideal solution is to create one Cosmos DB listener per action, all listening for changes on the same Azure Cosmos container.
Given the requirements of the listeners for Cosmos DB, we need a second container to store state, also called the leases container. Does this mean that you need a separate leases container for each Azure Function?
Here, you have two options:
Create one leases container per Listener: This approach can translate into additional costs, unless you're using a shared throughput database. Remember, that the minimum throughput at the container level is 400 Request Units, and in the case of the leases container, it is only being used to checkpoint the progress and maintain state.
Have one lease container and share it for all your Listeners: This second option makes better use of the provisioned Request Units on the container, as it enables multiple Listeners to share and use the same provisioned throughput.
Here is an example of a Function App that implements this in Java:
Code for quick reference:
public void cosmosDbProcessor(
    @CosmosDBTrigger(name = "items",
        databaseName = "ToDoList",
        collectionName = "Items",
        leaseCollectionName = "leases",
        leaseCollectionPrefix = "prefix",
        createLeaseCollectionIfNotExists = true,
        connectionStringSetting = "AzureCosmosDBConnection") String[] items,
    final ExecutionContext context) {
    context.getLogger().info(items.length + " item(s) is/are changed.");
}

Azure CosmosDB: Bulk deletion using SDK

I want to delete 20-30k items in bulk. Currently I am using the method below to delete these items, but it's taking 1-2 minutes.
private async Task DeleteAllExistingSubscriptions(string userUUId)
{
    var subscriptions = await _repository
        .GetItemsAsync(x => x.DistributionUserIds.Contains(userUUId), o => o.PayerNumber);
    if (subscriptions.Any())
    {
        List<Task> bulkOperations = new List<Task>();
        foreach (var subscription in subscriptions)
        {
            bulkOperations.Add(_repository
                .DeleteItemAsync(subscription.Id.ToString(), subscription.PayerNumber).CaptureOperationResponse(subscription));
        }
        await Task.WhenAll(bulkOperations);
    }
}
Cosmos client: as you can see, I have already set AllowBulkExecution = true.
private static void RegisterCosmosClient(IServiceCollection serviceCollection, IConfiguration configuration)
{
    // Fail fast if either setting is missing.
    string cosmosDbEndpoint = configuration["CosmoDbEndpoint"]
        ?? throw new InvalidOperationException("Unable to locate configured CosmosDB endpoint");
    string cosmosDbAuthKey = configuration["CosmoDbAuthkey"]
        ?? throw new InvalidOperationException("Unable to locate configured CosmosDB auth key");
    serviceCollection.AddSingleton(s => new CosmosClient(cosmosDbEndpoint, cosmosDbAuthKey,
        new CosmosClientOptions { AllowBulkExecution = true }));
}
Is there any way to delete these items in a batch with Cosmos DB SDK 3.0 in less time?
Please check the metrics to see whether the operations you are sending are being throttled because your provisioned throughput is not enough.
Bulk just improves the client-side aspect of sending that data by optimizing how it flows from your machine to the account, but if your container is not provisioned to handle that volume of operations, then operations will get throttled and the time it takes to complete will be longer.
As with any data flow scenario, the bottlenecks are:
The source environment cannot process the data as fast as you want, which would show as a bottleneck/spike on the machine's CPU (processing more data would require more CPU).
The network's bandwidth has limitations. In some cases the network limits the amount of data it can transfer or even the number of connections it can open. If the machine you are running the code on has such limitations (for example, Azure VMs have SNAT, Azure App Service has TCP limits) and you are running into them, new connections might get delayed, which increases latency.
The destination has limits in the amount of operations it can process (in the form of provisioned throughput in this case).
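The destination limit can be turned into a rough lower bound on the total time. The sketch below is plain arithmetic under stated assumptions: the per-delete cost of 10 RU and the provisioned 4,000 RU/s are made-up numbers for illustration, and the class name is hypothetical. The real per-operation charge is reported on each response (RequestCharge in the .NET SDK) and depends on item size and indexing.

```java
// Hypothetical helper: lower bound on bulk delete duration imposed by provisioned throughput.
class BulkDeleteEstimate {
    // Total RU needed divided by the RU available per second.
    // Client-side bulk optimizations cannot go below this floor.
    static double minSeconds(long itemCount, double ruPerDelete, double provisionedRuPerSecond) {
        return itemCount * ruPerDelete / provisionedRuPerSecond;
    }

    public static void main(String[] args) {
        // Assumed values: 30,000 items, 10 RU per delete, 4,000 RU/s provisioned.
        System.out.println(minSeconds(30_000, 10.0, 4_000.0) + " seconds at minimum");
    }
}
```

If the measured time is close to this floor, the container's throughput is the bottleneck and only more RU/s (or cheaper operations) will help. If it is far above the floor, look at client CPU and network limits instead.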

Checking connection to Azure Service Bus

I have some code that depends on Azure Service Bus. I've created an endpoint that checks the availability of my Azure Service Bus topic using the following code:
var connectionString = CloudConfigurationManager.GetSetting("servicebusconnectionstring");
var manager = NamespaceManager.CreateFromConnectionString(connectionString);
var sub = manager.GetSubscription("mytopic", "mysubscription");
var count = sub.MessageCount;
This actually works, but I have two questions (since I'm constantly experiencing timeouts using this code).
Question 1: Is there an easier/better way of checking Service Bus connectivity from C#?
Question 2: When using the code above, which instances should I configure as singletons in my IoC container? I suspect that creating all instances every time I ping this endpoint causes the timeout, since I don't see problems in my other endpoints where I reuse a TopicClient.
Getting MessageCount is potentially an expensive operation, especially if the value is high.
You could run a simple operation, like checking whether the topic exists:
var ns = NamespaceManager.CreateFromConnectionString("...");
var exists = ns.TopicExists("mytopic");
which will throw an exception (probably MessagingCommunicationException) if communication to Service Bus fails.
It's OK to reuse NamespaceManager between requests, so you can make it a singleton. I'm not sure whether that brings any measurable performance benefit, though.
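The same idea, a cheap probe whose failure means "unhealthy", can be sketched independently of the Service Bus SDK. The Java class below is a hypothetical illustration, not part of any Azure library: it runs a probe with a deadline and treats an exception or a timeout as a failed check, so the health endpoint itself does not hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds a connectivity probe with a timeout.
class ConnectivityCheck {
    // Returns true only if the probe completes without throwing before the deadline.
    static boolean isHealthy(Callable<Boolean> probe, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(probe).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } catch (ExecutionException | TimeoutException e) {
            // The probe threw (for example, a communication error) or took too long.
            return false;
        } finally {
            executor.shutdownNow(); // cancels a probe that is still running
        }
    }
}
```

In the scenario above, the probe would be the topic existence check performed on a reused manager instance. This sketch creates an executor per call for brevity; a long-lived service would more likely share one.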

Azure Table Storage Performance

How fast should I be expecting the performance of Azure Storage to be? I'm seeing ~100ms on basic operations like getEntity, updateEntity, etc.
This guy seems to be getting 4ms which makes it look like something is really wrong here!
I'm using the azure-table-node npm plugin.
A simple getEntity call is taking ~90ms:
exports.get = function(table, pk, rk, callback) {
var start = process.hrtime();
client().getEntity(table, pk, rk, function(err, entity) {
The azure-storage module appears to be even slower:
var start = process.hrtime();
azureClient.retrieveEntity(table, pk, rk, function(err, entity) {
retrieveEntity 174 ms
Well, it really depends on where you are accessing Azure Storage from.
Are you trying to access the storage from the same datacenter or from somewhere on the Internet?
If your code is not running in the same datacenter, then it's just a matter of the network latency of an HTTP request to the datacenter where your storage is running. This can vary a lot, depending on where you're accessing the DC from and in which region your DC is located. To get an idea, you can check the latency from your PC to the storage endpoints of all Azure DCs here:
If your code is running in the same DC, everything should be pretty fast for simple operations such as the ones you are trying out, probably just a few milliseconds.

How to manage centralized values in a sharded environment

I have an ASP.NET app being developed for Windows Azure. It's been deemed necessary that we use sharding for the DB to improve write times since the app is very write heavy but the data is easily isolated. However, I need to keep track of a few central variables across all instances, and I'm not sure the best place to store that info. What are my options?
Must be durable, can survive instance reboots
Must be synchronized. It's incredibly important to avoid conflicting updates or at least throw an exception in such cases, rather than overwriting values or failing silently.
Must be reasonably fast (2000+ reads/writes per second)
I thought about writing a separate component to run on a worker role that simply reads/writes the values in memory and flushes them to disk every so often, but I figure there's got to be something already written for that purpose that I can appropriate in Windows Azure.
I think what I'm looking for is a system like Apache ZooKeeper, but I don't want to have to deal with installing the JRE during the worker role startup and all that jazz.
Edit: Based on the suggestion below, I'm trying to use Azure Table Storage using the following code:
var context = table.ServiceClient.GetTableServiceContext();
var item = context.CreateQuery<OfferDataItemTableEntity>(table.Name)
    .Where(x => x.PartitionKey == Name).FirstOrDefault();
if (item == null)
{
    item = new OfferDataItemTableEntity(Name);
    context.AddObject(table.Name, item);
}
if (item.Allocated < Quantity)
{
    allocated = ++item.Allocated;
    context.UpdateObject(item);
    return true;
}
However, the context.UpdateObject(item) call fails with The context is not currently tracking the entity. Doesn't querying the context for the item initially add it to the context tracking mechanism?
Have you looked into SQL Azure Federations? It seems like exactly what you're looking for: sharding for SQL Azure.
Here are a few links to read:
What you need is Table Storage since it matches all your requirements:
Durable: Yes, Table Storage is part of a Storage Account, which isn't related to a specific Cloud Service or instance.
Synchronized: Yes, Table Storage is part of a Storage Account, which isn't related to a specific Cloud Service or instance.
It's incredibly important to avoid conflicting updates: Yes, this is possible with the use of ETags
Reasonably fast? Very fast, up to 20,000 entities/messages/blobs per second
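Here is some sample code that uses the new storage SDK (2.0):
var storageAccount = CloudStorageAccount.DevelopmentStorageAccount;
var table = storageAccount.CreateCloudTableClient()
    .GetTableReference("mytable"); // table name is a placeholder
table.CreateIfNotExists();
// Add item.
table.Execute(TableOperation.Insert(new MyEntity() { PartitionKey = "", RowKey = "123456", Customer = "Sandrino" }));
// Two users read the same record.
var user1record = table.Execute(TableOperation.Retrieve<MyEntity>("", "123456")).Result as MyEntity;
var user2record = table.Execute(TableOperation.Retrieve<MyEntity>("", "123456")).Result as MyEntity;
// User 1 saves first; this succeeds.
user1record.Customer = "Steve";
table.Execute(TableOperation.Replace(user1record));
// User 2 saves a stale copy; the ETag no longer matches, so this throws.
user2record.Customer = "John";
table.Execute(TableOperation.Replace(user2record));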
First it adds the item 123456.
Then I'm simulating 2 users getting that same record (imagine they both opened a page displaying the record).
User 1 is fast and updates the item. This works.
User 2 still had the window open. This means he's working on an old version of the item. He updates the old item and tries to save it. This causes the following exception (this is possible because the SDK matches the ETag):
The remote server returned an error: (412) Precondition Failed.
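The ETag check behind that 412 can be illustrated without any storage account. The Java sketch below is an in-memory stand-in with hypothetical class names. It assumes the ETag is a simple counter, whereas real Table Storage ETags are opaque strings generated by the service. The rule is the same: a replace succeeds only if the caller's ETag still matches the stored one.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a table, used only to illustrate ETag matching.
class EtagStore {
    static final class Versioned {
        final String value;
        final long etag;
        Versioned(String value, long etag) { this.value = value; this.etag = etag; }
    }

    static final class PreconditionFailedException extends RuntimeException {
        PreconditionFailedException(String message) { super(message); }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    synchronized void insert(String key, String value) {
        rows.put(key, new Versioned(value, 1));
    }

    synchronized Versioned retrieve(String key) {
        return rows.get(key);
    }

    // Replaces the row only if the caller read the current version; otherwise rejects the write.
    synchronized void replace(String key, String newValue, long expectedEtag) {
        Versioned current = rows.get(key);
        if (current == null || current.etag != expectedEtag) {
            throw new PreconditionFailedException("412 Precondition Failed");
        }
        rows.put(key, new Versioned(newValue, current.etag + 1));
    }
}
```

Replaying the two-user scenario: both users retrieve the row with the same ETag, user 1's replace succeeds and bumps the ETag, and user 2's replace with the old ETag is rejected instead of silently overwriting "Steve".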
I ended up with a hybrid cache / table storage solution. All instances track the variable via Azure caching, while the first instance spins up a timer that saves the value to table storage once per second. On startup, the cache variable is initialized with the value saved to table storage, if available.
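A single-process Java sketch of that write-behind pattern is shown below. It is an approximation under stated assumptions: an AtomicLong stands in for the shared cache, the durableSink callback stands in for the table storage write, and all names are hypothetical. The original solution used a distributed cache across instances, which this sketch does not reproduce.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical write-behind counter: fast in-memory updates, periodic durable saves.
class WriteBehindCounter implements AutoCloseable {
    private final AtomicLong value;
    private final LongConsumer durableSink;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // initialValue is what was last saved durably (or a default if nothing was saved).
    WriteBehindCounter(long initialValue, LongConsumer durableSink, long flushPeriodMillis) {
        this.value = new AtomicLong(initialValue);
        this.durableSink = durableSink;
        timer.scheduleAtFixedRate(this::flush, flushPeriodMillis, flushPeriodMillis, TimeUnit.MILLISECONDS);
    }

    // Atomic, so concurrent callers never overwrite each other's updates.
    long increment() {
        return value.incrementAndGet();
    }

    // Saves the current value to the durable store.
    void flush() {
        durableSink.accept(value.get());
    }

    @Override
    public void close() {
        timer.shutdownNow();
        flush(); // final save on orderly shutdown
    }
}
```

The trade-off is the one implied by the once-per-second timer: if the process dies between flushes, updates made since the last save are lost.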
