Azure CosmosDB: Bulk deletion using SDK - azure

I want to delete 20-30k items in bulk. Currently I am using the method below to delete these items, but it's taking 1-2 minutes.
private async Task DeleteAllExistingSubscriptions(string userUUId)
{
    var subscriptions = await _repository
        .GetItemsAsync(x => x.DistributionUserIds.Contains(userUUId), o => o.PayerNumber);
    if (subscriptions.Any())
    {
        List<Task> bulkOperations = new List<Task>();
        foreach (var subscription in subscriptions)
        {
            bulkOperations.Add(_repository
                .DeleteItemAsync(subscription.Id.ToString(), subscription.PayerNumber)
                .CaptureOperationResponse(subscription));
        }
        await Task.WhenAll(bulkOperations);
    }
}
Cosmos client: as you can see, I have already set AllowBulkExecution = true.
private static void RegisterCosmosClient(IServiceCollection serviceCollection, IConfiguration configuration)
{
    string cosmosDbEndpoint = configuration["CosmoDbEndpoint"];
    Ensure.ConditionIsMet(cosmosDbEndpoint.IsNotNullOrEmpty(),
        () => new InvalidOperationException("Unable to locate configured CosmosDB endpoint"));
    var cosmosDbAuthKey = configuration["CosmoDbAuthkey"];
    Ensure.ConditionIsMet(cosmosDbAuthKey.IsNotNullOrEmpty(),
        () => new InvalidOperationException("Unable to locate configured CosmosDB auth key"));
    serviceCollection.AddSingleton(s => new CosmosClient(cosmosDbEndpoint, cosmosDbAuthKey,
        new CosmosClientOptions { AllowBulkExecution = true }));
}
Is there any way to delete these items in a batch with CosmosDB SDK 3.0 in less time?

Please check the metrics to see whether the operations you are sending are being throttled because your provisioned throughput is not enough.
Bulk just improves the client-side aspect of sending that data by optimizing how it flows from your machine to the account, but if your container is not provisioned to handle that volume of operations, then operations will get throttled and the time it takes to complete will be longer.
As with any data flow scenario, the bottlenecks are:
The source environment cannot process the data as fast as you want, which would show as a bottleneck/spike on the machine's CPU (processing more data would require more CPU).
The network's bandwidth has limitations. In some cases the network limits the amount of data it can transfer or even the number of connections it can open. If the machine on which you are running the code has such limitations (for example, Azure VMs have SNAT, Azure App Service has TCP limits) and you are running into them, new connections might get delayed, increasing latency.
The destination has limits in the amount of operations it can process (in the form of provisioned throughput in this case).
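The third bottleneck (throttling at the destination) is often handled on the client by capping the number of in-flight operations and backing off when the service answers 429. The following is a minimal, SDK-free sketch in JavaScript, not a description of what the Cosmos DB SDK does internally. The `worker` callback, the `statusCode` and `retryAfterMs` error fields, and both helper names are assumptions made for illustration.

```javascript
// Illustrative sketch only: nothing here calls the real Cosmos DB SDK.
// `worker` stands in for any async operation, such as a delete request.

// Run `worker` over `items` with at most `limit` operations in flight.
async function runWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function lane() {
    while (next < items.length) {
      const index = next++; // safe: no preemption between check and increment
      results[index] = await worker(items[index]);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Retry `operation` when it fails with a throttling error (assumed to carry
// `statusCode === 429` and, optionally, a `retryAfterMs` hint).
async function withRetryOnThrottle(operation, maxAttempts = 5, baseDelayMs = 10) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const throttled = err && err.statusCode === 429;
      if (!throttled || attempt >= maxAttempts) throw err;
      // Prefer the server's hint; otherwise back off exponentially.
      const delayMs = err.retryAfterMs ?? baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would combine the two, for example `runWithLimit(ids, 50, (id) => withRetryOnThrottle(() => deleteItem(id)))`, where `deleteItem` is whatever delete call the application already has. A lower limit trades peak speed for fewer throttled requests.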

Related

Cosmos DB Query Intermittent latency

I have a Cosmos DB client running as a singleton with default options. I'm using a .NET 6.0 WebAPI project, running in an Azure App Service with "Always On" enabled. The App Service and Cosmos account are in the same region, UE2. The API queries a Cosmos container and returns the result.
I've noticed that the first query is always slow (4-6 seconds). Subsequent queries are much faster (~100 ms) but sometimes also show random high latency. This is not a cold-start scenario: the client has already been initialized by the DI pipeline. I'm also not being rate limited.
Here is my singleton client
public CosmosDbService(IConfiguration configuration)
{
    var account = configuration.GetSection("CosmosDb")["Account"];
    var key = configuration.GetSection("CosmosDb")["Key"];
    var databaseName = configuration.GetSection("CosmosDb")["DatabaseName"];
    var containerName = configuration.GetSection("CosmosDb")["Container"];
    CosmosClient client = new (account, key);
    _myContainer = client.GetContainer(databaseName, containerName);
}
Here is the meat of the query where a Linq query is being passed in:
public class RetrieveCarRepository : IRetrieveCarRepository
{
    public async Task<List<CarModel>> RetrieveCars(IQueryable<CarModel> querydef)
    {
        var query = querydef.ToFeedIterator();
        List<CarModel> cars = new ();
        while (query.HasMoreResults)
        {
            var response = await query.ReadNextAsync();
            foreach (var car in response)...do a thing
I've been through several Cosmos training videos and courses but still haven't figured out what is happening.
From the comments.
For query performance using the .NET SDK please see: https://learn.microsoft.com/en-us/azure/cosmos-db/performance-tips-query-sdk?tabs=v3&pivots=programming-language-csharp#use-local-query-plan-generation
Query Plan generation can affect latency and can be avoided if:
The query is reworked to be on a single partition (instead of cross-partition).
The workload runs on Windows, compiled as x64 and with the NuGet DLLs co-located, which in turn leverages local query plan generation through ServiceInterop.dll.
In both cases the query plan request is avoided and latency improves.
As a general rule, latency should be investigated at the P99 over a one-hour window to understand how it is impacted. A couple of higher-latency requests can always happen.
Also keep in mind that query latency will vary based on the type of query, the volume of data to transfer, and the number of pages. You can capture the Diagnostics and use: https://learn.microsoft.com/azure/cosmos-db/troubleshoot-dot-net-sdk-slow-request
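To apply the P99 rule of thumb, latency samples collected over the window can be reduced to a percentile. This is a small sketch using the nearest-rank method. The sample numbers are made up, and `percentile` is a helper written for this illustration, not part of any SDK.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Made-up data: 995 requests at 100 ms plus 5 slow outliers at 5000 ms.
const latenciesMs = [
  ...Array(995).fill(100),
  ...Array(5).fill(5000),
];

const p99 = percentile(latenciesMs, 99);    // rank 990 of 1000
const worst = percentile(latenciesMs, 100); // the maximum
console.log({ p99, worst });
```

With this data the five outliers leave the P99 at 100 ms even though the maximum is 5000 ms, which is why a handful of slow requests is not by itself a sign of a problem.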

How to have multiple instances of changefeed listeners get the same message: Java

We are using Cosmos change feed listeners to update the edge cache in ephemeral Java services. That means every instance, however many there are, should receive every change feed event. We used a UUID as the "hostname", but not all instances are getting the change feed. I read somewhere that there is a leasePrefix. Will that work? If so, how do I do that on the Java side of things?
Yes, a lease prefix will help in this case. Consider a scenario where you want to do multiple things whenever there is a new event in a particular Azure Cosmos container. If the actions you want to trigger are independent of one another, the ideal solution is to create one listener for Cosmos DB per action, all listening for changes on the same Azure Cosmos container.
Listeners for Cosmos DB require a second container to store state, called the leases container. Does this mean that you need a separate leases container for each Azure Function?
Here, you have two options:
Create one leases container per Listener: This approach can translate into additional costs, unless you're using a shared throughput database. Remember that the minimum throughput at the container level is 400 Request Units, and the leases container is only used to checkpoint progress and maintain state.
Have one lease container and share it for all your Listeners: This second option makes better use of the provisioned Request Units on the container, as it enables multiple Listeners to share and use the same provisioned throughput. To share it, give each Listener a different lease prefix so that each one keeps its own set of leases in the same container.
Here is an example of Function App to implement this in Java Language: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-cosmosdb-v2-trigger?tabs=java
Code for quick reference:
@FunctionName("cosmosDBMonitor")
public void cosmosDbProcessor(
    @CosmosDBTrigger(name = "items",
        databaseName = "ToDoList",
        collectionName = "Items",
        leaseCollectionName = "leases",
        leaseCollectionPrefix = "prefix",
        createLeaseCollectionIfNotExists = true,
        connectionStringSetting = "AzureCosmosDBConnection") String[] items,
    final ExecutionContext context) {
    context.getLogger().info(items.length + " item(s) is/are changed.");
}
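Why a lease prefix helps, and why unique host names alone do not, can be seen in a toy model of lease ownership. The sketch below is not the Cosmos DB change feed processor. It only assumes, as a simplification, that each lease is identified by prefix plus partition and has exactly one owner, and that listeners sharing a prefix split the partitions between them. All names (`assignPartitions`, `hostName`, `leasePrefix`) are hypothetical.

```javascript
// Toy model: listeners that share a lease prefix share one set of leases,
// so each partition is delivered to only one of them.
function assignPartitions(listeners, partitions) {
  // Group listeners by lease prefix.
  const groups = new Map();
  for (const listener of listeners) {
    const group = groups.get(listener.leasePrefix) ?? [];
    group.push(listener);
    groups.set(listener.leasePrefix, group);
  }

  // Within a group, hand out partitions round-robin (one owner per lease).
  const received = Object.fromEntries(listeners.map((l) => [l.hostName, []]));
  for (const group of groups.values()) {
    partitions.forEach((partition, i) => {
      const owner = group[i % group.length];
      received[owner.hostName].push(partition);
    });
  }
  return received;
}

const partitions = ["p0", "p1", "p2", "p3"];

// Unique host names but one shared prefix: the partitions are split.
const shared = assignPartitions(
  [
    { hostName: "uuid-a", leasePrefix: "cache" },
    { hostName: "uuid-b", leasePrefix: "cache" },
  ],
  partitions
);

// One prefix per instance: every instance sees every partition.
const separate = assignPartitions(
  [
    { hostName: "uuid-a", leasePrefix: "cache-a" },
    { hostName: "uuid-b", leasePrefix: "cache-b" },
  ],
  partitions
);

console.log(shared, separate);
```

In this model the shared-prefix case gives each instance only half of the partitions, while the per-instance prefixes give both instances all four, which is the behavior an edge cache on every instance needs.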

getState() on IBM Blockchain v2: experiencing long load times

I have deployed a single organization network on IBM Blockchain v2. I am experiencing very slow load times (always over 3 seconds for a single asset.)
I have bumped up specs on the Kubernetes cluster. I have also adjusted some of the resource allocations, but the load time didn't budge.
async query(ctx, key) {
  console.info('query by key ' + key);
  const returnAsBytes = await ctx.stub.getState(key);
  console.info(returnAsBytes);
  if (!returnAsBytes || returnAsBytes.length === 0) {
    // Throw (rather than return) so the transaction is reported as failed.
    throw new Error(`${key} does not exist`);
  }
  const result = JSON.parse(returnAsBytes.toString());
  console.info('result of getState: ');
  console.info(result);
  return JSON.stringify(result);
}
I am wondering if there is a way to get faster results. I also haven't been able to find many resources on proper deployments for IBM Blockchain v2, so I'm unsure if I'm doing something incorrectly.
Unfortunately you haven't provided enough information. However, one area that can affect performance is the client-side application, where an incorrect pattern would be to do the following for every request: create gateway, connect gateway, submit or evaluate transaction, disconnect gateway.
This Jira issue, https://jira.hyperledger.org/projects/FABN/issues/FABN-1319, provides some detail about the gateway lifecycle. The one-line suggestion is: don't create gateways all the time; cache them and use a stale policy to disconnect them after a period of non-use. Note that gateways are bound to an identity, so you would have one gateway for each identity.
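The caching suggestion can be sketched as a small per-identity cache with idle eviction. This is an illustration only: `connect` is a caller-supplied factory standing in for whatever creates and connects a gateway in the real client SDK, and `GatewayCache`, `maxIdleMs`, and `evictIdle` are names invented here.

```javascript
// Sketch of a per-identity gateway cache with idle eviction.
// `connect(identity)` is a hypothetical factory that resolves to an object
// with a `disconnect()` method.
class GatewayCache {
  constructor(connect, maxIdleMs, now = () => Date.now()) {
    this.connect = connect;
    this.maxIdleMs = maxIdleMs;
    this.now = now;           // injectable clock, handy for tests
    this.entries = new Map(); // identity -> { gatewayPromise, lastUsed }
  }

  // Return the cached gateway for this identity, connecting only once.
  async get(identity) {
    let entry = this.entries.get(identity);
    if (!entry) {
      // Cache the promise so concurrent callers share one connection attempt.
      const gatewayPromise = Promise.resolve().then(() => this.connect(identity));
      entry = { gatewayPromise, lastUsed: this.now() };
      this.entries.set(identity, entry);
      // Do not keep a failed connection attempt in the cache.
      gatewayPromise.catch(() => {
        if (this.entries.get(identity) === entry) this.entries.delete(identity);
      });
    }
    entry.lastUsed = this.now();
    return entry.gatewayPromise;
  }

  // Disconnect and drop gateways that have not been used for maxIdleMs.
  async evictIdle() {
    const cutoff = this.now() - this.maxIdleMs;
    for (const [identity, entry] of this.entries) {
      if (entry.lastUsed <= cutoff) {
        this.entries.delete(identity);
        const gateway = await entry.gatewayPromise;
        gateway.disconnect();
      }
    }
  }
}
```

An application would call `get(identity)` for each request and run `evictIdle()` on a timer (for example with `setInterval`), instead of connecting and disconnecting around every transaction.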

How to manage centralized values in a sharded environment

I have an ASP.NET app being developed for Windows Azure. It's been deemed necessary that we use sharding for the DB to improve write times since the app is very write heavy but the data is easily isolated. However, I need to keep track of a few central variables across all instances, and I'm not sure the best place to store that info. What are my options?
Requirements:
Must be durable, can survive instance reboots
Must be synchronized. It's incredibly important to avoid conflicting updates or at least throw an exception in such cases, rather than overwriting values or failing silently.
Must be reasonably fast (2000+ reads/writes per second)
I thought about writing a separate component to run on a worker role that simply reads/writes the values in memory and flushes them to disk every so often, but I figure there's got to be something already written for that purpose that I can appropriate in Windows Azure.
I think what I'm looking for is a system like Apache ZooKeeper, but I don't want to have to deal with installing the JRE during the worker role startup and all that jazz.
Edit: Based on the suggestion below, I'm trying to use Azure Table Storage using the following code:
var context = table.ServiceClient.GetTableServiceContext();
var item = context.CreateQuery<OfferDataItemTableEntity>(table.Name)
    .Where(x => x.PartitionKey == Name).FirstOrDefault();
if (item == null)
{
    item = new OfferDataItemTableEntity(Name);
    context.AddObject(table.Name, item);
}
if (item.Allocated < Quantity)
{
    allocated = ++item.Allocated;
    context.UpdateObject(item);
    context.SaveChanges();
    return true;
}
However, the context.UpdateObject(item) call fails with "The context is not currently tracking the entity." Doesn't querying the context for the item initially add it to the context tracking mechanism?
Have you looked into SQL Azure Federations? It seems like exactly what you're looking for: sharding for SQL Azure.
Here are a few links to read:
http://msdn.microsoft.com/en-us/library/windowsazure/hh597452.aspx
http://convective.wordpress.com/2012/03/05/introduction-to-sql-azure-federations/
http://searchcloudapplications.techtarget.com/tip/Tips-for-deploying-SQL-Azure-Federations
What you need is Table Storage since it matches all your requirements:
Durable: Yes, Table Storage is part of a Storage Account, which isn't related to a specific Cloud Service or instance.
Synchronized: Yes, Table Storage is part of a Storage Account, which isn't related to a specific Cloud Service or instance.
It's incredibly important to avoid conflicting updates: Yes, this is possible with the use of ETags
Reasonably fast? Very fast, up to 20,000 entities/messages/blobs per second
Update:
Here is some sample code that uses the new storage SDK (2.0):
var storageAccount = CloudStorageAccount.DevelopmentStorageAccount;
var table = storageAccount.CreateCloudTableClient()
    .GetTableReference("Records");
table.CreateIfNotExists();
// Add item.
table.Execute(TableOperation.Insert(
    new MyEntity() { PartitionKey = "", RowKey = "123456", Customer = "Sandrino" }));
// Two users read the same record, so both hold the same ETag.
var user1record = table.Execute(TableOperation.Retrieve<MyEntity>("", "123456")).Result as MyEntity;
var user2record = table.Execute(TableOperation.Retrieve<MyEntity>("", "123456")).Result as MyEntity;
// User 1 saves first; this succeeds and changes the ETag.
user1record.Customer = "Steve";
table.Execute(TableOperation.Replace(user1record));
// User 2 saves with the stale ETag; this throws.
user2record.Customer = "John";
table.Execute(TableOperation.Replace(user2record));
First it adds the item 123456.
Then I'm simulating 2 users getting that same record (imagine they both opened a page displaying the record).
User 1 is fast and updates the item. This works.
User 2 still had the window open. This means he's working on an old version of the item. He updates the old item and tries to save it. This causes the following exception (this is possible because the SDK matches the ETag):
The remote server returned an error: (412) Precondition Failed.
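For a shared counter, a common way to handle this 412 is to re-read the entity and retry the update. The sketch below models that loop against an in-memory stand-in for a table. It does not use the Azure Storage SDK, and `InMemoryTable`, `retrieve`, `replace`, and `allocate` are illustrative names. The `Allocated` field and the quantity check mirror the question's code.

```javascript
// In-memory stand-in for a table with ETag checks (not the Azure SDK).
class InMemoryTable {
  constructor() {
    this.rows = new Map(); // key -> { etag, data }
    this.version = 0;
  }
  insert(key, data) {
    this.rows.set(key, { etag: String(++this.version), data: { ...data } });
  }
  retrieve(key) {
    const row = this.rows.get(key);
    if (!row) throw new Error(`no row for key ${key}`);
    return { etag: row.etag, data: { ...row.data } }; // caller gets a copy
  }
  // Replace succeeds only if the caller's ETag is still current.
  replace(key, etag, data) {
    const row = this.rows.get(key);
    if (!row || row.etag !== etag) {
      const err = new Error("Precondition Failed");
      err.statusCode = 412;
      throw err;
    }
    this.rows.set(key, { etag: String(++this.version), data: { ...data } });
  }
}

// Try to allocate one unit; on a conflicting update, re-read and retry.
function allocate(table, key, quantity, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { etag, data } = table.retrieve(key);
    if (data.Allocated >= quantity) return false; // nothing left to allocate
    data.Allocated += 1;
    try {
      table.replace(key, etag, data);
      return true;
    } catch (err) {
      if (err.statusCode !== 412) throw err; // only retry ETag conflicts
    }
  }
  throw new Error("too many conflicting updates");
}
```

Because every retry starts from a fresh read, a conflicting writer can delay an allocation but cannot cause an update to be silently overwritten.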
I ended up with a hybrid cache / table storage solution. All instances track the variable via Azure caching, while the first instance spins up a timer that saves the value to table storage once per second. On startup, the cache variable is initialized with the value saved to table storage, if available.

Azure: Using System.Diagnostics.PerformanceCounter

I'm aware of the Microsoft.WindowsAzure.Diagnostics performance monitoring. I'm looking for something more real-time, though, such as System.Diagnostics.PerformanceCounter.
The idea is that the real-time information will be sent in response to an AJAX request.
Using the performance counters available in azure: http://msdn.microsoft.com/en-us/library/windowsazure/hh411520
The following code works (at least in the Azure Compute Emulator; I haven't tried it in a deployment to Azure):
protected PerformanceCounter FDiagCPU = new PerformanceCounter("Processor", "% Processor Time", "_Total");
protected PerformanceCounter FDiagRam = new PerformanceCounter("Memory", "Available MBytes");
protected PerformanceCounter FDiagTcpConnections = new PerformanceCounter("TCPv4", "Connections Established");
Further down in the MSDN page is another counter I would like to use:
Network Interface(*)\Bytes Received/sec
I tried creating the performance counter:
protected PerformanceCounter FDiagNetSent = new PerformanceCounter("Network Interface", "Bytes Received/sec", "*");
But then I receive an exception saying that "*" isn't a valid instance name.
This also doesn't work:
protected PerformanceCounter FDiagNetSent = new PerformanceCounter("Network Interface(*)", "Bytes Received/sec");
Is using performance counters directly in Azure frowned upon?
The issue you're having here isn't related to Windows Azure, but to performance counters in general. As the name implies, Network Interface(*)\Bytes Received/sec is a performance counter for a specific network interface.
To initialize the performance counter, you'll need to initialize it with the name of the instance (the network interface) you'll want to get the metrics from:
var counter = new PerformanceCounter("Network Interface",
    "Bytes Received/sec", "Intel[R] WiFi Link 1000 BGN");
As you can see from the code, I'm specifying the name of the network interface. In Windows Azure you don't control the server configuration (the hardware, the Hyper-V virtual network card, ...), so I wouldn't advise hard-coding the name of the network interface.
That's why it might be safer to enumerate the instance names to initialize the counter(s):
var category = new PerformanceCounterCategory("Network Interface");
foreach (var instance in category.GetInstanceNames())
{
    var counter = new PerformanceCounter("Network Interface",
        "Bytes Received/sec", instance);
    ...
}
