How to get better performance with the Azure Service Bus Standard plan

I can't get above 14 msg/second with the Azure Service Bus Standard plan. I'm running benchmark tests with the Azure-Sample tool referenced in another question.
The test is done against a Service Bus resource with a single queue and all default configurations.

If I read this correctly, you've got a maximum concurrency of one (MaxInflightReceives) with 5 receivers (ReceiverCount). Increasing concurrency and enabling prefetch on the clients will increase the overall throughput (see the sketch after these points). But:
Testing should be done within the same Azure data centre. If you're testing from a local machine, you're introducing substantial latency that cannot be avoided.
The receive mode used is PeekLock. It is slower than ReceiveAndDelete. I'm not suggesting you switch, but it needs to be taken into consideration: you're trading throughput for safety by using PeekLock.
The Standard tier has a cap on the number of operations per second. In addition, your namespace is deployed in a shared environment with entities scattered across various deployment containers, so performance will vary and cannot be guaranteed. If you want guaranteed throughput, use the Premium SKU.
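As a rough illustration of those first two knobs, here's a hedged sketch with the current Azure.Messaging.ServiceBus .NET SDK; the queue name, connection string, and numbers are placeholders to tune against your own workload:

```csharp
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

string connectionString = "<service-bus-connection-string>"; // placeholder

var client = new ServiceBusClient(connectionString);

// Raise concurrency and enable prefetch; a single concurrent call with no
// prefetch roughly matches the low-throughput setup described above.
var options = new ServiceBusProcessorOptions
{
    MaxConcurrentCalls = 16,   // process up to 16 messages in parallel
    PrefetchCount = 100,       // fetch messages ahead of processing
    ReceiveMode = ServiceBusReceiveMode.PeekLock
};

var processor = client.CreateProcessor("benchmark-queue", options);
processor.ProcessMessageAsync += async args => await args.CompleteMessageAsync(args.Message);
processor.ProcessErrorAsync += args => Task.CompletedTask; // log errors in real code
await processor.StartProcessingAsync();
```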

Related

How to set a memory limitation per request on an Azure Function?

I created an Azure Function App (.NET 6 isolated) on the Consumption plan, which is responsible for converting various documents from one format to another, such as converting PDFs to PNGs. However, the processing time for certain documents may be longer due to factors such as the size of the document. I am aware that the Consumption plan has a memory limitation of 1.5 GB per function app. There are two function endpoints on the app, and I would like to set a hard limit on the memory usage per request to ensure that it does not exceed 512 MB. Is this possible?
But the MemoryFailPoint class does not guarantee that a block of code will execute within a specific amount of memory; it only checks that a certain amount of memory is available before the code runs.
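To make the gating behaviour concrete, here's a minimal sketch; ConvertDocument is a hypothetical stand-in for the conversion work:

```csharp
using System;
using System.Runtime;

// Ask the runtime whether ~512 MB is likely to be available before starting.
// This is a gate, not a cap: the gated code can still allocate more than 512 MB.
try
{
    using (var gate = new MemoryFailPoint(sizeInMegabytes: 512))
    {
        ConvertDocument();
    }
}
catch (InsufficientMemoryException)
{
    // Not enough memory up front: reject the request or retry later
    // instead of risking an out-of-memory failure mid-conversion.
}

void ConvertDocument() { /* hypothetical conversion work */ }
```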
Setting a memory limit per function app was possible in Azure Functions before 2016.
Since then there have been changes in the serverless design, especially in how Azure Functions utilizes dependent resources.
Microsoft disabled the memory setting in the Consumption plan based on feedback from Azure users; the Consumption hosting plan now decides resource utilization, including memory and CPU, based on your functions' usage.
Refer to this MS article for more information on memory settings for function apps.

What is the optimal architecture design on Azure for an infrequently used backend that needs a robust configuration?

I'm trying to find the optimal cloud architecture to host a software on Microsoft Azure.
The scenario is the following:
A (containerised) REST API is exposed to the users, through which they can submit POST and GET requests. POST requests trigger a backend that needs a robust configuration to operate properly, and GET requests fetch the result of the backend, if any. This component of the solution is currently hosted on an Azure Web App Service, which does the job perfectly.
The (containerised) backend (triggered by POST requests) performs heavy calculations over a short period of time (typically 5-10 minutes are allotted for the calculation). This backend needs at least 4 cores and 16 GB of RAM, but the more the better.
The current configuration consists of the backend hosted together with the REST API on the App Service, with a plan that accommodates the backend's requirements. This is clearly not very cost-efficient, as the backend is idle ~90% of the time. On top of that, it's not really scalable despite an automatic scaling rule that spawns new instances based on CPU use: if several POST requests come in at the same time, they may be handled by the same instance and crash it due to a lack of memory.
Azure Functions doesn't seem to be an option: the serverless (Consumption plan) solution is restricted to 1.5 GB of RAM and doesn't have Docker support.
Azure Container Instances doesn't fit either: first, the maximum number of CPUs is 4 (which is really few for the needs here, although acceptable), and second, there are cold starts of approximately 2 minutes (presumably due to the creation of the container group, the pull of the image, and so on). Although the process is async from a user perspective, high latency is not acceptable since the result is expected within 5-10 minutes, so cold starts are a problem.
Azure Batch, which at first glance appears to be a perfect fit (beefy configurations available, made for HPC, cost-effective, made for time-limited tasks, ...), seems to be slow too: it takes a couple of minutes to create a pool, and jobs don't run immediately when submitted.
Do you have any idea what I could use?
Thanks in advance!
Azure Functions
You could look at the Azure Functions Elastic Premium plan. EP3 has 4 cores, 14 GB of RAM and 250 GB of storage.
Premium plan hosting provides the following benefits to your functions:
Avoid cold starts with perpetually warm instances
Virtual network connectivity.
Unlimited execution duration, with 60 minutes guaranteed.
Premium instance sizes: one core, two core, and four core instances.
More predictable pricing, compared with the Consumption plan.
High-density app allocation for plans with multiple function apps.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-premium-plan?tabs=portal
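As a rough sketch, assuming the backend were rewritten as a function on the .NET isolated worker model (all names here are illustrative, and RunHeavyCalculationAsync stands in for the real work), the POST endpoint could look like this:

```csharp
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;

public class CalculationFunction
{
    [Function("SubmitCalculation")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req)
    {
        // The heavy calculation runs here; an EP3 instance offers ~4 cores / 14 GB,
        // versus the 1.5 GB cap on the Consumption plan.
        await RunHeavyCalculationAsync(); // hypothetical backend work

        return req.CreateResponse(HttpStatusCode.Accepted);
    }

    private Task RunHeavyCalculationAsync() => Task.CompletedTask; // stub
}
```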
Batch Considerations
When designing an application that uses Batch, you must consider the possibility of Batch not being available in a region. It's possible to encounter a rare situation where there is a problem with the region as a whole, the entire Batch service in the region, or your specific Batch account.
If the application or solution using Batch always needs to be available, then it should be designed to either failover to another region or always have the workload split between two or more regions. Both approaches require at least two Batch accounts, with each account located in a different region.
https://learn.microsoft.com/en-us/azure/batch/high-availability-disaster-recovery
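A minimal failover sketch under the two-account design described above; the submit delegates are hypothetical wrappers around your own Batch job-submission code, one per region:

```csharp
using System;
using System.Threading.Tasks;

static class BatchFailover
{
    // Try the primary region's Batch account first; fall back to the secondary.
    public static async Task SubmitWithFailoverAsync(
        Func<Task> submitToPrimary, Func<Task> submitToSecondary)
    {
        try
        {
            await submitToPrimary();
        }
        catch (Exception)
        {
            // Primary region, Batch service, or account unavailable.
            await submitToSecondary();
        }
    }
}
```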

What is a Unit in terms of Azure SignalR Service?

So I've been going through the Azure SignalR Service for Blazor apps and I've noticed they have their pricing according to units as well. The free version allows up to one unit, whereas the standard version has up to 100 units. I'm currently clueless as to what a "Unit" is in this regard, so it would be nice if someone would be kind enough to give a brief explanation. P.S.: I am relatively new to Blazor, but I have experience with .NET Core and ASP.NET MVC.
A unit is a sub-instance that processes your SignalR messages. Units are used to increase performance and connection count.
An instance is what you need to create first to use SignalR.
Think of a unit this way: let's say you have one web server that is not enough to handle the web traffic. You can add two more servers to load-balance the traffic. This increases the performance and the number of requests your environment can handle. In this example, the environment is an INSTANCE. Each server is a UNIT. Before adding new servers, you have 1 instance and 1 unit in that instance. After adding new servers, you have 1 instance and 3 units in that instance.
SignalR Pricing
In the FREE plan, you can use only 1 unit, and this unit can handle a maximum of 20 concurrent connections.
In the STANDARD plan, you can use up to 100 units. Each unit can handle 1,000 concurrent connections.
(Please note the difference: a unit in the FREE plan supports a maximum of 20 connections, while a unit in the STANDARD plan supports 1,000 connections. In terms of pricing, a FREE plan unit and a STANDARD plan unit are not the same.)
Source: What is the difference between SignalR unit and instance? How SignalR pricing works?
An Azure SignalR unit can be thought of as a node available for processing messages for your app.
You can only select multiple units when using the "Standard" pricing tier (the free tier only allows one unit, with limited throughput).
When you select the Standard tier, you can then add up to 100 units, which theoretically allows you to
handle 1,000 connections per unit (with 100 units, 100,000 connections), and
manage 1 million messages per day per unit (with 100 units, 100 million messages per day).
You can scale up to your needs anytime; it all depends on your app!

What would cause high KUDU usage (and eventual 502 errors) on an Azure App Service Plan?

We have a number of API apps and web apps on an Azure App Service P2v2 instance. We've been experiencing a degree of platform instability: the App Service becomes unhealthy and we get a rash of 502 errors across various of the apps (different ones each time), attributable to very high CPU and memory usage on the App Service. We've tried scaling all the way up to P3v2, but whatever the issue is seems eventually to consume all available resources.
Whenever we've been able to trace a culprit among the apps, it has turned out not to be the app itself but the Kudu service related to it.
A sample error message is "High physical memory usage detected on multiple occasions. The kudu process for the app [sitename] 'pe-services-color' is the most common cause of high memory usage. The most common cause of high memory usage for the kudu process is web jobs.", where the actual app whose Kudu service is named changes quite frequently.
What could be causing the Kudu services to consume so much CPU/Memory, and what can we do to stabilise this app service?
Is it simply that we have too many apps running on one plan? This seems unlikely since all these apps ran previously on a single classic cloud service instance, but if so, what are the limits for apps and slots on a single plan?
(I have seen this question but the answer doesn't help)
Update
From Azure support, these are apparently the limits on Small - Medium - Large non-shared app services:
Worker size    Max sites
Small          5
Medium         10
Large          20
with 'sites' comprising app services/api apps and their slots.
They seem ridiculously low, and make the larger App Service units highly uneconomic. Can anyone confirm these numbers?
(Incidentally, we found that turning off Always On across the board fixed the issue - it was only causing a problem on empty sites though - we haven't had a chance yet to see if performance is good with all the sites filled.)
High CPU and memory utilization is mostly caused by your program/code itself. Lots of CPU-intensive tasks, or heavy use of parallel programming that spawns many new threads, can contribute to high CPU and memory utilization, so review your code for such instances. As the number of parallel threads increases, CPU utilization goes up and the plan starts scaling up frequently, which adds to your cost and can sometimes cause thread loss and unexpected results. As Azure resource costs are high, you need to plan your performance accordingly.
You can monitor this using the Metrics option on the App Service plan blade.
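If unbounded parallel work turns out to be the culprit, one mitigation is to cap the degree of parallelism. A hedged sketch (the work items and the cap of 2 are placeholders):

```csharp
using System.Threading.Tasks;

var workItems = new[] { 1, 2, 3 }; // stand-in for your real work items

// Cap the worker threads so one job can't saturate the plan's CPUs.
Parallel.ForEach(
    workItems,
    new ParallelOptions { MaxDegreeOfParallelism = 2 },
    item => { /* hypothetical per-item work */ });
```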

Deleting items from Azure queue painfully slow

My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago, it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters, and from that data deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
In the SO post "How to achive more 10 inserts per second with azure storage tables" and on the MSDN blog
http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some info on how to speed up communication with the queue, but those posts only look at the insertion of new items. So far, I haven't been able to find anything on why deletion of queue items should be slow. So the questions are:
(1) Does anyone have a general idea why deletion suddenly may be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mentioning of any service disruption in West Europe (which is where my stuff is situated); can I rely on the service pages?
(3) In the same storage, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2TB?
1) Sorry, no. Not a general one.
2) Can you rely on the service pages? They certainly will give you information, but there is always a lag between the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates, and in the management portal you are starting to see where they will notify you if your particular deployments might be affected. That said, it is not unheard of for small issues to crop up from time to time that never show up on the board because they don't break the SLA or are resolved extremely quickly. It's good you checked this, though; it's usually a good first step.
3) In general, no, the amount of data you have within a storage account should NOT affect your throughput; however, there is a limit to the amount of throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages a second for all access to a storage account. If you have a LOT of applications or systems attempting to access data in this same storage account, you might see some throttling or failures as you approach that limit. Note that, as you saw with the posts on improving throughput for inserts, these are performance targets, and how your code is written and the configurations you use have a drastic effect on this. The data limit for a storage account (everything in it) is 500 TB, not 2 TB. I believe once you hit the actual storage limit, all writes will simply fail until more space is available (I've never even got close to it, so I'm not 100% sure on that).
Throughput is also limited at the partition level, and for a queue that's a target of up to 2,000 messages per second, which you clearly aren't getting at all. Since you have only a single worker role, I'll take a guess that you don't have that many producers of the messages either, at least not enough to get near the 2,000 msgs per second.
I'd turn on storage analytics to see if you are getting throttled, and check out the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) recorded in the $MetricsMinutePrimaryTransactionQueue table that analytics turns on. This will help give you an idea of trends over time and may help determine if it is a latency issue between the worker roles and the storage system.
The reason I asked about the size of the VM for the worker role is that there is an (unpublished) amount of throughput per VM based on its size. An XS VM gets much less of the total throughput on the NIC than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance issues for network-bound work when testing. I'd still expect much better throughput than what you are seeing, though.
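If the serial delete is the bottleneck, one mitigation is to receive a batch and overlap the deletes. A sketch using the current Azure.Storage.Queues SDK (rather than the storage client library of that era); the queue name and batch size are placeholders:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Queues;

string connectionString = "<storage-connection-string>"; // placeholder
var queue = new QueueClient(connectionString, "work-items");

// Pull a batch, then issue the deletes concurrently rather than one at a
// time, so a single slow delete (~1 s) doesn't stall the whole loop.
var messages = (await queue.ReceiveMessagesAsync(maxMessages: 32)).Value;
await Task.WhenAll(messages.Select(async m =>
{
    // ... process m.MessageText here ...
    await queue.DeleteMessageAsync(m.MessageId, m.PopReceipt);
}));
```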
There is a network between you and Azure Storage, which can degrade latency.
Sudden peaks (e.g. from 20 ms to 2 s) can happen often, so you need to deal with this in your code (see the retry sketch below).
To pinpoint the problem further (e.g. client issues, network errors, etc.), you can turn on storage analytics to see where the problem lies. There you can also see whether the end-to-end latency is too big or just the server latency is the limiting factor. The former usually points to network issues, the latter to something being wrong with the queue itself.
Usually those latency issues are transient (just temporary), and there is no need to announce that as a service disruption, because it isn't one. If the performance is consistently bad, you should open a support ticket.
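One way to deal with such peaks in code is to bound the per-attempt network timeout and let the client retry with backoff. A sketch using the retry options of the current Azure.Storage.Queues SDK; the numbers are illustrative:

```csharp
using System;
using Azure.Core;
using Azure.Storage.Queues;

var options = new QueueClientOptions();
options.Retry.Mode = RetryMode.Exponential;              // back off between attempts
options.Retry.MaxRetries = 5;                            // ride out transient peaks
options.Retry.NetworkTimeout = TimeSpan.FromSeconds(5);  // cut off a multi-second spike early

var queue = new QueueClient("<storage-connection-string>", "work-items", options);
```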
