I'm working on a demo for Azure Functions using queue triggers. I created a recursive Sudoku solver to show how to take depth-first search and convert it to queued recursion. The code is on GitHub.
I was expecting it to scale out and process an insane number of messages per second, but it is barely processing 30/s. The queue is filling up and the utilization seems minimal.
How can I get better performance from this? I tried increasing the batch size in host.json, but it didn't seem to help. I have over 200k messages in the queue and it's growing.
Update 1
I tried setting the host.json file as follows:
{
  "queues": {
    "visibilityTimeout": "00:00:10",
    "batchSize": 32,
    "maxDequeueCount": 5,
    "newBatchThreshold": 100
  }
}
but requests per second remained the same.
I deployed the same function to another instance, but tied it to an S4 App Service plan. That one is able to process about 64 requests per second, which still seems slow.
I can process the messages serially on my local machine faster than this.
Update 2
I scaled the S4 plan to 10 instances, and each instance handles about 60-70 requests per second. But that's insanely expensive for throughput that still doesn't match a single core running locally. The queue used by the service-plan functions now has 500k messages piled up.
Azure Functions do not listen for an item to be added to a queue; they poll the queue using a polling algorithm, which you can override with the maxPollingInterval property.
Adding "maxPollingInterval": "00:00:01" to the options you have already mentioned above should solve your problem.
maxPollingInterval Azure documentation
I am using an Azure Service Bus queue for one of my requirements. The requirement is simple: an Azure Function acts as an API and creates multiple jobs in the queue. The function is scalable, with new instances created on demand, so jobs are enqueued in parallel, probably one job every 500 ms. The jobs it creates are processed by a Windows service, so the sender is the Azure Function and the receiver is the Windows service.
The Windows service is a single instance: a queue listener that listens to this queue and executes jobs in parallel. So there may be many senders, but there is only one receiver instance, and the number of jobs running in parallel must be limited (to 4, since each job takes significant time and CPU). Right now, I am using an Azure Service Bus queue with the configuration listed below. My question is which configuration produces the best performance for this particular requirement.
The deletion of the job in the queue will not be an issue for me. So, can I use ReceiveAndDelete instead of Peek-Lock?
Also, right now, the messages received by the listener are not in the order they were created; I want to maintain that order. My requirement is maximum performance. The work done by the Windows service is a CPU-intensive task, which is why I have limited concurrency to 4, since the system has 4 cores.
Max delivery count: 4, Message lock duration: 5 min, MaxConcurrentCalls: 4 (in the listener). I am new to Service Bus, so I need suggestions on these settings.
One more question: suppose the listener receives 4 jobs in parallel and starts executing them, and then one job completes. Will the listener pick up the next item immediately, or wait for all 4 jobs to complete (MaxConcurrentCalls: 4)?
The deletion of the job in the queue will not be an issue for me. So, can I use ReceiveAndDelete instead of Peek-Lock?
Receiving messages in PeekLock receive mode will be less performant than ReceiveAndDelete; you'll save the round-trips to the broker needed to complete messages.
Max delivery count: 4, Message lock duration: 5 min, MaxConcurrentCalls: 4 (in the listener). I am new to Service Bus, so I need suggestions on these settings.
MaxDeliveryCount is how many times a message can be attempted before it's dead-lettered. In your settings it happens to equal the number of cores, but the two are unrelated; it could be just a coincidence.
MessageLockDuration will only matter if you use PeekLock receive mode. For ReceiveAndDelete it won't matter.
As for concurrency, even though your work is CPU-bound, I'd benchmark whether higher concurrency is possible.
An additional parameter on the message receiver to look into would be PrefetchCount. It can improve the overall performance by making fewer roundtrips to the broker.
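As a rough sketch of how those settings fit together, using the newer Azure.Messaging.ServiceBus SDK (the connection string, queue name, and handler body are placeholders, not taken from the question):

using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class QueueListener
{
    static async Task Main()
    {
        var client = new ServiceBusClient("<connection-string>");

        var options = new ServiceBusProcessorOptions
        {
            // ReceiveAndDelete skips the completion round-trip to the broker.
            ReceiveMode = ServiceBusReceiveMode.ReceiveAndDelete,
            // One in-flight message per core on a 4-core box; benchmark higher values too.
            MaxConcurrentCalls = 4,
            // Prefetch buffers messages locally, reducing broker round-trips.
            PrefetchCount = 16
        };

        ServiceBusProcessor processor = client.CreateProcessor("<queue-name>", options);
        processor.ProcessMessageAsync += args =>
        {
            // The CPU-bound job goes here.
            Console.WriteLine(args.Message.Body.ToString());
            return Task.CompletedTask;
        };
        processor.ProcessErrorAsync += args =>
        {
            Console.WriteLine(args.Exception);
            return Task.CompletedTask;
        };

        await processor.StartProcessingAsync();
        Console.ReadKey();
        await processor.StopProcessingAsync();
    }
}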
One more question: suppose the listener receives 4 jobs in parallel and starts executing them, and then one job completes. Will the listener pick up the next item immediately, or wait for all 4 jobs to complete (MaxConcurrentCalls: 4)?
The listener will immediately start processing the 5th message, since your concurrency is set to 4 and one of the in-flight messages has completed.
Also, right now, the messages received by the listener are not in the order they were created; I want to maintain that order.
To process messages in the order they were sent in you will need to send and receive messages using sessions.
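A minimal sketch of the session pattern with the same SDK (names are placeholders; the queue must be created with sessions enabled). The sender stamps a SessionId, and the receiver uses a session processor:

using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class OrderedJobs
{
    static async Task Main()
    {
        var client = new ServiceBusClient("<connection-string>");

        // Sender side: messages sharing a SessionId are delivered in send order.
        ServiceBusSender sender = client.CreateSender("<queue-name>");
        await sender.SendMessageAsync(new ServiceBusMessage("job-1") { SessionId = "jobs" });

        // Receiver side: one concurrent session keeps strict ordering.
        var options = new ServiceBusSessionProcessorOptions { MaxConcurrentSessions = 1 };
        ServiceBusSessionProcessor processor = client.CreateSessionProcessor("<queue-name>", options);
        processor.ProcessMessageAsync += args =>
        {
            Console.WriteLine(args.Message.Body.ToString());
            return Task.CompletedTask;
        };
        processor.ProcessErrorAsync += args => Task.CompletedTask;
        await processor.StartProcessingAsync();
        Console.ReadKey();
    }
}

Bear in mind that strict ordering works against the stated goal of maximum performance: messages within a session are processed one at a time.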
My requirement is maximum performance. The work done by the Windows service is a CPU-intensive task, which is why I have limited concurrency to 4, since the system has 4 cores.
There are multiple things to take into consideration. The location of your Windows service relative to the Service Bus namespace will impact latency and message throughput. Scaling out could help, etc.
So suppose you have an application that lets users request a job. For example (hypothetical): a user uploads a video. An entry is made in an RDBMS with the URL of the video on blob storage, and the status is set to "Pending".
There is a recurring timer-triggered function app that executes every 10 seconds or so, gets 10 pending jobs from the RDBMS, and performs some compression etc.
The problem here is that as long as the number of requests stays at 10-30 videos per 10 seconds, we should be fine. But if the number of requests suddenly increases, say to 200 requests per 10 seconds, a lot of jobs will sit pending and users will have to wait 10 times longer than usual to see the status change. How do you scale out the function app automatically in such a scenario? Does it have to be manual?
There's an easier way to get fan out and parallel processing through multiple concurrently running Azure Functions.
Add an Azure Service Bus Queue to your solution.
For each video that needs to be processed, enqueue a service bus message with the appropriate data you'll need to retrieve and process the video (like the BlobId).
Have your Azure Function triggered by a ServiceBusTrigger.
Azure will spin up additional instances of your Azure Function as the queue depth increases. It'll also scale in idle instances after there's no more data to process.
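A sketch of what the triggered function could look like (the queue name "video-jobs" and the "ServiceBusConnection" app setting are illustrative, not part of the question):

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class VideoCompressor
{
    // "video-jobs" and "ServiceBusConnection" are placeholder names.
    [FunctionName("VideoCompressor")]
    public static void Run(
        [ServiceBusTrigger("video-jobs", Connection = "ServiceBusConnection")] string message,
        ILogger log)
    {
        // The message body carries whatever was enqueued, e.g. the BlobId of the video.
        log.LogInformation($"Processing video job: {message}");
        // ...download the blob, compress it, update the status in the RDBMS...
    }
}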
I have an Azure Function with an EventHub trigger, on a Consumption plan. In my test I sent 3000 events to the event hub in a few batches. Since the time for those 3000 events was almost 10 times longer than the time for 300 events, I suspected that the Azure Function didn't scale to multiple VMs/instances.
To verify that hypothesis, I used a static Guid variable, which I initialized once and logged on every run of the function. All 3000 runs logged the same Guid.
That happens even if I specify the following configuration in host.json:
"eventHub": {
"maxBatchSize": 1,
"prefetchCount": 10
}
The logic was that this would limit parallel processing within a single instance and force multiple instances to start, but again only one Guid was logged.
As a note, this is not the only function in the function app. Could that be the issue? What conditions need to be satisfied for the function to be started on multiple VMs?
Edit:
I have 32 partitions and 20 throughput units. The first issue was that I was using SendBatchAsync, which doesn't partition events. Even SendAsync didn't bring any scale; it behaved as if it wasn't partitioning. So I created partitioned event hub senders and did round-robin partitioning when sending events in the client application.
That increased the number of events processed by the Azure Function, but still didn't spin up more than 1 VM.
Furthermore, the number of events processed per second was much higher at the beginning (~200 at any moment) and dropped to ~5 after 2000 events, near the end. This has nothing to do with system load, as the same behavior was observed with 9000 events, where the slowdown happened after ~5k events.
This Azure Function runs for 50-250 ms, depending on the load.
It also sends an event to another Azure Function through an Azure Storage queue trigger. Interestingly, that queue-triggered function doesn't scale to more than 1 VM either, and it has ~1k messages in its queue from the beginning, before the event-hub-triggered function starts slowing down. The queue settings in host.json are:
"queues": {
  "maxPollingInterval": 2000,
  "visibilityTimeout": "00:00:10",
  "batchSize": 32,
  "maxDequeueCount": 5,
  "newBatchThreshold": 1
}
Thanks.
It depends on a few factors:
the number of partitions your event hub has, and whether the events you are writing are distributed across those partitions. Azure Functions uses Event Processor Host to process your workload, and the maximum scale you can get in this mode is one VM per partition.
the per-event workload you're executing. For example, if your function does nothing but log, those 3000 events could be processed in less than 5 seconds on a single VM. That would not warrant scaling your application onto multiple instances.
However, if you're writing a batch of events across several partitions that takes several minutes in total to process, and you don't see your throughput increasing as your function scales out, that could indicate that something is not working right and would warrant further investigation.
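If the send side is the bottleneck, one way to guarantee distribution across partitions is to supply a rotating partition key. Here is a sketch with the newer Azure.Messaging.EventHubs SDK (connection string and hub name are placeholders):

using System;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

class PartitionedSender
{
    static async Task Main()
    {
        await using var producer = new EventHubProducerClient("<connection-string>", "<hub-name>");
        for (int i = 0; i < 3000; i++)
        {
            // A rotating partition key hashes events across all 32 partitions,
            // which is what lets Event Processor Host fan readers out.
            var options = new SendEventOptions { PartitionKey = (i % 32).ToString() };
            // One event per call is only for illustration; batching several
            // events per SendAsync improves throughput.
            await producer.SendAsync(new[] { new EventData(BinaryData.FromString($"event-{i}")) }, options);
        }
    }
}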
I have an azure storage queue that has over 100,000 queue items on it. The average processing time is about 1 minute to complete each item (as reported in the WebJob dashboard).
I have set the max batch size for my WebJob to 32, like this:
JobHostConfiguration config = new JobHostConfiguration();
config.Queues.BatchSize = 32;
var host = new JobHost(config);
// The following code ensures that the WebJob will be running continuously
host.RunAndBlock();
If I set it any higher than 32, the WebJob won't start and keeps flipping between "pending restart" and "starting", so I assume 32 is the max batch size.
However, my App Service plan is running at a cool 4% CPU utilization, and I have enabled auto-scale based on CPU usage.
What I want to do is figure out how to make the WebJob do more tasks in parallel, so it starts using more of that CPU if it needs to, hopefully triggering auto-scale and processing more. What levers can I pull to make my WebJob take better advantage of my App Service plan instances?
Note that the BatchSize maximum of 32 is a limit imposed by Azure Queues that the WebJobs SDK doesn't control: a single queue listener can only pull a maximum of 32 messages at a time, because that's all the queue service allows. That's why your job doesn't start properly when you set it higher than 32; if you check your error logs you should see an error to that effect.
However, there is a second config knob relating to parallel throughput: config.Queues.NewBatchThreshold. This value defaults to half the BatchSize when not explicitly set. It is the threshold that governs when a new batch is fetched: if you set it to 100, a new batch is fetched whenever the number of messages currently being processed dips below 100, so more queue messages are processed in parallel.
You can also further increase throughput by scaling out your job to multiple instances. I recommend trying the NewBatchThreshold setting first and see where that gets you.
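Concretely, that might look like the following; the threshold of 100 is just a starting point to tune. With these settings a single instance can have up to BatchSize + NewBatchThreshold = 132 messages in flight at once:

JobHostConfiguration config = new JobHostConfiguration();
config.Queues.BatchSize = 32;            // 32 is the hard maximum imposed by Azure Queues
config.Queues.NewBatchThreshold = 100;   // fetch another batch when in-flight messages drop below 100
var host = new JobHost(config);
host.RunAndBlock();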
This comment in the code explains the situation:
// Azure Queues currently limits the number of messages retrieved to 32. We enforce this constraint here because
// the runtime error message the user would receive from the SDK otherwise is not as helpful.
private const int MaxBatchSize = 32;
More information about this can be found on https://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-queues/:
There are two ways you can customize message retrieval from a queue. First, you can get a batch of messages (up to 32). [etc...]
So that's where this limit comes from. However, the WebJobs SDK could theoretically process multiple queue batches at the same time, so it doesn't have to be bound by this Storage Queue limitation. That's something you should bring up on https://github.com/Azure/azure-webjobs-sdk/issues for further discussion to see what can be done. But as it stands, that is indeed the limitation.
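For reference, the raw storage call behind that limit looks roughly like this (a sketch with the classic Microsoft.WindowsAzure.Storage SDK; the connection string and queue name are placeholders, and requesting more than 32 messages fails with an error from the service):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

CloudQueue queue = CloudStorageAccount
    .Parse("<connection-string>")
    .CreateCloudQueueClient()
    .GetQueueReference("work");

// A single round-trip returns at most 32 messages - the service limit.
foreach (CloudQueueMessage message in queue.GetMessages(32))
{
    // ...process the message...
    queue.DeleteMessage(message);
}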
I have a service which polls a queue very quickly to check for more 'work' which needs to be done. There is always more work in the queue than a single worker can handle. I want to make sure a single worker doesn't grab too much work when the service is already at max capacity.
Let's say my worker grabs 10 messages from the queue every N ms and uses the Task Parallel Library to process each message in parallel on different threads. The work itself is very IO-heavy: many SQL Server queries and even Azure Table storage requests (HTTP calls) are made for a single unit of work.
Is using ThreadPool.GetAvailableThreads() the proper way to throttle how much work the service is allowed to grab?
I see that I have access to available WorkerThreads and CompletionPortThreads. For an IO-heavy process, is it more appropriate to look at how many CompletionPortThreads are available? I believe 1000 are made available per process, regardless of CPU count.
Update: it might be important to know that the queue I'm working with is an Azure queue. So, each request to check for messages is made as an async HTTP request which returns the next 10 messages (and costs money).
I don't think using IO completion ports is a good way to work out how much to grab.
I assume that the ideal situation is where you run out of work just as the next set arrives, so you've never got more backlog than you can reasonably handle.
Why not keep track of how long it takes to process a job and how long it takes to fetch jobs, and adjust the amount of work fetched each time based on that, with suitable minimum/maximum values to stop things going crazy if you have a few really cheap or really expensive jobs?
You'll also want to work out a reasonable optimum degree of parallelization: it's not clear to me whether your work is really IO-heavy or just "asynchronous-request heavy", i.e. you spend a lot of time just waiting for responses to complicated queries which are in themselves cheap for your service's resources.
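A rough sketch of that feedback loop; the numbers, FetchMessages, and ProcessMessage are all illustrative stand-ins for your own queue code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;

class AdaptiveFetcher
{
    const int MinFetch = 1, MaxFetch = 32;   // 32 is the Azure queue per-request limit
    int fetchSize = 10;                      // starting point

    // Hypothetical stand-ins for the real queue fetch and per-message work.
    IEnumerable<string> FetchMessages(int count) => new string[count];
    void ProcessMessage(string message) { /* unit of work */ }

    public void Run()
    {
        while (true)
        {
            var fetchTimer = Stopwatch.StartNew();
            var messages = FetchMessages(fetchSize);
            fetchTimer.Stop();

            var workTimer = Stopwatch.StartNew();
            Parallel.ForEach(messages, ProcessMessage);
            workTimer.Stop();

            // If the work finished quickly relative to the fetch round-trip,
            // grab more next time; if it dragged on, grab less. The clamps
            // stop a few unusually cheap or expensive jobs causing wild swings.
            if (workTimer.ElapsedMilliseconds < fetchTimer.ElapsedMilliseconds * 5)
                fetchSize = Math.Min(fetchSize * 2, MaxFetch);
            else if (workTimer.ElapsedMilliseconds > fetchTimer.ElapsedMilliseconds * 20)
                fetchSize = Math.Max(fetchSize / 2, MinFetch);
        }
    }
}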
I've been working on virtually the same problem in the same environment. I ended up giving each WorkerRole an internal work queue, implemented as a BlockingCollection<>. A single thread monitors that queue; when the number of items gets low, it requests more from the Azure queue. It always requests the maximum number of items, 32, to cut down costs, and backs off automatically when the queue is empty.
Then I have a set of worker threads that I start myself. They sit in a loop, pulling items off the internal work queue. The number of worker threads is my main lever for tuning the load, so I've made it an option in the .cscfg file. I'm currently running 35 threads per worker, but that number will depend on your situation.
I tried using the TPL to manage the work, but I found it more difficult to manage the load. Sometimes the TPL would under-parallelize and the machine would sit bored; other times it would over-parallelize and the Azure queue message visibility would expire while the item was still being worked on.
This may not be the optimal solution, but it seems to be working OK for me.
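An outline of that arrangement, using the classic storage SDK for illustration (connection string, queue name, thresholds, and the thread count standing in for the .cscfg option are all placeholders):

using System;
using System.Collections.Concurrent;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class InternalWorkQueue
{
    static readonly BlockingCollection<CloudQueueMessage> internalQueue =
        new BlockingCollection<CloudQueueMessage>(boundedCapacity: 100);

    static void Main()
    {
        CloudQueue azureQueue = CloudStorageAccount
            .Parse("<connection-string>")
            .CreateCloudQueueClient()
            .GetQueueReference("work");

        // Single monitor thread: tops up the internal queue in batches of 32
        // (the per-request maximum) and backs off when the Azure queue is empty.
        new Thread(() =>
        {
            while (true)
            {
                if (internalQueue.Count < 32)
                {
                    bool gotAny = false;
                    foreach (CloudQueueMessage m in azureQueue.GetMessages(32))
                    {
                        internalQueue.Add(m);
                        gotAny = true;
                    }
                    if (!gotAny) Thread.Sleep(5000);   // crude backoff on an empty queue
                }
                else Thread.Sleep(100);
            }
        }).Start();

        // Worker threads pull from the internal queue; 35 mirrors the setting above.
        for (int i = 0; i < 35; i++)
        {
            new Thread(() =>
            {
                foreach (CloudQueueMessage m in internalQueue.GetConsumingEnumerable())
                {
                    // ...do the unit of work, then remove it from the Azure queue...
                    azureQueue.DeleteMessage(m);
                }
            }).Start();
        }
    }
}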
I decided to keep an internal counter of how many messages are currently being processed. I used Interlocked.Increment/Decrement to manage the counter in a thread-safe manner.
I would have used the Semaphore class, since each message is tied to its own thread, but I wasn't able to due to the async nature of the queue poller and the code which spawned the threads.
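A minimal sketch of that counter-based throttle (the cap of 10 is illustrative):

using System.Threading;

class InFlightThrottle
{
    const int MaxInFlight = 10;   // illustrative cap on concurrent work
    static int inFlight;          // messages currently being processed

    // Called by the poller before fetching: how many more messages may be grabbed.
    public static int SlotsAvailable() => MaxInFlight - Volatile.Read(ref inFlight);

    // Wrap each message's processing with these two calls.
    public static void BeginWork() => Interlocked.Increment(ref inFlight);
    public static void EndWork() => Interlocked.Decrement(ref inFlight);
}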