Azure Function batchSize

I am wondering about how parallelism works in Azure Functions. Say I have a batchSize of 32 and a newBatchThreshold of 16. If the queue grows too large, the scale controller spins up a new function instance to handle the pressure. I understand this bit. What I don't understand is: does a single instance work through the batch on its own? That is, do I only have one function execution running per batch, or does the runtime run multiple threads, each executing the function?
Could I end up with two instances running, each holding 32 messages, and concurrently 32 threads running 32 function executions at once?
Imagine I have a function calling a web API. That would mean the API gets 64 calls at once, which I don't want.
What I want is 2 function instances working on 32 messages each, making 1 call per message per instance.
I hope you guys understand.

Yes, that is indeed how scaling works. This is explained in a bit more detail in the docs as well.
According to those, a single instance of your function could be processing up to 48 messages at a time (batchSize + newBatchThreshold: 32 from a newly fetched batch plus 16 still in flight from the previous one), and the app could scale out to multiple instances depending on the queue length.
To achieve the scenario you've mentioned, you would have to
Set the batchSize to 1 to avoid parallel processing per instance
Set the WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT app setting to 2 to limit scale out to a max of 2 instances
Note that with these settings neither instance will load 32 messages at once, but between them they will still work through the queue.
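For reference, here is a minimal sketch of such a queue-triggered function (written for the in-process C# model purely for illustration; the same host.json setting applies whatever language the function is written in, and the queue name and downstream URL are placeholders). With batchSize set to 1 and WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT set to 2, no more than two of these invocations, and therefore no more than two web API calls, run at the same time across the app:

using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class QueueWorker
{
    private static readonly HttpClient Http = new HttpClient();

    // host.json: "queues": { "batchSize": 1 }
    // App setting: WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 2
    [FunctionName("ProcessQueueMessage")]
    public static async Task Run(
        [QueueTrigger("work-items")] string message,
        ILogger log)
    {
        log.LogInformation($"Processing {message}");

        // One outbound call per message; with the settings above at most two
        // of these calls are in flight at any moment across the whole app.
        await Http.GetAsync($"https://example.org/api/work?item={message}");
    }
}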

Related

Azure (Durable) Functions - Managing parallelism

I'm posting this question to see if I'm understanding parallelism in Azure Functions correctly, and particularly Durable Functions.
The ability to set max degree of parallelism was recently added to Azure Functions using az cli:
https://github.com/Azure/azure-functions-host/issues/1207
az resource update --resource-type Microsoft.Web/sites -g <resource_group> -n <function_app_name>/config/web --set properties.functionAppScaleLimit=<scale_limit>
I've applied this to my Function App, but what I'm unsure of is how this plays with the MaxConcurrentOrchestratorFunctions and MaxConcurrentActivityFunctions settings for Durable Functions.
Would the below lead to a global max of 250 concurrent activity functions?
functionAppScaleLimit: 5
MaxConcurrentOrchestratorFunctions: 5
MaxConcurrentActivityFunctions: 10
Referring to the link you shared about limiting scaling: functionAppScaleLimit specifies the maximum number of instances for your function app. As for the other two settings, MaxConcurrentOrchestratorFunctions sets the maximum number of orchestrator functions that can be processed concurrently on a single host instance, and MaxConcurrentActivityFunctions sets the maximum number of activity functions that can be processed concurrently on a single host instance.
Now, to explain what MaxConcurrentOrchestratorFunctions does, which should help you understand how this works:
MaxConcurrentOrchestratorFunctions controls how many orchestrator functions can be loaded into memory at any given time. If you set concurrency to 1 and then start 10 orchestrator functions, only one will be loaded in memory at a time. Remember that if an orchestrator function calls an activity function, the orchestrator function will unload from memory while it waits for a response. During this time, another orchestrator function may start. The effect is that you will have as many as 10 orchestrator functions running in an interleaved way, but only 1 should actually be executing code at a time.
The motivation for this feature is to limit the CPU and memory used by orchestrator code. It's not going to be useful for implementing any kind of singleton pattern. If you want to limit the number of active orchestrations, you will need to implement that limit yourself.
Your global max of activity functions would be 50. This is based on 5 app instances as specified by functionAppScaleLimit and 10 activity functions as specified by MaxConcurrentActivityFunctions. The relationship between the number of orchestrator function executions and activity function executions depends entirely on your specific implementation. You could have 1-1,000 orchestration(s) that spawn 1-1,000 activities. Regardless, the settings you propose will ensure there are never more than 5 orchestrations and 10 activities running concurrently on a single function instance.
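To make the relationship concrete, here is a minimal fan-out sketch (assuming the in-process .NET Durable Functions SDK; the function names and the count of 1,000 are purely illustrative). However many activities the orchestrator schedules, each host instance executes at most MaxConcurrentActivityFunctions of them at once, and functionAppScaleLimit caps how many instances exist:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class FanOutExample
{
    [FunctionName("Orchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Schedule 1,000 activities; they are queued by the Durable Task
        // framework rather than all started immediately.
        var tasks = new List<Task>();
        for (var i = 0; i < 1000; i++)
        {
            tasks.Add(context.CallActivityAsync("ProcessItem", i));
        }

        // With functionAppScaleLimit = 5 and MaxConcurrentActivityFunctions = 10,
        // at most 5 * 10 = 50 of these run concurrently across the whole app.
        await Task.WhenAll(tasks);
    }

    [FunctionName("ProcessItem")]
    public static Task ProcessItem([ActivityTrigger] int item)
    {
        // ... process a single item ...
        return Task.CompletedTask;
    }
}

In other words, these settings bound concurrency per instance and the number of instances; they do not change how many activities your orchestrations create.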

Why is Python consistently struggling to keep up with constant generation of asyncio tasks?

I have a Python project with a server that distributes work to one or more clients. Each client is given a number of assignments which contain parameters for querying a target API. This includes a maximum number of requests per second they can make with a given API key. The clients process the response and send the results back to the server to store into a database.
Both the server and clients use Tornado for asynchronous networking. My initial implementation for the clients relied on the PeriodicCallback to ensure that n-number of calls to the API would occur. I thought that this was working properly as my tests would last 1-2 minutes.
I added some telemetry to collect statistics on performance and noticed that the clients were actually having issues after almost exactly 2 minutes of runtime. I had set the API requests to 20 per second (the maximum allowed by the API itself), which the clients could reliably hit. However, after 2 minutes performance would fluctuate between 12 and 18 requests per second. The number of active tasks steadily increased until it hit the maximum number of active assignments (100) given by the server, and the HTTP request time to the API was reported by Tornado to go from 0.2-0.5 seconds to 6-10 seconds. Performance is steady if I only do 14 requests per second. Anything higher than 15 requests will experience issues 2-3 minutes after starting. Logs can be seen here. Notice how the "Active Queries" column is steady until 01:19:26. I've truncated the log to demonstrate the issue.
I believed the issue was the use of a single process on the client to handle both communication to the server and the API. I proceeded to split the primary process into several different processes. One handles all communication to the server, one (or more) handles queries to the API, another processes API responses into a flattened class, and finally a multiprocessing Manager for Queues. The performance issues were still present.
I thought that, perhaps, Tornado was the bottleneck and decided to refactor. I chose aiohttp and uvloop. I split the primary process in a similar manner to that in the previous attempt. Unfortunately, performance issues are unchanged.
I took both refactors and enabled them to split work into several querying processes. However, no matter how much you split the work, you still encounter problems after 2-3 minutes.
I am using both Python 3.7 and 3.8 on MacOS and Linux.
At this point, it does not appear to be a limitation of a single package. I've thought about the following:
Python's asyncio library cannot handle more than 15 coroutines/tasks being generated per second
I doubt that this is true given that different libraries claim to be able to handle several thousand messages per second simultaneously. Also, we can hit 20 requests per second just fine at the start with very consistent results.
The API is unable to handle more than 15 requests from a single client IP
This is unlikely as I am not the only user of the API and I can request 20 times per second fairly consistently over an extended period of time if I over-subscribe processes to query from the API.
There is a system configuration causing the limitation
I've tried both MacOS and Debian, which yield the same results. It's possible that it's a *nix problem.
Variations in responses cause a backlog which grows linearly until it cannot be tackled fast enough
Sometimes responses from the API grow and shrink between 0.2 and 1.2 seconds. The number of active tasks returned by asyncio.all_tasks remains consistent in the telemetry data. If this were true, we wouldn't be consistently encountering the issue at the same time every time.
We're overtaxing the hardware with the number of tasks generated per second and causing thermal throttling
Although CPU temperatures spike, neither MacOS nor Linux report any thermal throttling in the logs. We are not hitting more than 80% CPU utilization on a single core.
At this point, I'm not sure what's causing it and have considered refactoring the clients into a different language (perhaps C++ with Boost libraries). Before I dive into something so foolish, I wanted to ask if I'm missing something simple.
Conclusion
Performance appears to vary wildly depending on time of day. It's likely to be the API.
How this conclusion was made
I created a new project to demonstrate the capabilities of asyncio and determine if it's the bottleneck. This project takes two websites, one to act as the baseline and the other is the target API, and runs through different methods of testing:
Spawn one process per core, pass a semaphore, and query up to n-times per second
Create a single event loop and create n-number of tasks per second
Create multiple processes with an event loop each to distribute the work, with each loop performing (n-number / processes) tasks per second
(Note that spawning processes is incredibly slow and often commented out unless using high-end desktop processors with 12 or more cores)
The baseline website would be queried up to 50 times per second. asyncio could complete 30 tasks per second reliably for an extended period, with each task completing its run in 0.01 to 0.02 seconds. Responses were very consistent.
The target website would be queried up to 20 times per second. Sometimes asyncio would struggle despite circumstances being identical (JSON handling, dumping response data to queue, returning immediately, no CPU-bound processing). However, results varied between tests and could not always be reproduced. Responses would be under 0.4 seconds initially but quickly increase to 4-10 seconds per request. 10-20 requests would return as complete per second.
As an alternative method, I chose a parent URI for the target website. This URI would not require a large query against their database but would instead be served with a static JSON response. Responses bounced between 0.06 seconds and 2.5-4.5 seconds. However, 30-40 responses would be completed per second.
Splitting requests across processes with their own event loop would decrease response time in the upper-bound range by almost half, but still took more than one second each to complete.
The inability to reproduce consistent results every time from the target website would indicate that it's a performance issue on their end.

Azure Function with Java: configure batchSize and newBatchThreshold efficiently

I'm considering using such a solution where a Function is triggered by a Queue in Java. I'm trying to understand how to configure batchSize and newBatchThreshold more efficiently. Below is what I managed to find out about it; please correct me as soon as you find a mistake in my reasoning:
Function is executed on 1 CPU-core environment;
Function polls messages from Queue in batches with size 16 by default and executes them in parallel (right from the documentation);
so I make a conclusion that:
if messages need CPU-intensive tasks - they are executed sequentially;
so I make a conclusion that:
since processing of all messages in a batch starts at the same time (when the batch arrives), the later messages take longer and longer to finish (confirmed experimentally);
all of this longer and longer processing time is billable (even though the function body's actual execution takes 10 times less);
so I make a conclusion that:
One should set both batchSize and newBatchThreshold to 1 for CPU-intensive tasks, and vary them only for non-CPU-intensive (that is, IO-intensive) tasks.
Does it make sense?

Performance issue while using Parallel.ForEach() with MaxDegreeOfParallelism set to ProcessorCount

I wanted to process records from a database concurrently and in minimum time, so I thought of using a Parallel.ForEach() loop to process the records with MaxDegreeOfParallelism set to Environment.ProcessorCount.
ParallelOptions po = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.ForEach(listUsers, po, (user) =>
{
    // Parallel processing
    ProcessEachUser(user);
});
But to my surprise, the CPU utilization was not even close to 20%. When I dug into the issue and read the MSDN article on this (http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism(v=vs.110).aspx), I tried setting MaxDegreeOfParallelism to -1. As the article says, this value removes the limit on the number of concurrently running operations, and the performance of my program improved considerably.
But that still didn't meet my requirement for the maximum time allowed to process all the records in the database. So I analyzed further and found that the thread pool has two settings, MinThreads and MaxThreads. By default, MinThreads is 10 and MaxThreads is 1000. On start, only 10 threads are created, and this number keeps increasing up to the maximum of 1000 with every new user, unless a previous thread has finished its execution.
So I set the initial value of MinThreads to 900 in place of 10 using
System.Threading.ThreadPool.SetMinThreads(900, 900);
so that a minimum of 900 threads would be created right from the start, thinking it would improve performance significantly. It did create 900 threads, but it also greatly increased the number of failures when processing each user, so I did not gain much from this. I then changed the value of MinThreads to just 100 and found that performance was much better.
But I wanted to improve further, as my time requirement was still not met: it was still exceeding the time limit to process all the records. You may think I was already using every possible means to get maximum performance out of parallel processing; I thought the same.
But to meet the time limit I decided to take a shot in the dark. I created two different executable files (slaves) in place of only one and assigned each of them half of the users from the DB. Both executables did the same thing and ran concurrently. I created another master program to start these two slaves at the same time.
To my surprise, it reduced the time taken to process all the records nearly by half.
Now my question is simply that I do not understand why this master/slave arrangement performs better than a single EXE, given that the logic is the same in both the slaves and the previous EXE. I would highly appreciate it if someone could explain this in detail.
But to my surprise, the CPU utilization was not even close to 20%.
…
It uses HTTP requests to some Web APIs hosted on other networks.
This means that CPU utilization is entirely the wrong thing to look at. When using the network, it's your network connection that's going to be the limiting factor, or possibly some network-related limit, certainly not CPU.
Now I created two different executable files … To my surprise, it reduced the time taken to process all the records nearly to the half.
This points to an artificial, per-process limit, most likely ServicePointManager.DefaultConnectionLimit. Try setting it to a larger value than the default at the start of your program and see if it helps.
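For example, a minimal sketch of what that might look like (assuming a classic .NET Framework console app; the value 100 is just an illustrative figure to tune against the parallelism you actually need):

using System.Net;

class Program
{
    static void Main()
    {
        // The default per-endpoint connection limit is small (2 for non-ASP.NET
        // apps), which throttles outbound HTTP calls long before the CPU is busy.
        ServicePointManager.DefaultConnectionLimit = 100;

        // ... kick off the Parallel.ForEach processing of users here ...
    }
}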

Windows Azure worker roles: One big job or many small jobs?

Is there any inherent advantage when using multiple workers to process pieces of procedural code versus processing the entire load?
In other words, if my workflow looks like this:
Get work from queue0 and do A
Store result from A in queue1
Get result from queue1 and do B
Store result from B in queue2
Get result from queue2 and do C
Is there an inherent advantage to using 3 workers who each do the entire process themselves versus 3 workers that each do a part of the work (Worker 1 does 1 & 2, worker 2 does 3 & 4, worker 3 does 5).
If we only care about work being done (i.e., finishing step 5), it would seem to scale the same way (once you're using at least 3 workers). Maybe the big job is better because workers in that setup have fewer bottleneck issues?
In general, the smaller the jobs are, the less work you lose when some process crashes. Also, the smaller the jobs are, the more evenly you'll be able to distribute the work. (Instead of at one point having a single worker instance doing a long job and all the others idle, you'd have all the worker instances doing small pieces of work.)
Setting aside how to break up the work into smaller pieces, there's a question of whether there should be multiple worker roles, each of which can only do one kind of work, or a single worker role (but many instances) that can do everything. I would default to the latter (code that can do everything and just checks all the queues to see what needs to be done), but there are reasons to go with the former. If you need more RAM for one kind of work, for example, you might use a bigger VM size for that worker. Another example is if you wanted to scale the different kinds of work independently.
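To make that concrete, here is a minimal sketch of such a multipurpose worker (assuming the Microsoft.WindowsAzure.Storage queue client; the queue names and the DoA/DoB/DoC helpers are hypothetical stand-ins for steps A, B, and C from the question):

using System.Threading;
using Microsoft.WindowsAzure.Storage.Queue;

public class MultipurposeWorker
{
    private readonly CloudQueue _queue0;
    private readonly CloudQueue _queue1;
    private readonly CloudQueue _queue2;

    public MultipurposeWorker(CloudQueueClient client)
    {
        _queue0 = client.GetQueueReference("queue0");
        _queue1 = client.GetQueueReference("queue1");
        _queue2 = client.GetQueueReference("queue2");
    }

    public void Run()
    {
        while (true)
        {
            // Every instance checks every queue, so whichever stage currently
            // has pending work gets picked up and no instance sits idle.
            CloudQueueMessage msg = _queue0.GetMessage();
            if (msg != null)
            {
                _queue1.AddMessage(new CloudQueueMessage(DoA(msg.AsString)));
                _queue0.DeleteMessage(msg);
                continue;
            }

            msg = _queue1.GetMessage();
            if (msg != null)
            {
                _queue2.AddMessage(new CloudQueueMessage(DoB(msg.AsString)));
                _queue1.DeleteMessage(msg);
                continue;
            }

            msg = _queue2.GetMessage();
            if (msg != null)
            {
                DoC(msg.AsString);
                _queue2.DeleteMessage(msg);
                continue;
            }

            Thread.Sleep(1000); // nothing to do anywhere; back off briefly
        }
    }

    // Placeholder implementations of the three processing steps.
    private string DoA(string input) { return input; }
    private string DoB(string input) { return input; }
    private void DoC(string input) { }
}

Note that each message is deleted only after its result has been stored in the next queue, so a crash mid-step just makes the message reappear after its visibility timeout; that is what gives the smaller-jobs approach its resilience.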
Adding to what #smarx says:
The model of a "multipurpose" worker is of course more general. So even if you require specialized types (like the extra RAM example used above) you would simply have a single task in that particular role.
There's also the extra perspective of cost. You will have an economic incentive to increase the "task density" (as in tasks per instance). If you have M types of work and you assign each one to a different worker role, then you will pay for M instances, even if some of those only do work every once in a while.
I blogged about this some time ago and it is one topic of our guide (chapter "06 week3.docx")
Many frameworks and samples (including ours) use this approach.
