The scaling documentation for Azure Functions is a bit light on details about how Azure Functions decides when to add more instances of an app.
Say, for example, I have a function that is triggered by a GitHub webhook. 10,000 people simultaneously commit to the GitHub repo (with no merge conflicts ;) ), and GitHub calls my function 10,000 times in a very short period of time.
What can I expect to happen? Specifically,
Will Azure Functions throttle the webhook calls? i.e., will Azure Functions reject certain function calls if my function app is under high load?
Does Azure Functions queue the requests somehow? If so, where/how?
How many instances of my function app will Azure Functions create in this scenario? One for each request (i.e., 10,000), and each will run in parallel?
If my function app was scaled down to zero instances because there was no load on it, could I expect to see some "warm-up time" before the first function is executed? Roughly how long?
1. Azure Functions won't reject a webhook call, but in the case of sudden, extreme load, some requests may time out. For web APIs, please include retry logic on the client as a best practice.
2. They aren't queued in any persistent place. They are (implementation detail) managed by IIS.
3. (Implementation detail) The number of instances isn't a hard-set thing. We have certain unpublished protections in place, but we're designed to scale quite far. Your requests will be handled by multiple instances.
4. Yes. Right now the warm-up time is pretty hefty (seconds), but we'll be working to improve it. For perf-sensitive situations, a canary or a timer trigger to keep the app awake is recommended.
I'm from the Azure Functions team. The things I marked as implementation details aren't promises and will likely also change as we evolve our service; just an attempt at transparency.
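As a rough illustration of the client-side retry recommended above, here is a minimal sketch using exponential backoff with jitter. The `call_with_retry` helper, its parameters, and the choice of `TimeoutError` as the retryable failure are assumptions made for this example, not part of any Azure SDK.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    `send` is any zero-argument callable that raises TimeoutError when the
    request times out (an assumption for this sketch).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff (0.5 s, 1 s, 2 s, ...) plus random jitter
            # so many clients do not retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

A caller would wrap its HTTP request in a function and pass it as `send`; the `sleep` parameter exists only so the waiting can be replaced in tests.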
Tested today. It took a lot more than a few seconds :(
ACTUAL PERFORMANCE
--------------
ClientConnected: 13:58:41.589
ClientBeginRequest: 13:58:41.592
GotRequestHeaders: 13:58:41.592
ClientDoneRequest: 13:58:41.592
Determine Gateway: 0ms
DNS Lookup: 65ms
TCP/IP Connect: 40ms
HTTPS Handshake: 114ms
ServerConnected: 13:58:41.703
FiddlerBeginRequest: 13:58:41.816
ServerGotRequest: 13:58:41.817
ServerBeginResponse: 14:00:36.790
GotResponseHeaders: 14:00:36.790
ServerDoneResponse: 14:00:36.790
ClientBeginResponse: 14:00:36.790
ClientDoneResponse: 14:00:36.790
Overall Elapsed: **0:01:55.198**
Related
I have a serverless function that receives orders, about ~30 per day. This function is depending on a third-party API to perform some additional lookups and checks. However, this external endpoint isn't 100% reliable and I need to be able to store order requests if the other API isn't available for a couple of hours (or more..).
My initial thought was to split the function into two, the first part would receive orders, do some initial checks such as validating the order, then post the request into a message queue or pub/sub system. On the other side, there's a consumer that reads orders and tries to perform the API requests, if the API isn't available the orders get posted back into the queue.
However, someone suggested that I simply use an Azure Durable Function for the requests and store the current backlog in the function state, using the Aggregator Pattern (especially since the API will be working fine 99.99..% of the time). This would make the architecture a lot simpler.
What are the advantages/disadvantages of using one over the other, am I missing any important considerations?
I would appreciate any insight or other suggestions you have. Let me know if additional information is needed.
You could solve this problem with Durable Task Framework or Azure Storage or Service Bus Queues, but at your transaction volume, I think that's overcomplicating the solution.
If you're dealing with ~30 orders per day, consider one of the simpler solutions:
- Use Polly, a well-supported resilience and fault-tolerance framework.
- Write request information to your database. Have an Azure Function Timer Trigger read occasionally and finish processing orders that aren't marked as complete.
Durable Task Framework is great when you get into serious volume. But there's a non-trivial learning curve for the framework.
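The database-plus-timer option could be sketched roughly as follows. This uses an in-memory SQLite table purely for illustration; the table layout, the function names, and the use of `ConnectionError` to signal that the third-party API is down are all assumptions, and `process_pending` stands in for the body of the timer-triggered function.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) orders table if it does not exist yet."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, payload TEXT, complete INTEGER)"
    )
    return db


def save_order(db, payload):
    """Persist an incoming order before any call to the external API."""
    cur = db.execute(
        "INSERT INTO orders (payload, complete) VALUES (?, 0)", (payload,)
    )
    db.commit()
    return cur.lastrowid


def process_pending(db, call_api):
    """Timer job body: retry every order not yet marked complete."""
    rows = db.execute(
        "SELECT id, payload FROM orders WHERE complete = 0 ORDER BY id"
    ).fetchall()
    done = 0
    for order_id, payload in rows:
        try:
            call_api(payload)
        except ConnectionError:
            break  # API still unavailable; leave the rest for the next run
        db.execute("UPDATE orders SET complete = 1 WHERE id = ?", (order_id,))
        db.commit()
        done += 1
    return done
```

Because an order is only marked complete after the API call returns, an outage of several hours simply leaves rows pending until a later timer run picks them up.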
We're running our Node backend on Firebase Functions, and have to frequently hit a third-party API (HubSpot), which is rate-limited to 100 requests / 10 seconds.
We're making these requests to HubSpot from our cloud functions, and often find ourselves exceeding HubSpot's rate-limit during campaigns or other website usage spikes. Also, since they are all write requests to update data on HubSpot, these requests cannot be made out of order.
Is there a way to throttle our requests to HubSpot, so as to not exceed their rate limit? Open to suggestions that may not necessarily involve cloud functions, although that would be preferred.
Note: When I say "throttle", I mean that all requests to HubSpot need to go through. I'm trying to achieve something similar to what Lodash's throttle method does, if that makes sense.
What we usually do in this case is store the data into a database, and then pass it over to HubSpot in a tempered way (e.g. without exceeding their rate limit) using a cron that runs every minute. For every data item that we pass to HubSpot successfully, we mark it as "success" in the database.
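To make the "tempered" hand-off concrete, here is a minimal sketch of a worker that sends stored items strictly in order while staying under 100 requests per 10 seconds. The `drain_in_order` function and its parameters are hypothetical; `send` stands in for whatever call writes one item to HubSpot, and the injectable `clock` and `sleep` exist only to make the sketch testable.

```python
import time
from collections import deque


def drain_in_order(items, send, limit=100, window=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Send items in FIFO order, never more than `limit` per `window` seconds."""
    sent_at = deque()  # timestamps of sends still inside the sliding window
    for item in items:
        while True:
            now = clock()
            # Forget sends that have aged out of the window.
            while sent_at and now - sent_at[0] >= window:
                sent_at.popleft()
            if len(sent_at) < limit:
                break
            # Window is full: wait until the oldest send leaves it.
            sleep(window - (now - sent_at[0]))
        send(item)
        sent_at.append(now)
```

Since a single loop does all the sending, ordering is preserved; running several copies of this worker in parallel would break both the ordering and the rate guarantee.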
Cloud Functions cannot be rate limited. It will always attempt to service requests and events as fast as they arrive. But you can use Cloud Tasks to create a task queue that spreads the load of some work over time using a configured rate limit. A task queue can target another HTTP function. This effectively makes your processing asynchronous, but it is really the only mechanism that Google Cloud gives you to smooth out load.
How can I control how often consumers call an API during a given period in an HTTP-triggered Azure Function app? Put simply, how do I throttle requests once they exceed the request limit? Please suggest a solution that does not use Azure API Gateway.
The only control you have over host creation in Azure Functions is an obscure application setting: WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT. This implies that you can control the number of hosts that are generated, though Microsoft claim that “it’s not completely foolproof” and “is not fully supported”.
From my own experience, it only throttles host creation effectively if you set the value to something pretty low, i.e. less than 50. At larger values its impact is pretty limited. It’s been implied that this feature will be worked on in the future, but the corresponding issue has been open on GitHub with no update since July 2017.
For more details, you could refer to this article.
You can use the initialVisibilityDelay parameter of the CloudQueue.AddMessage method, as outlined in this blog post.
This will throttle the message to prevent the 429 error if implemented correctly using the leaky bucket algorithm or equivalent.
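One way to picture this is to compute, for each message in a burst, how long it should stay invisible so that messages surface at a steady drain rate, in the spirit of a leaky bucket. The sketch below only calculates the delays; `visibility_delays` and its parameters are hypothetical names, and passing each value as the initial visibility delay when enqueuing is left to the caller.

```python
from datetime import timedelta


def visibility_delays(message_count, rate_per_second):
    """Return one delay per message so they become visible at a steady rate.

    Message 0 is visible immediately, message 1 after 1/rate seconds,
    message 2 after 2/rate seconds, and so on.
    """
    interval = 1.0 / rate_per_second
    return [timedelta(seconds=i * interval) for i in range(message_count)]
```

For example, `visibility_delays(4, 2)` spaces four messages half a second apart, so a downstream consumer sees at most two new messages per second from that burst.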
I have an Azure function app triggered by an HttpRequest. The function app reads the request, tosses one copy of it into a storage table for safekeeping and sends another copy to a queue for further processing by another element of the system. I have a client running an ApacheBench test that reports approximately 148 requests per second processed. That rate of processing will not be enough for our expected load.
My understanding of function apps is that it should spawn as many instances as is needed to handle the load sent to it. But this function app might not be scaling out quickly enough as it’s only handling that 148 requests per second. I need it to handle at least 200 requests per second.
I’m not 100% sure the problem is on my end, though. In analyzing the performance of my function app I found a LOT of 429 errors. What I found online, particularly https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-manager-request-limits, suggests that these errors could be due to too many requests being sent from a single IP. Would several ApacheBench 10K and 20K request load tests within a given day cause the 429 error?
However, if that’s not it, if the problem is with my function app, how can I force my function app to spawn more instances more quickly? I assume this is the way to get more throughput per second. But I’m still very new at working with function apps so if there is a different way, I would more than welcome your input.
Maybe the Premium app service plan that’s in public preview would handle more throughput? I’ve thought about switching over to that and running a quick test but am unsure if I’d be able to switch back?
Maybe EventHub is something I need to investigate? Is that something that might increase my apparent throughput by catching more requests and holding on to them until the function app could accept and process them?
Thanks in advance for any assistance you can give.
You don't provide much context about your app, but here are a few steps that may improve it:
- If you want more control, use an App Service plan with Always On to avoid cold starts. You will also need to configure autoscaling yourself, since you are responsible for it in this plan and autoscale is not enabled by default in an App Service plan.
- Your Azure function must be fully async, since you have external dependencies and you don't want to block a thread while you are calling them.
- Look at the limits. You can tweak them using host.json.
- A 429 error means that the function is too busy to process your request, so you are probably not using async when writing to the table and are blocking a thread.
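The "fully async" advice can be illustrated with a small sketch in which the table write and the queue send are awaited concurrently rather than one blocking call after another. The two I/O functions below are stand-ins that merely sleep; in a real function app they would be the asynchronous calls offered by the storage client in use.

```python
import asyncio


async def write_to_table(record):
    await asyncio.sleep(0.05)  # stand-in for an async storage table write
    return ("table", record)


async def send_to_queue(record):
    await asyncio.sleep(0.05)  # stand-in for an async queue send
    return ("queue", record)


async def handle_request(record):
    # Both I/O operations run concurrently; while they are pending, the
    # thread is free to serve other requests instead of being blocked.
    return await asyncio.gather(write_to_table(record), send_to_queue(record))
```

Running `asyncio.run(handle_request("r1"))` completes in roughly the time of the slower call rather than the sum of both.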
Function apps work very well and scale as advertised. The errors could be because the requests are coming from a single IP and Azure could be treating them as a DDoS attack. You can try the following:
Azure DevOps Load Test
You can load test using one of the Azure services. I am fairly sure they have better criteria for handling IPs. See Azure DevOps Load Test.
Provision a VM in Azure
What I normally do is provision a VM (Windows 10 Pro) in Azure and use JMeter to load test. I have used this method and it works fine. You can provision a couple of them and divide the load between them.
Use professional load testing services
If possible, you may use services like Loader.io. They use sophisticated algorithms to run the load test and provision a bunch of VMs to run the same test.
Use Application Insights
If you are not already, you should be using Application Insights to get a better view from the server's perspective. Go to the live stream and see how many instances are provisioned to handle the load test. You can easily look into any events and error logs that arise, and you can dive into each associated dependency to investigate the problem.
Today at a customer we analysed the logs of the previous weeks and we found the following issue regarding Windows Azure Service Bus Queues:
The request was terminated because the entity is being throttled.
Please wait 10 seconds and try again.
After verifying the code I told them to use the Transient Fault Handling Application Block (TOPAZ) to implement a retry policy like this one:
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
var retryPolicy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(retryStrategy);
The customer answered:
"Ah that's great, so it will also handle the fact that it should wait
for 10 seconds when throttled."
Come to think about it, I never verified if this was the case or not. I always assumed this was the case. In the Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling assembly I looked for code that would wait for 10 seconds in case of throttling but didn't find anything.
Does this mean that TOPAZ isn't sufficient to create resilient applications? Should this be combined with some custom code to handle throttling (i.e. wait 10 seconds in case of a specific exception)?
As far as throttling is concerned, Topaz provides a set of built-in retry strategies, including:
- Fixed interval
- Incremental intervals
- Random exponential back-off intervals
You can also write your own custom retry strategy and plug it into Topaz.
Also, as Brent indicated, 10 sec wait is not mandatory. In many cases, retrying immediately may succeed without the need to wait. By default, Topaz performs the first retry immediately before using the retry intervals defined by the strategy.
For more info, see Ch.6 of the "Building Elastic and Resilient Cloud Apps" Developer's Guide, also available as epub/mobi/pdf from here.
If you have suggestions/feature requests for Topaz, please submit them via the uservoice.
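To make the discussion concrete, here is a language-neutral sketch (in Python, not the Topaz API) of an incremental retry schedule with an immediate first retry, combined with an optional custom rule that waits at least 10 seconds when the failure is a throttling error. The `ThrottledError` class, both function names, and the exact delay formula are assumptions for illustration; they are not taken from the Topaz source.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the 'entity is being throttled' failure."""


def incremental_delays(retry_count, initial, increment, fast_first_retry=True):
    """Delays before each retry: initial, initial + increment, and so on."""
    delays = [initial + increment * i for i in range(retry_count)]
    if fast_first_retry and delays:
        delays[0] = 0.0  # first retry happens immediately
    return delays


def execute(action, delays, throttle_wait=10.0, sleep=time.sleep):
    """Run action(), retrying once per entry in `delays`."""
    for delay in delays:
        try:
            return action()
        except ThrottledError:
            # Custom rule: when throttled, wait at least `throttle_wait`.
            sleep(max(delay, throttle_wait))
        except ConnectionError:
            sleep(delay)  # ordinary transient fault: use the schedule as-is
    return action()  # final attempt; any exception propagates to the caller
```

With `incremental_delays(5, 1.0, 2.0)` the schedule is 0, 3, 5, 7 and 9 seconds. Whether the extra throttling rule is worth having depends on how often an immediate retry succeeds in practice.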
As I recall, the "10 second" wait isn't a requirement. Additionally, I believe TOPAZ also has backoff capabilities, which would help you overcome this.
On a personal note, I'd argue that simply utilizing something like TOPAZ is not sufficient to create a truly resilient solution. Resiliency goes beyond just throttling on a single connection point; you'll also need to be able to handle failover to a redundant endpoint, which TOPAZ won't do.