Akka messages wait in mailbox even when no messages are currently being processed by the actor - azure-web-app-service

I'm having an issue where every so often Akka messages are being held in an actor's mailbox for longer than necessary, sometimes for up to around 3 seconds.
This is only happening when I run the app (ASP.NET web app) in Azure - there are no delays when I run the code on my laptop.
The app is fairly lightweight, and the CPU and memory on the service plan never go above 60% and 70% respectively. I even tried temporarily upgrading the service plan but that had no impact on the wait times.
I've added in some custom logging to flag when a message has waited in the mailbox for a significant amount of time. (Note the ordering is in descending timestamp)
I've highlighted the messages that are causing the issue - these ones are NOT be waiting for a previous message to finish processing in the actor, so for these the MessageEntered time should be the same as the MessageCreated time.
I'm wondering if there is some kind of thread limit being hit within the app, as lots of other actors are running threads in parallel...?
Any ideas on what's going on / how to resolve this?
Thanks

Related

Understand why Azure App Service has delay to start processing requests (AppInsight Tracing)

We have a doubt about Azure because in some cases we have some dead times when we received requests in one of our AppServices or when a Service Bus triggers, for example, an Azure Functions.
If you see this image, you will see an example:
AppInsight Example Image
We execute a Request and at 5 seconds, but Azure delays more than 30 seconds to start the execution. We made a lot of optimizations in our apps, but we have no visibility about this delay.
Did someone face the same issue and found some solution? We believe it is a performance issue in the Workers, but, this happens also when the Workers are with a low load of memory and CPU. So we don't know how to scale horizontally automatically the resource if it is without load.
This happens also in our AZF, but we believe it's an issue between the Service Bus and the container of the AZF. In these cases we found the AZF has a higher consumption of CPU, but we don't why, because in the local environment we process a lot of messages with multithreading without any problem.

Limit Azure Function restart rate

I already faced similar problem few times:
Azure Function with ServiceBusTrigger by some reason (misconfiguration, infrastructure issues, doesn't really matter) fails to connect to ServiceBus (so it happens on trigger level) and it leads to two issues:
It tries to restart all the time, increasing CPU consumption
It generates literally a millions of exceptions in AppInsights, which leads to quota exceedance
Practically every error in configuration means significantly increased bills and requires thorough monitoring after every deployment, which is annoying and error prone solution.
So, my question: If there is a way to set some delay between restart attempts to (for example) one second? And, as addition - is there way to limit amount of restart attempts and then shut down the Function?
Establishing a connection to the broker to fetch messages is Functions responsibility, Scale Controller. That aspect is entirely abstracted from customers and not configurable. I suggest raising an issue with Azure Functions team, likely under the Runtime repo.

ASP.NET WebApp in Azure using lots of CPU

We have a long running ASP.NET WebApp in Azure which has no real endpoints exposed – it serves a single functional purpose primarily reading and manipulating database data, effectively a batched, scheduled task, triggered by a timer every 30 seconds.
The app runs fine most of the time but we are seeing occasional issues where the CPU load for the app goes close to the maximum for the AppServicePlan, instantaneously rather than gradually, and stops executing any more timer triggers and we cannot find anything explicitly in the executing code to account for it (no signs of deadlocks etc. and all code paths have try/catch so there should be no unhandled exceptions). More often than not we see errors getting a connection to a database but it’s not clear if those are cause or symptoms.
Note, this is the only resource within the AppService Plan. The Azure SQL database is in the same region and whilst utilised by other apps is very lightly used by them and they also exhibit none of the issues seen by the problem app.
It feels like this is infrastructure related but we have been unable to find anything to explain what is happening so if anyone has any suggestions for where we should be looking they would be gratefully received. We have enabled basic Application Insights (not SDK) but other than seeing CPU load spike prior to loss of app response there is little information of interest given our limited knowledge of how to best utilise Insights.
According to your description, I thought of two points to troubleshoot your problem. First of all, you can track the running status of your program through the code, and put a log at the beginning and end of your batch scheduled tasks to record the status of each run. If possible, record request and response information and start and end information. This can completely record the time and running status of your task.
Secondly, you can record logs before the program starts database operations, and whether the database connection is successful. The best case is to be able to record, what business will trigger CPU load when operating, and track the specific operating conditions, in order to specifically analyze what causes the database connection failure.
Because you cannot reproduce your problem, you can only guess the cause of the problem. If you still can't find where the problem is through the above two points, then modify your timer appropriately, and let the program trigger once every 5 minutes instead of 30s.

How does one configure "Autoscale" to deal with Web instances which have long wait times due to external processes?

I am using MVC3, ASP.NET4.5, C#, Razor, EF6.1, SQL Azure
I have been doing some load testing using JMeter, and I have found some surprising results.
I have a test of 30 concurrent users, ramping up over 10 secs. The test plan is fairly simple:
Login
Navigate to page
Do query
Navigate back
Logout
I am using "small" "standard" instances.
I have noticed that web instances may be waiting on external processes, such as databases queries, so the web CPU could be low, but it is still a bottleneck. The CPU could be idling at 40% while waiting for a result set from the DB. So this could also be a reason why the extra instance may not be triggered. Actually this is a real issue. How do you trigger extra instance based on longer wait times? At the moment the only way round this is to have 2 instances up there permanently, or proactively set it up against a schedule.
Use async calls and you won't have to worry about scaling up. The waiting threads will be asleep, freeing up resources to handle other users.
If you still see lengthened response times after that it's probably the external process that's choking and in need of being scaled up

Orchestrating a Windows Azure web role to cope with occasional high workload

I'm running a Windows Azure web role which, on most days, receives very low traffic, but there are some (foreseeable) events which can lead to a high amount of background work which has to be done. The background work consists of many database calls (Azure SQL) and HTTP calls to external web services, so it is not really CPU-intensive, but it requires a lot of threads which are waiting for the database or the web service to answer. The background work is triggered by a normal HTTP request to the web role.
I see two options to orchestrate this, and I'm not sure which one is better.
Option 1, Threads: When the request for the background work comes in, the web role starts as many threads as necessary (or queues the individual work items to the thread pool). In this option, I would configure a larger instance during the heavy workload, because these threads could require a lot of memory.
Option 2, Self-Invoking: When the request for the background work comes in, the web role which receives it generates a HTTP request to itself for every item of background work. In this option, I could configure several web role instances, because the load balancer of Windows Azure balances the HTTP requests across the instances.
Option 1 is somewhat more straightforward, but it has the disadvantage that only one instance can process the background work. If I want more than one Azure instance to participate in the background work, I don't see any other option than sending HTTP requests from the role to itself, so that the load balancer can delegate some of the work to the other instances.
Maybe there are other options?
EDIT: Some more thoughts about option 2: When the request for the background work comes in, the instance that receives it would save the work to be done in some kind of queue (either Windows Azure Queues or some SQL table which works as a task queue). Then, it would generate a lot of HTTP requests to itself, so that the load balancer 'activates' all of the role instances. Each instance then dequeues a task from the queue and performs the task, then fetches the next task etc. until all tasks are done. It's like occasionally using the web role as a worker role.
I'm aware this approach has a smelly air (abusing web roles as worker roles, HTTP requests to the same web role), but I don't see the real disadvantages.
EDIT 2: I see that I should have elaborated a little bit more about the exact circumstances of the app:
The app needs to do some small tasks all the time. These tasks usually don't take more than 1-10 seconds, and they don't require a lot of CPU work. On normal days, we have only 50-100 tasks to be done, but on 'special days' (New Year is one of them), they could go into several 10'000 tasks which have to be done inside of a 1-2 hour window. The tasks are done in a web role, and we have a Cron Job which initiates the tasks every minute. So, every minute the web role receives a request to process new tasks, so it checks which tasks have to be processed, adds them to some sort of queue (currently it's an SQL table with an UPDATE with OUTPUT INSERTED, but we intend to switch to Azure Queues sometime). Currently, the same instance processes the tasks immediately after queueing them, but this won't scale, since the serial processing of several 10'000 tasks takes too long. That's the reason why we're looking for a mechanism to broadcast the event "tasks are available" from the initial instance to the others.
Have you considered using Queues for distribution of work? You can put the "tasks" which needs to be processed in queue and then distribute the work to many worker processes.
The problem I see with approach 1 is that I see this as a "Scale Up" pattern and not "Scale Out" pattern. By deploying many small VM instances instead of one large instance will give you more scalability + availability IMHO. Furthermore you mentioned that your jobs are not CPU intensive. If you consider X-Small instance, for the cost of 1 Small instance ($0.12 / hour), you can deploy 6 X-Small instances ($0.02 / hour) and likewise for the cost of 1 Large instance ($0.48) you could deploy 24 X-Small instances.
Furthermore it's easy to scale in case of a "Scale Out" pattern as you just add or remove instances. In case of "Scale Up" (or "Scale Down") pattern since you're changing the VM Size, you would end up redeploying the package.
Sorry, if I went a bit tangential :) Hope this helps.
I agree with Gaurav and others to consider one of the Azure Queue options. This is really a convenient pattern for cleanly separating concerns while also smoothing out the load.
This basic Queue-Centric Workflow (QCW) pattern has the work request placed on a queue in the handling of the Web Role's HTTP request (the mechanism that triggers the work, apparently done via a cron job that invokes wget). Then the IIS web server in the Web Role goes on doing what it does best: handling HTTP requests. It does not require any support from a load balancer.
The Web Role needs to accept requests as fast as they come (then enqueues a message for each), but the dequeue part is a pull so the load can easily be tuned for available capacity (or capacity tuned for the load! this is the cloud!). You can choose to handle these one at a time, two at a time, or N at a time: whatever your testing (sizing exercise) tells you is the right fit for the size VM you deploy.
As you probably also are aware, the RoleEntryPoint::Run method on the Web Role can also be implemented to do work continually. The default implementation on the Web Role essentially just sleeps forever, but you could implement an infinite loop to query the queue to remove work and process it (and don't forget to Sleep whenever no messages are available from the queue! failure to do so will cause a money leak and may get you throttled). As Gaurav mentions, there are some other considerations in robustly implementing this QCW pattern (what happens if my node fails, or if there's a bad ("poison") message, bug in my code, etc.), but your use case does not seem overly concerned with this since the next kick from the cron job apparently would account for any (rare, but possible) failures in the infrastructure and perhaps assumes no fatal bugs (so you can't get stuck with poison messages), etc.
Decoupling placing items on the queue from processing items from the queue is really a logical design point. By this I mean you could change this at any time and move the processing side (the code pulling from the queue) to another application tier (a service tier) rather easily without breaking any part of the essential design. This gives a lot of flexibility. You could even run everything on a single Web Role node (or two if you need the SLA - not sure you do based on some of your comments) most of the time (two-tier), then go three-tier as needed by adding a bunch of processing VMs, such as for the New Year.
The number of processing nodes could also be adjusted dynamically based on signals from the environment - for example, if the queue length is growing or above some threshold, add more processing nodes. This is the cloud and this machinery can be fully automated.
Now getting more speculative since I don't really know much about your app...
By using the Run method mentioned earlier, you might be able to eliminate the cron job as well and do that work in that infinite loop; this depends on complexity of cron scheduling of course. Or you could also possibly even eliminate the entire Web tier (the Web Role) by having your cron job place work request items directly on the queue (perhaps using one of the SDKs). You still need code to process the requests, which could of course still be your Web Role, but at that point could just as easily use a Worker Role.
[Adding as a separate answer to avoid SO telling me to switch to chat mode + bypass comments length limitation] & thinking out loud :)
I see your point. Basically through HTTP request, you're kind of broadcasting the availability of a new task to be processed to other instances.
So if I understand correctly, when an instance receives request for the task to be processed, it pushes that request in some kind of queue (like you mentioned it could either be Windows Azure Queues [personally I would actually prefer that] or SQL Azure database [Not prefer that because you would have to implement your own message locking algorithm]) and then broadcast a message to all instances that some work needs to be done. Remaining instances (or may be the instance which is broadcasting it) can then see if they're free to process that task. One instance depending on its availability can then fetch the task from the queue and start processing that task.
Assuming you used Windows Azure Queues, when an instance fetched the message, it becomes unavailable to other instances immediately for some amount of time (visibility timeout period of Azure queues) thus avoiding duplicate processing of the task. If the task is processed successfully, the instance working on that task can delete the message.
If for some reason, the task is not processed, it will automatically reappear in the queue after visibility timeout period has expired. This however leads to another problem. Since your instances look for tasks based on a trigger (generating HTTP request) rather than polling, how will you ensure that all tasks get done? Assuming you get to process just one task and one task only and it fails since you didn't get a request to process the 2nd task, the 1st task will never get processed again. Obviously it won't happen in practical situation but something you might want to think about.
Does this make sense?
i would definitely go for a scale out solution: less complex, more manageable and better in pricing. Plus you have a lesser risk on downtime in case of deployment failure (of course the mechanism of fault and upgrade domains should cover that, but nevertheless). so for that matter i completely back Gaurav on this one!

Resources