Azure worker role instance got stuck

I have a continuously running Worker role that executes multiple jobs. The jobs process queue messages. Normally, if there is an exception or any other problem, the job fails, the message goes back onto the queue, and the job tries to reprocess it.
But since last month I have been facing a weird issue: no messages had been processed for a day or so. I investigated on the Azure Portal and saw that the worker role instance still had a "running" status. For some reason the job did not time out or quit, but all the messages were sitting in the queue, unprocessed.
No logs were written and no exceptions or errors were thrown (I have a decent amount of logging and exception handling in the method).
I restarted the worker role via the Azure Portal, and once that happened, all of the backed up queue messages began processing immediately.
Can anyone offer solutions or suggestions for handling this case?

RDP to the VM and troubleshoot it just like you would on-prem. What do the performance counters show you? Is your process (or any other) consuming CPU? Anything in the event logs? Take a hang dump of WaWorkerHost.exe and check the call stacks to see what your code is doing, or whether it is stuck in something like a deadlock or an infinite loop.
You can also check the guest agent and host bootstrapper logs (see https://blogs.msdn.microsoft.com/kwill/2013/08/09/windows-azure-paas-compute-diagnostics-data/), but since you said the portal was reporting that the instance was in the Ready state, I don't think you will find anything there. It sounds like 'Azure' (the role host processes) is working fine and it is something within WaWorkerHost.exe (your code) that is the problem.
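Beyond after-the-fact debugging, one mitigation (my suggestion, not part of the answer above) is to make a silent hang visible in the logs. A minimal sketch, assuming a standard RoleEntryPoint; ProcessNextQueueMessage stands in for your existing job logic:

    using System;
    using System.Diagnostics;
    using System.Threading;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WorkerRole : RoleEntryPoint
    {
        // Ticks of the last completed unit of work; accessed via Interlocked so
        // the heartbeat thread always reads a consistent value.
        private long _lastProgressTicks = DateTime.UtcNow.Ticks;

        public override void Run()
        {
            // Heartbeat thread: keeps logging even if the main loop silently hangs,
            // so "status Running but nothing processed" shows up in the logs.
            var heartbeat = new Thread(() =>
            {
                while (true)
                {
                    var last = new DateTime(Interlocked.Read(ref _lastProgressTicks), DateTimeKind.Utc);
                    Trace.TraceInformation("Heartbeat: last progress at {0:u}", last);
                    Thread.Sleep(TimeSpan.FromMinutes(1));
                }
            }) { IsBackground = true };
            heartbeat.Start();

            while (true)
            {
                ProcessNextQueueMessage(); // your existing job logic
                Interlocked.Exchange(ref _lastProgressTicks, DateTime.UtcNow.Ticks);
            }
        }

        private void ProcessNextQueueMessage() { /* dequeue and handle one message */ }
    }

A growing gap between the heartbeat timestamp and the current time tells you the loop is stuck long before the queue backs up by a day.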

Related

ASP.NET WebApp in Azure using lots of CPU

We have a long-running ASP.NET WebApp in Azure which exposes no real endpoints; it serves a single functional purpose, primarily reading and manipulating database data. It is effectively a batched, scheduled task, triggered by a timer every 30 seconds.
The app runs fine most of the time, but we are seeing occasional issues where the CPU load for the app goes close to the maximum for the App Service Plan, instantaneously rather than gradually, and the app stops executing any further timer triggers. We cannot find anything in the executing code to account for it (no signs of deadlocks etc., and all code paths have try/catch, so there should be no unhandled exceptions). More often than not we see errors getting a connection to the database, but it's not clear whether those are cause or symptom.
Note, this is the only resource within the AppService Plan. The Azure SQL database is in the same region and whilst utilised by other apps is very lightly used by them and they also exhibit none of the issues seen by the problem app.
It feels like this is infrastructure-related, but we have been unable to find anything to explain what is happening, so any suggestions for where we should be looking would be gratefully received. We have enabled basic Application Insights (not the SDK), but other than seeing the CPU load spike before the app stops responding, there is little information of interest, given our limited knowledge of how best to utilise Insights.
Based on your description, I can think of two troubleshooting steps. First, track the running status of your program through code: write a log entry at the beginning and end of each batch run so you record the status of every execution, and if possible record request/response information as well. This gives you a complete record of when each task ran and how it went (a sketch of this kind of logging follows).
Second, write a log entry before the program starts any database operation, recording whether the database connection succeeded. Ideally, also record which operation was running when the CPU load spiked, so you can analyze specifically what causes the database connection failures.
Because you cannot reproduce the problem, you can only narrow the cause down by elimination. If you still can't find it with the two steps above, try changing your timer to trigger once every 5 minutes instead of every 30 seconds.
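A minimal sketch of the suggested start/end logging, using System.Diagnostics tracing; the ScheduledTask name and log wording are illustrative:

    using System;
    using System.Diagnostics;

    public static class ScheduledTask
    {
        public static void RunOnce()
        {
            var runId = Guid.NewGuid();        // correlates all entries for one run
            var sw = Stopwatch.StartNew();
            Trace.TraceInformation("Run {0} started", runId);
            try
            {
                Trace.TraceInformation("Run {0}: opening database connection", runId);
                // OpenConnectionAndDoWork();  // your existing batch logic goes here
                Trace.TraceInformation("Run {0} finished in {1} ms", runId, sw.ElapsedMilliseconds);
            }
            catch (Exception ex)
            {
                // The last "started" entry without a matching "finished" entry
                // pinpoints the run that was in flight when the CPU spiked.
                Trace.TraceError("Run {0} failed after {1} ms: {2}", runId, sw.ElapsedMilliseconds, ex);
                throw;
            }
        }
    }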

Unhealthy Worker role is not restarting and waiting for the status

I have a worker role with 4 instances and an external TCP/IP endpoint. After several days of running without problems, the instances began to die, one after another.
Within 24 hours, all of them were in the status:
"Waiting for the status (Role has reported itself as unhealthy.)".
All of them were writing "Working" verbose messages to the log (from the Run method), but not accepting any incoming connections. How could this happen? An unhandled exception from some thread?
And why were none of them restarted even after several hours; why were they all just "waiting for the status"?
If they're endlessly repeating the startup-shutdown-restart cycle, then one possibility is that you have a startup task that works the first time it's run but fails on subsequent runs, i.e., it is not idempotent (a sketch of an idempotency guard appears below).
Otherwise, it could be that something in your startup process is causing a problem, but it's hard to say without more information. Try to RDP into the machines and look at the Windows Event Logs, and anything else you can think of, for clues.
You might also try Reimaging one or two instances and see if that fixes the problem. That would at least tell you if the problem is in the base code, or if the VM got into some unrecoverable state.
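Not from the original answer, but to make the idempotency point concrete: startup tasks are commonly .cmd scripts, though the guard pattern is the same in any language. A hypothetical C# console startup task:

    using System.IO;

    class StartupTask
    {
        static void Main()
        {
            // Illustrative marker path; anything on a disk that survives reboots
            // (but not reimaging) works.
            const string marker = @"C:\startup-task.done";

            if (File.Exists(marker))
                return;                // already ran on this VM: do nothing, exit 0

            InstallDependencies();     // the one-time work that would fail if run twice
            File.WriteAllText(marker, "done");
        }

        static void InstallDependencies() { /* e.g. installers, registry edits */ }
    }

Note that reimaging wipes the marker along with everything else, which is consistent with the suggestion above: if the role works after a reimage, a non-idempotent startup step is a prime suspect.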

Creating a Web Crawler using Windows Azure

I want to create a web crawler that takes the content of some website and saves it in blob storage. What is the right way to do that on Azure? Should I start a Worker role and use the Thread.Sleep method to make it run once a day?
I also wonder how it would work if I created two instances of this Worker role. I noticed in the Compute Emulator UI that "Trace.WriteLine" output shows up on both instances at the same time; can someone clarify this point?
I created the same crawler using PHP with a cron job to start the script once a day, but it took 6 hours to grab the whole content; that's why I want to use Azure.
This is the right way to do it: as of January 2014, Microsoft has introduced Azure WebJobs, where you can create a project (a console app, for example) and run it as a scheduled task (one-off or recurring):
https://azure.microsoft.com/en-us/documentation/articles/web-sites-create-web-jobs/
http://www.hanselman.com/blog/IntroducingWindowsAzureWebJobs.aspx
Considering that a worker role is basically Windows 2008 Server, you can run the same code you'd run on-premises.
Consider, though, that there are several reasons why a role instance might reboot: OS updates, crash, etc. In these cases, it's possible you'd lose the work being done. So... you can handle this in a few ways:
Queue. Place a message on a command queue. If it's a once-a-day task, you can just push the next message onto the queue when you're done processing the previous one. Note that you can put an invisibility timeout on the message so it doesn't appear for a day. In the event of a failure during processing, the message will reappear on the queue and a different instance can pick it up. You can also modify the message as you go, to keep track of your status (see the sketch after this list).
Scheduler. Just make sure there's only one instance running (by way of a mutex). An easy way to do this is to attempt to acquire a lease on a blob (only one lease can be held at a time).
One other thing to consider is breaking your web crawl into separate tasks (individual URLs?) and placing those on the queue individually. That way you'd be able to scale, running either multiple instances or, potentially, multiple threads in the same instance (since web crawling is likely to be a blocking I/O operation rather than a CPU- and bandwidth-intensive one).
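A minimal sketch of the queue option, assuming the Microsoft.WindowsAzure.Storage client (2.x); the queue name, connection string, and Crawl method are illustrative:

    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    class CrawlScheduler
    {
        static void Main()
        {
            var account = CloudStorageAccount.Parse("<storage connection string>");
            var queue = account.CreateCloudQueueClient().GetQueueReference("crawler-commands");
            queue.CreateIfNotExists();

            while (true)
            {
                // Generous visibility timeout: if this instance dies mid-crawl,
                // the message reappears and another instance picks the work up.
                var msg = queue.GetMessage(visibilityTimeout: TimeSpan.FromHours(1));
                if (msg == null) { Thread.Sleep(30000); continue; }

                Crawl(msg.AsString);       // your crawl logic
                queue.DeleteMessage(msg);

                // Schedule tomorrow's run: the next command stays invisible for a day.
                queue.AddMessage(new CloudQueueMessage("crawl"),
                                 initialVisibilityDelay: TimeSpan.FromDays(1));
            }
        }

        static void Crawl(string command) { /* fetch pages, save to blob storage */ }
    }

The same SDK also covers the mutex idea from the Scheduler option: CloudBlockBlob.AcquireLease hands out a lease that only one instance can hold at a time.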
A single worker role running once a day is probably the best approach. I would not use Thread.Sleep, though: the instance may restart, and then the task may, depending on your programming, run earlier or later than the intended one-day interval. Instead, put the task command as a message on an Azure queue, dequeue it when a worker role picks it up, and add a new task command to the queue once processing is done.

Azure ServiceBus Queues -- taking a long time to get local messages?

I'm working through a basic tutorial on the ServiceBus. A web role adds objects to a ServiceBus Queue, while a worker role reads those messages off the queue and marks them complete. This is all within the local environment (compute emulator).
It seems like it should be incredibly simple, but I'm seeing the following behavior:
The call QueueClient.Receive() is always timing out.
Some messages are just hanging out in the queue and are not being picked up by the worker.
What could be going on? How can I debug the state of these messages?
You can check the length of the queue (from the portal, or by looking at the MessageCount property).
Another possibility is that the messages are dead-lettered. You can read from the dead-letter subqueue; a sketch follows.
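A sketch of both checks, assuming the Microsoft.ServiceBus brokered-messaging client; the queue path and connection string are placeholders:

    using System;
    using Microsoft.ServiceBus;
    using Microsoft.ServiceBus.Messaging;

    class DeadLetterCheck
    {
        static void Main()
        {
            const string connStr = "<service bus connection string>";
            const string path = "myqueue";

            // 1. How many messages are sitting in the queue?
            var ns = NamespaceManager.CreateFromConnectionString(connStr);
            Console.WriteLine("MessageCount: {0}", ns.GetQueue(path).MessageCount);

            // 2. Peek at the dead-letter subqueue.
            var dlqPath = QueueClient.FormatDeadLetterPath(path);
            var dlq = QueueClient.CreateFromConnectionString(connStr, dlqPath, ReceiveMode.PeekLock);
            BrokeredMessage msg;
            while ((msg = dlq.Receive(TimeSpan.FromSeconds(5))) != null)
            {
                Console.WriteLine("Dead-lettered: {0} ({1})",
                    msg.MessageId,
                    msg.Properties.ContainsKey("DeadLetterReason")
                        ? msg.Properties["DeadLetterReason"] : "no reason recorded");
                msg.Abandon();   // leave the message in the subqueue
            }
        }
    }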
First of all, please make sure you indeed have some messages in the queue. I suggest running the completed solution from this tutorial: http://msdn.microsoft.com/en-us/WAZPlatformTrainingCourse_ServiceBusMessaging. If that works fine, compare your code with the sample code. If that doesn't work either, it is likely a configuration or network issue; check whether you have properly configured your Service Bus account and whether you're able to access the internet from your machine.
Best Regards,
Ming Xu.

WF4 Affinity on Windows Azure and other NLB environments

I'm using Windows Azure and WF4, and my workflow service is hosted in a web role (with N instances). My job now is to find out how to handle affinity, so that I can send messages to the right workflow instance. To explain the scenario: my workflow (attached) starts with a "StartWorkflow" receive activity, creates 3 "Person" objects and, in a parallel for-each, waits for the confirmation of these 3 people (a "ConfirmCreation" receive activity).
I then started researching how affinity is handled in other NLB environments (mainly looking for information about how this works in Windows Server AppFabric), but I didn't find a precise answer. So how is it done in other NLB environments?
My next task is to find out how I could implement a system to handle this affinity on Windows Azure, and how much such a solution would cost (in price, time, and amount of work), to see whether it's viable or whether it's better to work with only one web role instance while we wait for the WF4 host for Azure AppFabric. The only way I found was to persist the workflow instance. Are there other ways of doing this?
My third, but not last, task is to find out how WF4 handles multiple messages received at the same time. In my scenario, this means: what happens if all 3 people confirm at the same time and the confirmation messages arrive simultaneously? Since the most logical answer to this problem seems to be a queue, I started looking for information about queues in WF4 and found people talking about MSMQ. But what is WF4's native message-handling mechanism? Is it really a queue, or is it another system? How is this concurrency handled?
You shouldn't need any affinity. In fact that's kinda the whole point of durable Workflows. Whilst your workflow is waiting for this confirmation it should be persisted and unloaded from any one server.
As far as persistence goes for Windows Azure you would either need to hack the standard SQL persistence scripts so that they work on SQL Azure or write your own InstanceStore implementation that sits on top of Azure Storage. We have done the latter for a workflow we're running in Azure, but I'm unable to share the code. On a scale of 1 to 10 for effort, I'd rank it around an 8.
As far as multiple messages go, they will be received and delivered to the workflow instance one at a time. Now, it's possible that every one of those messages goes to the same server, or each one may go to a different server. Either way, the workflow runtime will attempt to load the workflow from the instance store, see that it is currently locked, and block/retry until the workflow becomes available to process the next message. So you don't have to worry about concurrent access to the same workflow instance, as long as you configure everything correctly and the InstanceStore implementation is doing its job.
Here's a few other suggestions:
Make sure you use the PersistBeforeSend option on your SendReply activities
Configure the following workflow service options (a fuller config sketch follows below):
<workflowIdle timeToUnload="00:00:00" />
<sqlWorkflowInstanceStore ... instanceLockedExceptionAction="AggressiveRetry" />
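For context, a sketch of where those two settings sit in the service's web.config, assuming the standard .NET 4 configuration schema; the connection-string name is illustrative:

    <system.serviceModel>
      <behaviors>
        <serviceBehaviors>
          <behavior>
            <sqlWorkflowInstanceStore connectionStringName="WorkflowInstanceStore"
                                      instanceLockedExceptionAction="AggressiveRetry"
                                      instanceCompletionAction="DeleteAll" />
            <!-- Unload immediately when idle so any instance can resume the workflow -->
            <workflowIdle timeToUnload="00:00:00" />
          </behavior>
        </serviceBehaviors>
      </behaviors>
    </system.serviceModel>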
Using the out-of-the-box SQL instance store with SQL Azure is a bit of a problem at the moment with the Azure 1.3 SDK: each deployment, even with zero code changes, results in a new service deployment, meaning that already-persisted workflows can't continue. That is a bug that will be fixed, but it's a PITA for now.
As Drew said, your workflow instance should just move from server to server as needed; there is no need to pin it to a specific machine. Even if you could, that would hurt scalability and reliability, so it's something to be avoided.
Sending messages through MSMQ using the WCF NetMsmqBinding works just fine. Internally, WF uses a completely different mechanism called bookmarks that allows a workflow to stop and resume. Each Receive activity, as well as others like Delay, will create a bookmark and wait for it to be resumed. You can only resume existing bookmarks. Even resuming a bookmark is not a direct action: it is put into an internal queue (not MSMQ) by the workflow scheduler and executed through a SynchronizationContext. You get no control over the scheduler, but you can replace the SynchronizationContext when using WorkflowApplication and so get some control over how and where activities are executed.
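To make the bookmark mechanism concrete, a minimal sketch using plain WF4 (System.Activities); the activity name, bookmark name, and payload are illustrative:

    using System;
    using System.Activities;
    using System.Threading;

    // Idles the workflow until a bookmark named "ConfirmCreation" is resumed;
    // the resumed value becomes the activity's result.
    public sealed class WaitForConfirmation : NativeActivity<string>
    {
        // Returning true lets the runtime persist and unload the instance while it waits.
        protected override bool CanInduceIdle { get { return true; } }

        protected override void Execute(NativeActivityContext context)
        {
            context.CreateBookmark("ConfirmCreation",
                (ctx, bookmark, value) => ctx.SetValue(Result, (string)value));
        }
    }

    class Program
    {
        static void Main()
        {
            var idled = new AutoResetEvent(false);
            var done = new AutoResetEvent(false);

            var app = new WorkflowApplication(new WaitForConfirmation());
            app.Idle = e => idled.Set();
            app.Completed = e => { Console.WriteLine(e.Outputs["Result"]); done.Set(); };

            app.Run();
            idled.WaitOne();   // the workflow is now parked on its bookmark

            // Resumption goes through the scheduler's internal work queue, not MSMQ.
            app.ResumeBookmark("ConfirmCreation", "confirmed");
            done.WaitOne();
        }
    }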
