Deleting dead topics in Azure Service Bus

I've tried to do my homework on this issue, but no search I've made has gotten me closer to an answer. The closest hit was Detect and Delete Orphaned Queues, Topics, or Subscriptions on Azure Service Bus.
My scenario:
I have multiple services running (standard Windows services). At startup these processes subscribe to a given topic in Azure Service Bus. Let's call the topic "Messages".
When a service is shut down, it unsubscribes cleanly.
But sometimes stuff happens and the service crashes, so the unsubscription never runs and the subscription is left hanging.
My questions:
1) From what I'm seeing, every message sent to the topic is also delivered to each dead subscription, even though no one is ever going to pick it up. Fact or fiction?
2) Is there any way to remove subscriptions that haven't been checked for a while, say in the last 24h? Preferably via a PowerShell script?
I've raised this issue directly with Microsoft but haven't received any answer yet. Surely, I can't be the first to experience this. I'll also update this if I get any third party info.
Thanks
Johan

In the Azure SDK 2.0 release we have addressed this scenario with the AutoDeleteOnIdle feature. It lets you set a timespan on a Queue/Topic/Subscription, and when no activity is detected for the specified duration, the entity is automatically deleted. See details here, and the property to set is here.
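For illustration, a minimal sketch of setting that property when creating a subscription with the SDK 2.0 NamespaceManager API (the connection string and subscription name are placeholders):

```csharp
// Sketch: create a subscription that is deleted automatically after 24h of inactivity.
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

class CreateAutoDeleteSubscription
{
    static void Main()
    {
        var ns = NamespaceManager.CreateFromConnectionString(
            "Endpoint=sb://yournamespace.servicebus.windows.net/;..."); // placeholder

        var description = new SubscriptionDescription("Messages", "service-instance-1")
        {
            // If no activity is detected for this duration, the entity is removed.
            AutoDeleteOnIdle = TimeSpan.FromHours(24)
        };
        ns.CreateSubscription(description);
    }
}
```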

On your 1) question: yes, messages sent to a topic are delivered to every matching subscription, even one that is idle (idle based on your own logic). A subscription is a permanent artifact that you create and that stays open to receive messages, even when no service is dequeuing them.
To clean out subscriptions, you can use the AccessedAt property of the SubscriptionDescription to check when someone last read from the subscription (via a Receive operation):
http://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.subscriptiondescription.accessedat.aspx
If you use that logic, you can build your own 'cleansing' mechanism.
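As a hedged sketch of such a mechanism (the connection string and the 24-hour cutoff are assumptions):

```csharp
// Sketch: delete any subscription on the "Messages" topic that has not been
// accessed (via a Receive) in the last 24 hours.
using System;
using System.Linq;
using Microsoft.ServiceBus;

class SubscriptionSweeper
{
    static void Main()
    {
        var ns = NamespaceManager.CreateFromConnectionString(
            "Endpoint=sb://yournamespace.servicebus.windows.net/;..."); // placeholder

        DateTime cutoff = DateTime.UtcNow.AddHours(-24);
        foreach (var sub in ns.GetSubscriptions("Messages")
                              .Where(s => s.AccessedAt < cutoff))
        {
            Console.WriteLine("Removing idle subscription: " + sub.Name);
            ns.DeleteSubscription("Messages", sub.Name);
        }
    }
}
```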
HTH

Related

Diagnosing errors in StreamAnalytics Jobs

I've got a series of services that generate events, which are written to an Azure Event Hub. The hub is connected to a Stream Analytics job that takes the event information and writes it to Azure Table Storage and a Data Lake Store for later analysis by different teams and tools.
One of my services is reporting all events correctly, but the other isn't. After hooking up a listener to the hub, I can see its events are being sent without a problem, but they aren't being processed or forwarded to the job's sinks.
In the audit logs I see periodic transformation errors for one of the columns that's written to storage, but looking at the data there's no problem with the format, and I can't find a way to inspect the troubled events that are causing these failures.
The only error I see on the Management Services is
We are experiencing issues writing output for output TSEventStore right now. We will try again soon.
It sounds like there may be two issues:
1) The writing to the TableStorage TSEventStore is failing.
2) There are some data conversion errors.
I would suggest troubleshooting one at a time. For the first one: are any events being written to TSEventStore at all? Is there another message in the operations logs that gives more detail on why writing is failing?
For the second one, today we don't have a way to output the events that hit data conversion errors. The best workaround is to output the data to just one sink (the Data Lake) and examine it there.
Thanks,
Kati

Worker Role goes Cycling... after some time, how can I get an alert?

I have a worker role deployed that works fine for a period of time (days...), but at some point it stops or crashes, and then it can't restart at all and stays "Cycling...". The only solution is to reimage the role.
How can I set an automatic alert so I get an email when the Role becomes unresponsive (and Cycling...) ?
Thanks
Alerts or notifications like this are not available today, but they are being worked on. If this is causing service interruptions you could always sign up for an external monitoring service which will send you alerts whenever your site is down.
However, I would recommend solving the root cause of the problem rather than just Reimaging it to fix the symptom. Here is how I would start:
You are most likely hitting the issue described in http://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx. In particular, see #1 under Common Issues, where it talks about common causes for a role to not restart properly after being rebooted due to OS updates. Notice that #1 also talks about how to simulate these types of Azure environment issues (i.e. manually doing a Reboot from the portal) so you can reproduce the failure and debug it.
To troubleshoot the issue I would recommend reading through the troubleshooting series at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx. Of particular interest to you is probably the "Troubleshooting Scenario 2 – Role Recycling After Running Fine For 2 Weeks"
Azure cannot notify you of such conditions. Consider placing a try/catch around the loop in your WorkerRole, with a catch that can email you in case of an issue.
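A minimal sketch of that pattern; SendAlertEmail is a hypothetical helper you would wire up to an SMTP client or mail service:

```csharp
// Sketch: catch anything that escapes the worker loop and notify yourself
// before the role instance recycles.
using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        try
        {
            while (true)
            {
                // ... actual work goes here ...
                Thread.Sleep(1000);
            }
        }
        catch (Exception ex)
        {
            SendAlertEmail("Worker role crashed: " + ex);
            throw; // let Azure recycle the instance
        }
    }

    static void SendAlertEmail(string body)
    {
        // Hypothetical helper: implement with SmtpClient or a mail service of your choice.
    }
}
```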
Alternatively, if you're open to using third-party services, consider AzureWatch (I'm affiliated with the product). It can alert you in case your instance becomes Unresponsive, Busy, or passes through other non-Ready statuses.

Azure ServiceBus Queues -- taking a long time to get local messages?

I'm working through a basic tutorial on the ServiceBus. A web role adds objects to a ServiceBus Queue, while a worker role reads those messages off the queue and marks them complete. This is all within the local environment (compute emulator).
It seems like it should be incredibly simple, but I'm seeing the following behavior:
The call QueueClient.Receive() is always timing out.
Some messages are just hanging out in the queue and are not being picked up by the worker.
What could be going on? How can I debug the state of these messages?
You can check the length of the queue (from the portal, or by looking at the MessageCount property).
Another possibility is that the messages have been dead-lettered. You can read from the dead-letter subqueue using this sample code.
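Here is a sketch of both checks, assuming a queue named "orders" and a placeholder connection string:

```csharp
// Sketch: report the queue length, then drain the dead-letter subqueue.
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

class QueueInspector
{
    static void Main()
    {
        const string conn = "Endpoint=sb://yournamespace.servicebus.windows.net/;..."; // placeholder

        var ns = NamespaceManager.CreateFromConnectionString(conn);
        Console.WriteLine("Messages in queue: " + ns.GetQueue("orders").MessageCount);

        // Dead-lettered messages live in a well-known subqueue.
        var deadLetterClient = QueueClient.CreateFromConnectionString(
            conn, QueueClient.FormatDeadLetterPath("orders"));

        BrokeredMessage msg;
        while ((msg = deadLetterClient.Receive(TimeSpan.FromSeconds(5))) != null)
        {
            object reason;
            msg.Properties.TryGetValue("DeadLetterReason", out reason);
            Console.WriteLine("Dead-lettered {0}, reason: {1}", msg.MessageId, reason);
            msg.Complete();
        }
    }
}
```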
First of all, please make sure you indeed have some messages in the queue. I would suggest running the finished solution of this tutorial: http://msdn.microsoft.com/en-us/WAZPlatformTrainingCourse_ServiceBusMessaging. If that works fine, compare your code with the sample code. If that doesn't work either, it is likely a configuration issue or a network issue. In that case, I would recommend checking that you have properly configured the Service Bus account and that you're able to access the internet from your machine.
Best Regards,
Ming Xu.

What happens to Azure diagnostic information when a role stops?

When an Azure worker role stops (either because of an unhandled exception or because Run() finishes), what happens to local diagnostic information that has not yet been transferred? Microsoft documentation says diagnostics are transferred to storage at scheduled intervals or on demand, neither of which can cover an unhandled exception. Does this mean diagnostic information is always lost in this case? This seems particularly odd because crash dumps are part of the diagnostic data (set up by default in DiagnosticMonitorConfiguration.Directories). How then can you ever get a crash dump back (related to this question)?
To me it would be logical if diagnostics were also transferred when a role terminates, but this is not my experience.
It depends on what you mean by 'role stops'. The Diagnostic Monitor in SDK 1.3 and later is implemented as a background task that has no dependency on the RoleEntryPoint. So, if you mean your RoleEntryPoint is reporting itself as unhealthy or something like that, then your DiagnosticMonitor (DM) will still be responsive and will send data according to the configuration you have setup.
However, if you mean that a role stop is a scale down operation (shutting down the VM), then no, there is no flush of the data on disk. At that point, the VM is shutdown and the DM with it. Anything not already flushed (transferred) can be considered lost.
If you are only rebooting the VM, then in theory you will be connected back to the same resource VHDs that hold the buffered diagnostics data so you would not lose it, it would be transferred on next request. I am pretty sure that sticky storage is enabled on it, so it won't be cleaned on reboot.
HTH.
The diagnostic data is stored locally before it is transferred to storage. So that information is available to you there; you can review/verify this by using RDP to check it out.
I honestly have not tested whether it gets transferred after the role stops. However, you can request transfers on demand. Using that approach, you could request the logs/dumps to be transferred one more time after the role has stopped.
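A hedged sketch of requesting such an on-demand transfer from outside the role; the storage connection string, deployment ID, and role name are placeholders:

```csharp
// Sketch: push whatever diagnostic data is still buffered on disk to storage.
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.Diagnostics.Management;

class RequestOnDemandTransfer
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=..."); // placeholder
        var deployment = new DeploymentDiagnosticManager(account, "yourDeploymentId");

        foreach (var instance in deployment.GetRoleInstanceDiagnosticManagersForRole("WorkerRole1"))
        {
            // Transfer everything configured under Directories (this includes crash dumps).
            instance.BeginOnDemandTransfer(DataBufferName.Directories);
        }
    }
}
```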
I would suggest checking out a tool like Cerebrata Azure Diagnostics Manager to request on demand transfer of your logs, and also analyze the data.
I answered your other question as well. Part of my answer was to add the event that would allow you to change your logging and transfer settings on the fly.
Hope this helps
I think it works like this: local diagnostic data is stored in the local storage named "DiagnosticStore", which I guess has cleanOnRoleRecycle set to false. (I don't know how to verify this last bit - LocalResource has no corresponding attribute.) When the role is recycled that data remains in place and will eventually be uploaded by the new diagnostic monitor (assuming the role doesn't keep crashing before it can finish).
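If you want to poke at that store yourself (e.g. over RDP, as suggested above), a small sketch for locating it from inside the role:

```csharp
// Sketch: print where the buffered diagnostic data lives on the local disk.
// Must run inside the role, since it queries the role environment.
using System;
using Microsoft.WindowsAzure.ServiceRuntime;

class FindDiagnosticStore
{
    static void Main()
    {
        LocalResource store = RoleEnvironment.GetLocalResource("DiagnosticStore");
        Console.WriteLine("Buffered diagnostics live under: " + store.RootPath);
    }
}
```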

WF4 Affinity on Windows Azure and other NLB environments

I'm using Windows Azure and WF4, and my workflow service is hosted in a web role (with N instances). My job now is to figure out how to do affinity, so that I can send messages to the right workflow instance. To explain the scenario: my workflow (attached) starts with a "StartWorkflow" receive activity, creates 3 "Person" objects and, in a parallel-for-each, waits for the confirmation of these 3 people (a "ConfirmCreation" receive activity).
I then looked into how affinity is done in other NLB environments (mainly searching for information on how this works in Windows Server AppFabric), but I didn't find a precise answer. So how is it done in other NLB environments?
My next task is to find out how I could implement a system to handle this affinity on Windows Azure, and how much that solution would cost (in price, time and amount of work), to see if it's viable or whether it's better to run only one web-role instance while we wait for the WF4 host for Azure AppFabric. The only way I found was to persist the workflow instance. Are there other ways of doing this?
My third, but not last, task is to find out how WF4 handles multiple messages received at the same time. In my scenario, that means: what happens if the 3 people confirm at the same time and the confirmation messages also arrive at the same time? Since the most logical answer seems to be a queue, I started looking for information about queues in WF4 and found people talking about MSMQ. But what is WF4's native message-handling mechanism? Is it really a queue, or another system? How is this concurrency handled?
You shouldn't need any affinity. In fact, that's kinda the whole point of durable workflows: whilst your workflow is waiting for the confirmations, it should be persisted and unloaded from any one server.
As far as persistence goes for Windows Azure you would either need to hack the standard SQL persistence scripts so that they work on SQL Azure or write your own InstanceStore implementation that sits on top of Azure Storage. We have done the latter for a workflow we're running in Azure, but I'm unable to share the code. On a scale of 1 to 10 for effort, I'd rank it around an 8.
As far as multiple messages go, they will be received and delivered to the workflow instance one at a time. It's possible that every one of those messages goes to the same server, or that each one goes to a different server. Either way, the workflow runtime will attempt to load the workflow from the instance store, see that it is currently locked, and block/retry until the workflow becomes available to process the next message. So you don't have to worry about concurrent access to the same workflow instance, as long as you configure everything correctly and the InstanceStore implementation is doing its job.
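For illustration, a hedged sketch of wiring an instance store into a self-hosted WorkflowApplication; the connection string is a placeholder and the empty Sequence stands in for your real workflow. A web-role-hosted service would configure the equivalent in web.config (see the config snippet below):

```csharp
// Sketch: attach the SQL instance store so the workflow can persist and unload
// while waiting, then resume on whichever server receives the next message.
using System.Activities;
using System.Activities.DurableInstancing;
using System.Activities.Statements;

class Host
{
    static void Main()
    {
        // Sequence() is a stand-in for your actual workflow definition.
        var app = new WorkflowApplication(new Sequence());
        app.InstanceStore = new SqlWorkflowInstanceStore(
            "Server=tcp:yourserver.database.windows.net;Database=WFPersistence;..."); // placeholder

        // Unload as soon as the workflow goes idle so no server holds affinity.
        app.PersistableIdle = e => PersistableIdleAction.Unload;
        app.Run();
    }
}
```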
Here are a few other suggestions:
Make sure you use the PersistBeforeSend option on your SendReply activities
Configure the following workflow service options
<workflowIdle timeToUnload="00:00:00" />
<sqlWorkflowInstanceStore ... instanceLockedExceptionAction="AggressiveRetry" />
Using the out-of-the-box SQL instance store with SQL Azure is a bit of a problem at the moment with the Azure 1.3 SDK: each deployment, even one with zero code changes, results in a new service deployment, meaning that already-persisted workflows can't continue. That is a bug that will be solved, but a PITA for now.
As Drew said, your workflow instance should just move from server to server as needed; no need to pin it to a specific machine. And even if you could, that would hurt scalability and reliability, so it's something to avoid.
Sending messages through MSMQ using the WCF NetMsmqBinding works just fine. Internally, WF uses a completely different mechanism called bookmarks that allows a workflow to stop and resume. Each Receive activity, as well as others like Delay, creates a bookmark and waits for it to be resumed. You can only resume existing bookmarks. Even resuming a bookmark is not a direct action: the resumption is put into an internal queue (not MSMQ) by the workflow scheduler and executed through a SynchronizationContext. You get no control over the scheduler, but you can replace the SynchronizationContext when using the WorkflowApplication and so get some control over how and where activities are executed.
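A minimal sketch of that bookmark mechanism, using a hypothetical custom activity hosted in a WorkflowApplication (the activity and bookmark names are illustrative):

```csharp
// Sketch: an activity that parks on a bookmark, and a host that resumes it.
using System;
using System.Activities;
using System.Threading;

class WaitForConfirmation : NativeActivity<string>
{
    // Allows the workflow to go idle (and persist) while the bookmark is pending.
    protected override bool CanInduceIdle { get { return true; } }

    protected override void Execute(NativeActivityContext context)
    {
        context.CreateBookmark("ConfirmCreation", OnResumed);
    }

    void OnResumed(NativeActivityContext context, Bookmark bookmark, object value)
    {
        Result.Set(context, (string)value);
    }
}

class Program
{
    static void Main()
    {
        var idle = new AutoResetEvent(false);
        var done = new AutoResetEvent(false);

        var app = new WorkflowApplication(new WaitForConfirmation());
        app.Idle = e => idle.Set();
        app.Completed = e => done.Set();
        app.Run();

        idle.WaitOne(); // wait until the workflow is parked on the bookmark

        // Resumption goes through the scheduler's internal queue, not MSMQ.
        app.ResumeBookmark("ConfirmCreation", "person-1 confirmed");
        done.WaitOne();
    }
}
```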
