Utilities to find Gen1, Gen 2 objects promoted in BizTalk Orchestration? - garbage-collection

I am trying to profile a memory-guzzling BizTalk orchestration. This orchestration uses plenty of maps with a few custom scripting functoids.
I want to identify the lifetime of the custom objects created by the orchestration and how they are promoted during garbage collection. I am not able to use the CLR Profiler for this.
I'm looking for pointers on identifying the objects that are getting promoted during garbage collection.

Hey Pari, here are a couple of things to try. First, I have personally used ANTS Profiler on BizTalk processes before; here's a link:
http://www.red-gate.com/supportcenter/Content.aspx?p=ANTS%20Profiler&c=knowledgebase%5CANTS%5FProfiler%5CKB200801000222.htm
Second, you can also do some spelunking with Windows Perfmon. It won't be granular down to the object level, but it will tell you if you are having trouble with promotions, the large object heap, etc.
You should be able to get this information using Perfmon in Windows. The basic ones you are probably after are in the .NET CLR Memory object. Unfortunately you will have to add them for all of the BizTalk processes, as the naming convention does not allow you to see the host name here. You will see them listed here, and in the log, as BTSNTSvc, BTSNTSvc#1, BTSNTSvc#2, etc. It will take a little extra work to identify which one is the process you are interested in.
The counter that will allow you to identify the correct process is the Process\ID Process counter, again added for each BizTalk process. This will allow you to connect the PID of the process when it starts to the PID in the Perfmon logs later.
The last step is to create a new host, if you haven't already, and isolate the orchestration to it. That way it is the only thing running in that BizTalk process. After that you can open Windows Task Manager and view all of the BTSNTSvc.exe processes that are running. Start with the new host off, check the PIDs listed in Task Manager, and then turn the new host on. The new PID is the one assigned to the host you just turned on. Record the PID that the new process was started with and use it to identify which process in the Perfmon logs you are interested in. Unfortunately you will have to repeat this step every time you want to take measurements.
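If you prefer scripting the collection, here is a minimal PowerShell sketch of the same idea; the counter paths are standard Process and .NET CLR Memory counters, but the BTSNTSvc instance name (BTSNTSvc#1 below) is just an example and may differ on your box:

# List every BTSNTSvc instance with its PID so you can match the instance
# name (BTSNTSvc, BTSNTSvc#1, ...) to the host instance you just started.
(Get-Counter '\Process(BTSNTSvc*)\ID Process').CounterSamples |
    Select-Object InstanceName, @{Name='PID';Expression={[int]$_.CookedValue}}

# Once you know the right instance name, pull the promotion-related counters.
Get-Counter -Counter '\.NET CLR Memory(BTSNTSvc#1)\Promoted Memory from Gen 0',
                     '\.NET CLR Memory(BTSNTSvc#1)\Promoted Memory from Gen 1',
                     '\.NET CLR Memory(BTSNTSvc#1)\Large Object Heap size'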
One last thing to mention, when you turn your Perfmon log on it will only record the hosts that are on at the time. So you will want to turn it on after you turn the host with the orchestration on.
You may also want to check out the counters in the BizTalk:Message Agent object, as there are some good memory counters in there for BizTalk with the actual host names associated with them. It's not as granular as what you are looking for, though.
I have also heard of an orchestration profiler, but I have never used it. You might give it a shot:
http://www.codeplex.com/BiztalkOrcProfiler

Related

How to use PowerShell to kill threads of a specific process ID

Well, this has been bugging me for a couple of days on and off. I am at a client's site where they have a number of bespoke, written-in-house services running on a Windows 2008 R2 IIS server. The problem is that a couple of these services keep hanging: they get stuck in a “Stopping” state and the only way to kill them off is to open Process Explorer and kill the threads. Before anyone says anything about using ‘runas’, logging on as the local admin, the service owner, etc., we’ve been through all of that.
The problem lies with the executable itself. The development team, in another country, is going to look at this, but it will take 4-5 months minimum, and we're not certain they'll get it right even then.
I have a PowerShell script that checks the services on a regular basis, ensures they are running and, if not, forces a stop and restart of the service, then sends an email to confirm the actions. However, with the specific services mentioned it can do nothing. They can't be killed with Task Manager, taskkill, or Process Explorer (unless one kills the threads); it just says access denied. It is possible to change the permissions in Process Explorer and kill the process, but that's a lengthier procedure than killing the threads.
To make things a little more difficult, I can't use the process name, as on this server there are two other websites using an exe with the same name, just in a different folder.
What I'm after is a way to find and kill the threads of a process ID, which I've already obtained via the script I have, so the rest of the script can complete the task of restarting the service. At the moment this service dies at inconsistent times throughout the day and night, and the support guys have to RDP onto the server, open Process Explorer, find the offending process, kill the threads off and then restart the services. That is a bit too much hassle for these already overworked guys, especially if we can get PowerShell to do it automatically.
Hope someone can help on this. Thanks in advance.
Low-level thread handling is likely to require native Win32 API usage. PowerShell might help with P/Invoke, but the process is going to be complex. For starters, find out if the following tools can be used to identify the stuck thread. Maybe you can combine this info with some Sysinternals tools like handle.exe to find out what really blocks the thread.
The .NET Framework has some tools available via the System.Diagnostics.Process class. A list of threads for a named process is available like so:
$ps = [System.Diagnostics.Process]::GetProcessesByName("iexplore")
$p = $ps[0]
$p.Threads[0]
Full documentation is on MSDN. There is no method for killing a thread, but this should be a starting point for identifying the stuck one.
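Since you already have the process ID from your script, you can also go straight to the process by ID rather than by name; a small sketch (the PID below is just a placeholder):

$p = [System.Diagnostics.Process]::GetProcessById(1234)   # substitute the PID from your script
# Dump each thread with its state and how much CPU it has consumed
$p.Threads | Select-Object Id, ThreadState, TotalProcessorTime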
Another way is to use WMI to get Win32_Thread data, like so:
$threads = gwmi win32_thread
The output is quite different and some filtering is needed; some examples are available. Another WMI-based approach might be to use Win32_Process, which has a Terminate method.
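As a rough sketch of that approach, assuming you already have the PID (Win32_Thread stores the owning process ID as a string in its ProcessHandle property); note that Terminate() kills the whole process rather than individual threads:

$targetPid = 1234   # the PID your script has already obtained

# Threads belonging to that process, with their state and wait reason
Get-WmiObject Win32_Thread -Filter "ProcessHandle = '$targetPid'" |
    Select-Object Handle, ThreadState, ThreadWaitReason

# As a blunt fallback, ask WMI to terminate the owning process
(Get-WmiObject Win32_Process -Filter "ProcessId = $targetPid").Terminate()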

Determining cause of CPU spike in Azure

I am relatively new to Azure. I have a website that has been running for a couple of months with not much traffic: when users are on the system, the various dashboard monitors go up, and they flatline the rest of the time. This week, the CPU time went way up when there were no requests and no data going in or out of the site. Is there a way to determine the cause of this CPU activity when the site is not active? It doesn't make sense to me that I should have CPU activity attributed to my site when there is no site activity.
If your website does significant processing at application start, it is possible your VM got rebooted or your app pool recycled, and your start-up handler got executed again (which would cause CPU to spike without any requests).
You can analyze this by adding application logging to your Application_Start event (after initializing tracing). There is another comment detailing how to enable logging, but you can also consult this link.
You need to collect data to understand what's going on, so the first things I would suggest are:
1. Go to the Azure management portal -> your website (assuming you are using Azure Websites) -> dashboard -> operation logs. Try to see whether there is any suspicious activity going on.
2. Download the logs for your site using any FTP client and analyze what's happening (see the sketch below). If there is not much data, I would suggest adding more logging to your application to see what is happening or which module is spinning.
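As a rough sketch of step 2 in PowerShell: the FTP host name, deployment credentials, and file path below are placeholders; the real values are shown on your site's dashboard and in the published profile.

$client = New-Object System.Net.WebClient
$client.Credentials = New-Object System.Net.NetworkCredential('mysite\$mysite', '<deployment password>')
# Pull one log file down for inspection; repeat or loop for the rest of /LogFiles
$client.DownloadFile('ftp://waws-prod-xx-001.ftp.azurewebsites.windows.net/LogFiles/eventlog.xml', 'C:\temp\eventlog.xml')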
A great way to detect CPU spikes, and even determine slow-running areas of your application, is to use a profiler like New Relic. It's a free add-on for Azure that collects data and provides you with a dashboard. You might find it useful for determining the exact cause of the CPU spike.
We regularly use it to monitor the performance of our applications. I would recommend it.

Azure Development - How to stop a Web Role instance

I need to test how my code will handle the failure of a web role instance in a development environment.
How do I terminate one of the instances? I can't see any option in the UI for this. Seems like a strange omission.
Update
The issue relates to a distributed cache layer (I know that Azure offers its own).
I want to be able to test how the system reacts to a missing or additional node, etc.
Perhaps my real question is:
how up to date is RoleEnvironment.CurrentRoleInstance.Role.Instances?
The need to simulate ungraceful exits in the dev emulator usually arises because you are doing something in your web role that is stateful or long-running. That is generally discouraged, but sometimes it is unavoidable.
I suspect the best way to simulate a failure is to kill processes. If you open Task Manager (or better, Process Explorer), you will see "WatDebugger" hosting either "WaIISHost" or "WaWorkerHost". If you kill this process, I think it will simulate a failure.
Honestly, however, it is easier to test this in the cloud. You can RDP into one of the instances and kill the 'WaAppAgent' process. That will kill your RoleEntryPoint and the fabric controller agent, which is a true ungraceful failure.
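If you want to script that rather than hunt through Task Manager, a one-liner along these lines should do it (using the process names mentioned above; run it on the machine hosting the role):

# In the cloud instance: kill the guest agent to force an ungraceful failure
Stop-Process -Name WaAppAgent -Force

# In the local compute emulator: kill the role host process instead
Stop-Process -Name WaIISHost -Force    # or WaWorkerHost for a worker role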
By failure, do you mean becoming unavailable? It should be seamless because the next request would simply be handled by one of the other instances. As long as there is one instance available Azure will route calls to that instance.
This is the nature of a highly available system: requests are handled by the available instances. This is why you have multiple instances in the first place, to handle requests in the case of failure in one or more instances.
This is also why you need to always be watchful of how your application handles state. State needs to be maintained outside of the instance, either in queues or in a database. This ensures that any process can pick up a piece of work and execute against it.
There is another question dealing with Session State that should help: How does Microsoft Azure handle Session State?
By terminating an instance, do you mean reducing the instance count and seeing which one gets killed? I like Ryan's view about ungraceful exits, but if it's a forced kill by the fabric it'll be a different ball game.

WF4 Affinity on Windows Azure and other NLB environments

I'm using Windows Azure and WF4, and my workflow service is hosted in a web role (with N instances). My job now is to find out how to implement affinity, so that I can send messages to the right workflow instance. To explain this scenario: my workflow (attached) starts with a "StartWorkflow" Receive activity, creates 3 "Person" objects and, in a parallel for-each, waits for the confirmation of these 3 people (the "ConfirmCreation" Receive activity).
I then started to research how affinity is handled in other NLB environments (mainly looking for information about how this works on Windows Server AppFabric), but I didn't find a precise answer. So how is it done in other NLB environments?
My next task is to find out how I could implement a system to handle this affinity on Windows Azure, and how much this solution would cost (in price, time and amount of work), to see if it's viable or if it's better to work with only one web role instance while we wait for the WF4 host for Azure AppFabric. The only way I found was to persist the workflow instance. Are there other ways of doing this?
My third, but not last, task is to find out how WF4 handles multiple messages received at the same time. In my scenario, this means how it would behave if the 3 people confirmed at the same time and the confirmation messages were also received at the same time. Since the most logical answer to this problem seems to be a queue, I started looking for information about queues in WF4 and found people talking about MSMQ. But what is the native WF4 message-handling mechanism? Is this handler really a queue or is it another system? How is this concurrency handled?
You shouldn't need any affinity. In fact, that's kind of the whole point of durable workflows. While your workflow is waiting for this confirmation, it should be persisted and unloaded from any one server.
As far as persistence goes for Windows Azure you would either need to hack the standard SQL persistence scripts so that they work on SQL Azure or write your own InstanceStore implementation that sits on top of Azure Storage. We have done the latter for a workflow we're running in Azure, but I'm unable to share the code. On a scale of 1 to 10 for effort, I'd rank it around an 8.
As far as multiple messages go, what will happen is that the messages will be received and delivered to the workflow instance one message at a time. Now, it's possible that every one of those messages goes to the same server, or maybe each one goes to a different server. No matter how it happens, the workflow runtime will attempt to load the workflow from the instance store, see that it is currently locked, and block/retry until the workflow becomes available to process the next message. So you don't have to worry about concurrent access to the same workflow instance as long as you configure everything correctly and the InstanceStore implementation is doing its job.
Here's a few other suggestions:
Make sure you use the PersistBeforeSend option on your SendReply activities
Configure the following workflow service options
<workflowIdle timeToUnload="00:00:00" />
<sqlWorkflowInstanceStore ... instanceLockedExceptionAction="AggressiveRetry" />
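For orientation, here is a minimal sketch of where those elements sit in the service's web.config; the connection string and the instanceCompletionAction value are placeholders you'd adapt to your own setup:

<system.serviceModel>
  <behaviors>
    <serviceBehaviors>
      <behavior>
        <sqlWorkflowInstanceStore
            connectionString="Server=tcp:yourserver.database.windows.net;Database=WFInstanceStore;..."
            instanceLockedExceptionAction="AggressiveRetry"
            instanceCompletionAction="DeleteAll" />
        <workflowIdle timeToUnload="00:00:00" />
      </behavior>
    </serviceBehaviors>
  </behaviors>
</system.serviceModel>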
Using the out-of-the-box SQL instance store with SQL Azure is a bit of a problem at the moment with the Azure 1.3 SDK, as each deployment, even if you made zero code changes, results in a new service deployment, meaning that already-persisted workflows can't continue. That is a bug that will be solved, but it's a PITA for now.
As Drew said your workflow instance should just move from server to server as needed, no need to pin it to a specific machine. And even if you could that would hurt scalability and reliability so something to be avoided.
Sending messages through MSMQ using the WCF NetMsmqBinding works just fine. Internally WF uses a completely different mechanism called bookmarks that allow a workflow to stop and resume. Each Receive activity, as well as others like Delay, will create a bookmark and wait for that to be resumed. You can only resume existing bookmarks. Even resuming a bookmark is not a direct action but put into an internal queue, not MSMQ, by the workflow scheduler and executed through a SynchronizationContext. You get no control over the scheduler but you can replace the SynchronizationContext when using the WorkflowApplication and so get some control over how and where activities are executed.

Watchdog Windows service to watch another Windows service

I want to make a Windows service that monitors another Windows service and makes sure that it is working.
Sometimes the Windows service that I want to watch stays in memory (it appears in Task Manager, so it is considered a running service), but the fact is that it is doing nothing; it is dead, and its timer is not firing for some reason, which is not the subject of this question.
What I need is to make a watchdog Windows service that somehow reads a value in memory that the watched service is periodically writing.
I thought about using named pipes, but I don't want to add communication issues to my services. I want to know if there is a way to create such shared memory between two applications (possibly using a named, system-wide mutex?).
Since you have to deal with detecting a zombie service, I don't think using a kernel object like a mutex will help; you need to detect activity. A semaphore isn't a good fit either.
My personal preference would be a named pipe sending small heartbeat messages (since that could be detected across a network as well), but if you want to avoid the complexity of pipe comms - which I guess is understandable - then you could update a DWORD in a predetermined registry key. If both services run under LocalSystem you could write a key/value into HKEY_LOCAL_MACHINE. Run a pump-up timer and watch for changes to the key every so often (watch out for counter wrap-around). You won't have a normal window/message pump so SetTimer is off-limits, but you can still use timeSetEvent or waitable timers.
HKLM won't be available if one of the services runs under a non-admin account, but that's a pretty rare situation for services. Of course all this assumes you have access to the code of both services. Watching a third-party service would severely limit your options.
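Here is a rough sketch of that registry heartbeat, in PowerShell for brevity; the key path and value name are made up, and in the real services you would do the equivalent with the registry APIs of whatever language they are written in:

$key = 'HKLM:\SOFTWARE\MyCompany\ServiceHeartbeat'   # hypothetical shared key

# Watched service side: bump the value on every timer tick
if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
Set-ItemProperty -Path $key -Name 'Tick' -Value ([int](Get-Date).TimeOfDay.TotalSeconds) -Type DWord

# Watchdog side: poll, and flag the service as hung if the value stops changing
$before = (Get-ItemProperty -Path $key -Name 'Tick').Tick
Start-Sleep -Seconds 60
$after = (Get-ItemProperty -Path $key -Name 'Tick').Tick
if ($after -eq $before) { Write-Warning 'Watched service appears to be hung' }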
