Silverlight Sculpture generated application locking up on service calls (on some machines) - multithreading

We have an application generated using the Sculpture software package. That means the project is roughly equivalent to the code in a Prism application.
Part of their model is that all WCF service calls are performed synchronously, but on background threads (under the covers they are still async calls, but the Sculpture-generated background-thread methods wait for the response before executing any following code).
When we deployed the application, we found that around 50% of the machines tested would not get past the first service call. We cannot see any pattern in the machines that fail: both the Debug and Release Silverlight runtimes, and Windows 7, appear on machines that work as well as on machines that fail. It fails the same way in different browsers, so it is machine-specific. The only clue is that they all seem to be older PCs.
Ideas anyone?

Found the cause. There is a schoolboy error in their generated service calls.
What's wrong with this picture?:
while (true == userState.IsBusy)
{}
Ignoring the old-school true == comparison (not needed in C#), their while loop spins so tightly that on some machines the IsBusy flag is never cleared. It also means the application runs at 100% processor usage whenever a service call is made.
We have fixed the problem by adding a Thread.Sleep(100) to all of the service-call while loops, e.g.:
while (userState.IsBusy)
{
Thread.Sleep(100);
}
Our app is now working on all Silverlight-capable machines (as it should) and is using a lot less processor to boot.
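If you are patching similar generated code yourself, it may also be worth bounding the wait; a rough sketch (the 100 ms sleep and 30-second cap are arbitrary choices of ours, not part of the Sculpture-generated code):

// Sketch: the same wait loop, but bounded so a lost completion cannot hang the call forever.
int waitedMs = 0;
while (userState.IsBusy)
{
    Thread.Sleep(100);   // yield so the async completion callback can clear IsBusy
    waitedMs += 100;
    if (waitedMs > 30000)
        throw new TimeoutException("Service call did not complete within 30 seconds.");
}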
To be fair, we are not using the very latest release of Sculpture, but it was quite surprising to see such a silly mistake in a commercial package.

Related

Any limitations creating processes under Azure Web Sites (specifically Web Jobs)?

Are there any limitations on creating separate processes from an Azure Web Site (specifically, from a continuous Web Job)? I have an executable that often (about 20% of the time) stalls and eventually fails with exit code -1073741819 (0xC0000005, an access violation), but only when run as a separate process. If this work is retried later, it eventually succeeds (usually on the first retry).
When instead I call this logic directly via a .NET method call (so within the same process and app domain), the code succeeds 100% of the time. The same code also always succeeds when run locally, even when it creates a separate process.
Is there anything going on at the Azure Web Sites/Web Jobs level that I should be aware of, such as using Windows job objects or other security mechanisms to limit the creation or runtime of spawned processes? If not, any suggestions on how to diagnose further what might be going wrong? (I believe remote desktop to a Web Site isn't possible; is there anything else that would help "see" what's failing, such as whether a WER dialog is appearing?)
In case it matters, the logic (in both cases) includes P/Invoking custom native code, and the web site I'm using is Always On, x64, Basic pricing tier.
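For reference, the spawn looks roughly like the sketch below (the executable name and arguments are placeholders, not the real ones); redirecting stdout and logging the exit code at least makes the 0xC0000005 failures visible in the WebJob log:

using System;
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "helper.exe",               // placeholder for the real executable
    Arguments = "--input work-item.json",  // placeholder arguments
    UseShellExecute = false,
    RedirectStandardOutput = true,
    CreateNoWindow = true
};

using (var process = Process.Start(psi))
{
    string output = process.StandardOutput.ReadToEnd();
    process.WaitForExit();
    Console.WriteLine("Child process exited with 0x{0:X8}", process.ExitCode);
    Console.WriteLine(output);
}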
@David Ebbo, thanks for the suggestion. I used it to help isolate the problem, and I ultimately found this was non-determinism in the code, made more likely by the Azure Web Sites environment but not 100% restricted to that context.

COM Runtime Breakdown in Multithreaded Server Application

We are experiencing intermittent catastrophic failures of the COM runtime in a large server application.
Here's what we have:
A server process running as a Windows service hosts numerous free-threaded COM components written in C++/ATL. Multiple client processes written in C++/MFC and .NET use these components via cross-process COM calls (including .NET interop) on the same machine. The OS is Windows Server 2008 Terminal Server (32-bit).
The entire software suite was developed in-house, and we have the source code for all components. A tracing toolkit writes out errors and exceptions generated during operation.
What is happening:
After some random period of smooth sailing (5 days to 3 weeks) the server's COM runtime appears to fall apart with any combination of these symptoms:
RPC_E_INVALID_HEADER (0x80010111) - "OLE received a packet with an invalid header" returned to the caller on cross-process calls to server component methods
Calls to CoCreateInstance (CCI) fail for the CLSCTX_LOCAL_SERVER context
CoInitializeEx(COINIT_MULTITHREADED) calls fail with CO_E_INIT_TLS (0x80004006)
All in-process COM activity continues to run, CCI works for CLSCTX_INPROC_SERVER.
The overall system remains responsive, SQL Server works, no signs of problems outside of our service process.
System resources are OK, no memory leaks, no abnormal CPU usage, no thrashing
The only remedy is to restart the broken service.
Other (related) observations:
The number of cores on the CPU has an adverse effect - a six-core Xeon box fails after roughly 5 days, while smaller boxes take 3 weeks or longer.
.NET interop might be involved, as running a lot of calls across interop from .NET clients to unmanaged COM server components also adversely affects the system.
Switching on the tracing code inside the server process prolongs the working time to the next failure.
Tracing does introduce some partial synchronization and thus can hide multithreaded race condition effects. On the other hand, running on more cores with hyperthreading runs more threads in parallel and increases the failure rate.
Has anybody experienced similar behaviour, or even actually come across the RPC_E_INVALID_HEADER HRESULT? There is virtually no useful information to be found on that specific error and its potential causes.
Are there ways to peek inside the COM Runtime to obtain more useful information about COM's private resource pool usage like memory, handles, synchronization primitives? Can a process' TLS slot status be monitored (CO_E_INIT_TLS)?
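For what it's worth, a .NET interop client can at least log the raw HRESULT whenever activation fails, which helps time-stamp the onset of the breakdown; a minimal sketch (the ProgID below is a placeholder, not one of our actual components):

using System;
using System.Runtime.InteropServices;

// Sketch: log the raw HRESULT when out-of-process activation fails, so the first
// RPC_E_INVALID_HEADER (0x80010111) or CO_E_INIT_TLS (0x80004006) gets a timestamp.
try
{
    Type comType = Type.GetTypeFromProgID("MyServer.Component", true); // placeholder ProgID
    object instance = Activator.CreateInstance(comType);
}
catch (COMException ex)
{
    Console.WriteLine("{0:o} activation failed, HRESULT=0x{1:X8}", DateTime.UtcNow, ex.ErrorCode);
    throw;
}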
We are confident that we have pinned down the cause of this defect to a resource leak in the .NET Framework 4.0.
Installations of our server application running on .NET 4.0 (clr.dll: 4.0.30319.1) show the intermittent COM runtime breakdown, and are easily fixed by updating the .NET Framework to version 4.5.1 (clr.dll: 4.0.30319.18444).
Here's how we identified the cause:
Searches on the web turned up an entry in an MSDN forum: http://social.msdn.microsoft.com/Forums/pt-BR/f928f3cc-8a06-48be-9ed6-e3772bcc32e8/windows-7-x64-com-server-ole32dll-threads-are-not-cleaned-up-after-they-end-causing-com-client?forum=vcmfcatl
The OP there described receiving the HRESULT RPC_X_BAD_STUB_DATA (0x800706f7) from CoCreateInstanceEx(CLSCTX_LOCAL_SERVER) after running a COM server with an interop app for some length of time (a month or so). He tracked the issue down to a thread resource leak that was observable indirectly via an incrementing variable inside ole32.dll, EventPoolEntry::s_initState, which causes CCI to fail once its value reaches 0xbfff...
An inspection of EventPoolEntry::s_initState in our faulty installations revealed that its value started out at approx. 0x8000 after a restart and then constantly gained between 100 and 200+ per hour with the app running under normal load. As soon as s_initState hit 0xbfff, the app failed with all the symptoms described in our original question. The OP in the MSDN forum suspected a COM thread-local resource leak as he observed asymmetrical calls to thread initialization and thread cleanup - 5 x init vs. 3 x cleanup.
By automatically tracing the value of s_initState over the course of several days we were able to demonstrate that updating the .NET framework to 4.5.1 from the original 4.0 completely eliminates the leak.

After enabling IIS 7.5 autostart, first request is still slow

I've set startMode="AlwaysRunning" attribute on my application pool and serviceAutoStartEnabled="true" attribute on my application in IIS configuration. I've even set up serviceAutoStartProvider and can see that "warm up" code is being executed. I also can see that w3wp process auto-starts after iisreset. Still, the first request to my ASP.NET MVC application is exactly as slow as without auto-start. Is there any point I'm missing or any way to easily debug this without a profiler?
Is this feature expected to affect first request performance at all? What is actually the bulk of work to do on the first request, given that the worker process is ready, .NET appdomain and even all .NET assemblies have been loaded?
I've been looking into this recently.
As far as I can tell, the autoStart feature causes your IIS worker process (by default, just the one for the pool) to load and JIT-compile before the first request.
However, what gets compiled appears to be just the bulk of the assemblies and dependencies, not necessarily your own methods.
When that first request happens and the methods you've written are called for the first time, the JITter performs a final compile of those methods that have not yet been compiled.
The benefit of autoStart appears to be that it lets .NET do 90% of the work up front, but the last 10% is still paid when the first request arrives and the methods that had not yet been accessed run for the first time.
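If you want to pay for that last 10% during warm-up rather than on the first request, one option is to have your serviceAutoStartProvider pre-JIT the methods of your own assembly. A rough sketch, assuming a hypothetical application assembly named "MyMvcApp" (this won't cover things like Razor view compilation, but it moves most of the remaining method-level JIT cost out of the first request):

using System;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Web.Hosting;

// Sketch of a preload client that pre-JITs the application's own methods.
public class PreJitPreloadClient : IProcessHostPreloadClient
{
    public void Preload(string[] parameters)
    {
        Assembly appAssembly = Assembly.Load("MyMvcApp"); // placeholder assembly name
        foreach (Type type in appAssembly.GetTypes())
        {
            foreach (MethodInfo method in type.GetMethods(
                BindingFlags.DeclaredOnly | BindingFlags.Public | BindingFlags.NonPublic |
                BindingFlags.Instance | BindingFlags.Static))
            {
                // Generic and abstract methods cannot be prepared directly; skip them.
                if (method.IsAbstract || method.ContainsGenericParameters)
                    continue;
                try { RuntimeHelpers.PrepareMethod(method.MethodHandle); }
                catch (Exception) { /* best effort - some methods still cannot be pre-JITted */ }
            }
        }
    }
}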

Creating objects suddenly begins failing after they have been loaded in memory successfully

Behavior:
Application is loaded and being used as expected.
Suddenly, a particular DLL can no longer be loaded. The error message is:
ActiveX component cannot create object.
In each case, the object had been created successfully many times before failure. All objects are marked for "retain in memory".
This error is cleared when the application pool is recycled. It may be hours or months before it is seen again.
Issue has happened within two hours of a refresh, as well as never happened in months of uptime.
Issue has happened with hundreds of simultaneous users (heavy usage) and also with 1-3 users.
While the issue is occurring, the process running that application pool cannot create the object that is failing. However it can create any other objects. Memory, CPU, and other resources all remain at normal usage. In addition, other processes (such as a stand-alone exe) can successfully create the object.
The first instance of the issue appeared in mid 2008. There have been less than fifty instances since then, despite a pool of hundreds of servers for it to occur on. All instances except one have failed on the same DLL.
DLL Failure Info:
Most common - a generic data structure implementing a B-tree; it has no references other than to its interface. The code consists of arrays and one use of the VB6 Event functionality. The object has not been changed in any way since 2005.
One-time - interop to a .NET module. The failure occurs when trying to create the interop object, not the .NET object. This object is updated a few times each year.
Application Environment:
IIS hosted application
VB6, classic ASP, some interop to minor .NET components
Windows Server 2003 / Windows Server 2008 (both have independently had the problem)
Attempts to Reproduce:
Using scripts (and real-life humans) to run the same end-user workflows that our logs reported the days before the issue occurred.
Using scripts to create/destroy suspected objects as fast as possible from multiple simultaneous sessions.
Wild speculation.
No intentional success, but it does manifest randomly on the servers on its own.
Troubleshooting:
Code reviews
Test harnesses to investigate upper limits of object creation / destruction
Verification of ability to create object outside of the process experiencing the issue
Monitoring of resources over time on servers under load
Review of IIS, error, and event logs to determine events leading up to issue
Questions:
Any ideas on how to reproduce the issue?
What could cause this behavior?
Ideas for bypassing the first two questions in favor of a fast solution?
The DLL isn't on a network drive, is it? You can get "glitches" where the drive is momentarily unavailable, which means COM can't do what it needs, and it may then fail to notice that the drive is available again.
I used Process Monitor to debug a similar problem when accessing the ADO/OLEDB stack. It turned out the environment got corrupted at some point, and the ADO classes are registered with an InprocServer32 value of type REG_EXPAND_SZ pointing to %CommonProgramFiles%\System\ado\msado15.dll (or similar on x64 OSes).
Also, when you register an application with Restart Manager, on failure the process gets restarted by the winlogon process, whose environment is different from explorer's and unfortunately is missing %CommonProgramFiles% -- ouch!
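A quick way to check for that kind of environment corruption is to read the raw (unexpanded) InprocServer32 value for the failing CLSID and see whether it resolves in a given process' environment; a rough C# sketch (the CLSID argument is whatever component is failing - this is illustrative, not something from the affected app):

using System;
using System.IO;
using Microsoft.Win32;

static class ComRegistrationCheck
{
    // Sketch: report whether a CLSID's InprocServer32 path resolves to an existing DLL
    // in the current process' environment.
    public static void Check(string clsid)
    {
        using (RegistryKey key = Registry.ClassesRoot.OpenSubKey(@"CLSID\" + clsid + @"\InprocServer32"))
        {
            if (key == null)
            {
                Console.WriteLine("InprocServer32 key not found - registration problem.");
                return;
            }

            // Read the stored value without expanding %...% so we see exactly what is registered.
            string raw = (string)key.GetValue(null, null, RegistryValueOptions.DoNotExpandEnvironmentNames);
            string expanded = Environment.ExpandEnvironmentVariables(raw ?? string.Empty);

            Console.WriteLine("Stored path:   " + raw);
            Console.WriteLine("Expanded path: " + expanded);
            Console.WriteLine(File.Exists(expanded)
                ? "DLL path resolves in this process."
                : "DLL path does not resolve - this process may be missing an environment variable.");
        }
    }
}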
This seems like a random failure; some race condition.
Try VMware's record/replay to capture the state of the machine you run this DLL on. When the error happens you can then replay the recording and inspect the memory contents. That way you won't have to play try-and-catch with the error; at the very least you will have a solid record of it.
While I can't provide a solution, you could try catching the error and retrying the DLL load after a refresh of the environment when this happens.

How to simulate a Windows Azure VM crash in my DevAppFabric

We need to think big and our applications need to scale in order to work on the Windows Azure Platform. But how do I simulate a crash of one of the VMs running my application?
I want to see (debug) how my application behaves in such environment.
Simulating faults is simple (just call Thread.Abort()), but it won't tell you much about your design.
In particular, debugging is somewhat irrelevant, because when a VM stops working there is nothing left to observe (and nothing left to debug either). You should just assume that your app is likely to be abruptly stopped at any point in its execution.
Since you cannot realistically observe all the subtle data corruption that interrupted executions could cause, you should design your persistence layer to be resilient to such problems from the start (idempotent processes help a lot where possible).
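If you still want to exercise that resilience, you can inject abrupt termination into your own worker loop; a minimal sketch (the 1-in-100 kill probability is an arbitrary test setting, and MaybeCrash is a hypothetical helper, not an Azure API):

using System;

static class CrashInjector
{
    private static readonly Random Rng = new Random();

    // Sketch: call this between steps of a processing loop to randomly kill the process.
    public static void MaybeCrash()
    {
        if (Rng.Next(100) == 0)
        {
            // FailFast terminates immediately - no finally blocks, no graceful shutdown -
            // which is closer to an abrupt VM stop than throwing an exception would be.
            Environment.FailFast("Simulated VM crash for resilience testing.");
        }
    }
}

If your processing is idempotent, re-running the interrupted work item after a restart should leave the data correct.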
