We are experiencing intermittent catastrophic failures of the COM runtime in a large server application.
Here's what we have:
A server process running as a Windows service hosts numerous free-threaded COM components written in C++/ATL. Multiple client processes written in C++/MFC and .NET use these components via cross-process COM calls (including .NET interop) on the same machine. The OS is Windows Server 2008 Terminal Server (32-bit).
The entire software suite was developed in-house, we have the source code for all components. A tracing toolkit writes out errors and exceptions generated during operation.
What is happening:
After some random period of smooth sailing (5 days to 3 weeks) the server's COM runtime appears to fall apart with any combination of these symptoms:
RPC_E_INVALID_HEADER (0x80010111) - "OLE received a packet with an invalid header" returned to the caller on cross-process calls to server component methods
Calls to CoCreateInstance (CCI) fail for the CLSCTX_LOCAL_SERVER context
CoInitializeEx(COINIT_MULTITHREADED) calls fail with CO_E_INIT_TLS (0x80004006)
All in-process COM activity continues to run; CCI still works for CLSCTX_INPROC_SERVER.
The overall system remains responsive, SQL Server works, no signs of problems outside of our service process.
System resources are OK, no memory leaks, no abnormal CPU usage, no thrashing
The only remedy is to restart the broken service.
Other (related) observations:
The number of cores on the CPU has an adverse effect - a six core Xeon box fails after roughly 5 days, smaller boxes take 3 weeks or longer.
.NET interop might be involved, as running a lot of calls across interop from .NET clients to unmanaged COM server components also adversely affects the system.
Switching on the tracing code inside the server process prolongs the working time to the next failure.
Tracing does introduce some partial synchronization and thus can hide multithreaded race condition effects. On the other hand, running on more cores with hyperthreading runs more threads in parallel and increases the failure rate.
Has anybody experienced similar behaviour or even actually come across the RPC_E_INVALID_HEADER HRESULT? There is virtually no useful information to be found on that specific error and its potential causes.
Are there ways to peek inside the COM Runtime to obtain more useful information about COM's private resource pool usage like memory, handles, synchronization primitives? Can a process' TLS slot status be monitored (CO_E_INIT_TLS)?
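A crude probe for the TLS part, run from inside the suspect process, would be to count how many slots TlsAlloc can still hand out (a sketch using the documented kernel32 APIs; note it briefly exhausts all free slots, so it is for diagnostics only):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

class TlsSlotProbe
{
    const uint TLS_OUT_OF_INDEXES = 0xFFFFFFFF;

    [DllImport("kernel32.dll")]
    static extern uint TlsAlloc();

    [DllImport("kernel32.dll")]
    static extern bool TlsFree(uint dwTlsIndex);

    // Allocates TLS slots until TlsAlloc fails, then frees them all again.
    // A count that shrinks steadily over days would point at a TLS slot leak
    // (CO_E_INIT_TLS means COM could not allocate its per-thread slot).
    public static int CountFreeSlots()
    {
        var taken = new List<uint>();
        uint index;
        while ((index = TlsAlloc()) != TLS_OUT_OF_INDEXES)
            taken.Add(index);
        foreach (var i in taken)
            TlsFree(i);
        return taken.Count;
    }

    static void Main()
    {
        Console.WriteLine("Free TLS slots: " + CountFreeSlots());
    }
}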
We are confident that we have pinned down the cause of this defect: a resource leak in the .NET Framework 4.0.
Installations of our server application running on .NET 4.0 (clr.dll: 4.0.30319.1) show the intermittent COM runtime breakdown, and they are easily fixed by updating the .NET Framework to version 4.5.1 (clr.dll: 4.0.30319.18444).
Here's how we identified the cause:
Searches on the web turned up an entry in an MSDN forum: http://social.msdn.microsoft.com/Forums/pt-BR/f928f3cc-8a06-48be-9ed6-e3772bcc32e8/windows-7-x64-com-server-ole32dll-threads-are-not-cleaned-up-after-they-end-causing-com-client?forum=vcmfcatl
The OP there described receiving the HRESULT RPC_X_BAD_STUB_DATA (0x800706f7) from CoCreateInstanceEx(CLSCTX_LOCAL_SERVER) after running a COM server alongside an interop app for some length of time (a month or so). He tracked the issue down to a thread resource leak that is observable indirectly via an incrementing variable inside ole32.dll, EventPoolEntry::s_initState, which causes CCI to fail once its value reaches 0xbfff...
An inspection of EventPoolEntry::s_initState in our faulty installations revealed that its value started out at approx. 0x8000 after a restart and then constantly gained between 100 and 200+ per hour with the app running under normal load. As soon as s_initState hit 0xbfff, the app failed with all the symptoms described in our original question. The OP in the MSDN forum suspected a COM thread-local resource leak as he observed asymmetrical calls to thread initialization and thread cleanup - 5 x init vs. 3 x cleanup.
By automatically tracing the value of s_initState over the course of several days we were able to demonstrate that updating the .NET framework to 4.5.1 from the original 4.0 completely eliminates the leak.
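A minimal sketch of such a watchdog, assuming the address of ole32!EventPoolEntry::s_initState has already been resolved with a debugger and public symbols (the variable is not exported, and ole32.dll is subject to ASLR, so the placeholder address below has to be re-resolved per boot):

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;

class InitStateWatch
{
    const int PROCESS_VM_READ = 0x0010;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr OpenProcess(int dwDesiredAccess, bool bInheritHandle, int dwProcessId);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool ReadProcessMemory(IntPtr hProcess, IntPtr lpBaseAddress,
        byte[] lpBuffer, int nSize, out IntPtr lpNumberOfBytesRead);

    static void Main(string[] args)
    {
        int pid = int.Parse(args[0]);

        // Placeholder: resolve with "x ole32!EventPoolEntry::s_initState" in
        // windbg against the live (32-bit) process and paste the address here.
        IntPtr address = new IntPtr(0x76540000);

        IntPtr process = OpenProcess(PROCESS_VM_READ, false, pid);
        byte[] buffer = new byte[4];
        IntPtr read;
        while (true)
        {
            // Poll the counter and append it to a log; plotting the log shows
            // the steady climb toward 0xbfff described above.
            if (ReadProcessMemory(process, address, buffer, buffer.Length, out read))
            {
                uint value = BitConverter.ToUInt32(buffer, 0);
                File.AppendAllText("s_initState.log",
                    string.Format("{0:O}\t0x{1:X4}\r\n", DateTime.Now, value));
            }
            Thread.Sleep(TimeSpan.FromMinutes(5));
        }
    }
}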
Related
Goal
Determine the cause of the sporadic lock ups of our web application running on IIS.
Problem
An application we are running on IIS sporadically locks up throughout the day. When it locks up, it locks up on all workers and on all load balanced instances.
Environment and Application
The application is running on 4 different Windows Server 2016 machines, load balanced by HAProxy using a round-robin scheme. The IIS application pools this website is hosted in are configured with 4 workers each, and the hosted application is 32-bit. The IIS instances are not using a shared configuration file, but the application pools for this application are all configured the same.
This application is the only application in the IIS application pool. The application is an ASP.NET web API and is using .NET 4.6.1. The application is not creating threads of its own.
Theory
My theory is that requests are coming in that take ~5-30 minutes to complete, and every machine gets tied up servicing them, so they all look "locked up". The company rolled its own logging mechanism, and from it I can tell we do have requests taking ~5-30 minutes to complete. The team responsible for the application has cleaned up many of these, but I am still seeing ~5 minute requests in the log.
I do not have access to the machines personally, so our systems team has captured memory dumps of the application when this happens. In the dumps I generally see ~50 threads running, all of them in our code; they are spread all over the application and do not seem to be stopped on any common piece of code. When the application is running correctly, the dumps show 3-4 running threads. I have also looked at performance counters like ASP.NET\Requests Queued, but it never seems to show any queued requests. During these periods the CPU, memory, disk, and network usage all look normal. In windbg, none of the threads show high CPU time other than the finalizer thread, which as far as I know should live for the entire lifetime of the process.
Conclusion
I am looking for a means to prove or disprove my theory as to why we are locking up as well as any metrics or tools I should look at.
So this issue came down to our application using a query that stitched a table with 2,000,000 records in it to another table. Memory would become so fragmented that the garbage collector was spending more time trying to find places to put objects, and moving them around, than it was running our code. This is why the application appeared to still be working and why there were no exceptions. Oddly, IIS would time out the requests but would continue processing the threads.
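If you suspect GC thrashing like this, one cheap way to confirm it before pulling dumps is to watch the "% Time in GC" counter for the worker process. A sketch (the instance name is an assumption; with several worker processes it may be "w3wp#1" and so on):

using System;
using System.Diagnostics;
using System.Threading;

class GcTimeMonitor
{
    static void Main(string[] args)
    {
        // Instance name is usually the process name, e.g. "w3wp".
        string instance = args.Length > 0 ? args[0] : "w3wp";

        var gcTime = new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance);
        while (true)
        {
            // Sustained values far above the usual few percent mean the CLR is
            // spending its time collecting rather than running application code.
            Console.WriteLine("{0:O}  % Time in GC = {1:F1}", DateTime.Now, gcTime.NextValue());
            Thread.Sleep(5000);
        }
    }
}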
I am developing an application level VSTO 4 Addin for Microsoft Excel 2007 / 2010.
The result is a Windows Forms-based DLL using the .NET 4 Client Profile.
Now I have to use a legacy COM DLL. It is no problem to set the reference and access the COM methods via COM interop from .NET.
But the (synchronous) method I need to call can take a minute or longer to return.
I know your answer:
Use a worker thread...
I have used the Task Parallel Library to put the long-running operation in a worker task and keep the GUI (Excel) responsive.
But: the in-process COM call (in the worker task/thread) still seems to block my GUI thread.
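For reference, the setup looks roughly like this (LegacyComLib.LegacyClass and LongRunningMethod are stand-ins for the real interop type):

using System.Threading.Tasks;

public partial class ThisAddIn
{
    // Stand-in for the real interop type; created on Excel's main (STA) thread.
    private LegacyComLib.LegacyClass legacy;

    private void ThisAddIn_Startup(object sender, System.EventArgs e)
    {
        legacy = new LegacyComLib.LegacyClass();
    }

    public void RunLongOperation()
    {
        // Worker task via the TPL - yet Excel's UI still freezes while it runs.
        Task.Factory.StartNew(() => legacy.LongRunningMethod());
    }
}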
Why? Is it because Excel is always running as an STA (Single-Threaded Apartment)?
How can I keep the Excel GUI responding?
Is there a way to make it really asynchronous?
Thanks for any answers,
Jörg
Finally, I've found an answer to this topic:
I've read a lot about COM threading models and then spoke to the developer of the COM DLL I am calling as an in-process server.
Together we changed the threading model of the COM-DLL:
OLD (blocking): Single-Threaded Apartment (STA), (ThreadingModel=Apartment)
NEW (working): Multi-Threaded Apartment (MTA), (ThreadingModel=Free)
Since we have our own synchronization mechanisms in the COM-DLL, there are no problems caused by the missing synchronization via the standard Windows message queue.
The problem was that even though the UI thread was idle, and even when it called DoEvents, the important window messages (WM_PAINT, etc.) were not delivered.
Now they are. The UI stays responsive at all times, and the call to the COM DLL is still made on a worker thread (as mentioned above, a ThreadPool thread used by the Task Parallel Library).
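To make the mechanics concrete, here is roughly what happens in the worker-task call (a sketch; "legacy" is the interop instance created on Excel's main thread, and the names are stand-ins):

// ThreadingModel=Apartment (old, blocking):
//   the object lives in Excel's STA, so the ThreadPool (MTA) thread only
//   gets a proxy, and COM dispatches the call through the STA's message
//   queue - Excel's own thread executes it, so the UI freezes for the
//   duration of the call.
// ThreadingModel=Free (new, working):
//   the ThreadPool thread enters the object directly; Excel's thread
//   keeps pumping messages and the UI stays responsive.
Task.Factory.StartNew(() => legacy.LongRunningMethod());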
Behavior:
Application is loaded and being used as expected.
Suddenly, a particular DLL can no longer be loaded. The error message is:
ActiveX component cannot create object.
In each case, the object had been created successfully many times before failure. All objects are marked for "retain in memory".
This error is cleared when the application pool is recycled. It may be hours or months before it is seen again.
The issue has happened within two hours of a refresh, and has also stayed away for months of uptime.
The issue has happened with hundreds of simultaneous users (heavy usage) and also with 1-3 users.
While the issue is occurring, the process running that application pool cannot create the object that is failing; however, it can create any other object. Memory, CPU, and other resources all remain at normal usage. In addition, other processes (such as a stand-alone exe) can successfully create the object.
The first instance of the issue appeared in mid-2008. There have been fewer than fifty instances since then, despite a pool of hundreds of servers for it to occur on. All instances except one have failed on the same DLL.
DLL Failure Info:
Most common: a generic data structure implementing a B-tree; it has no references other than to its interface. The code consists of arrays and one use of the VB6 Event functionality. The object has not been changed in any way since 2005.
One-time: interop to a .NET module; the failure occurred when creating the interop object, not the .NET object. This object is updated a few times each year.
Application Environment:
IIS hosted application
VB6, classic ASP, some interop to minor .NET components
Windows Server 2003 / Windows Server 2008 (both have independently had the problem)
Attempts to Reproduce:
Using scripts (and real-life humans) to run the same end-user workflows that our logs reported the days before the issue occurred.
Using scripts to create/destroy suspected objects as fast as possible from multiple simultaneous sessions.
Wild speculation.
No intentional success, but it does manifest randomly on the servers on its own.
Troubleshooting:
Code reviews
Test harnesses to investigate upper limits of object creation / destruction
Verification of ability to create object outside of the process experiencing the issue
Monitoring of resources over time on servers under load
Review of IIS, error, and event logs to determine events leading up to issue
Questions:
Any ideas on how to reproduce the issue?
What could cause this behavior?
Ideas for bypassing the first two questions in favor of a fast solution?
The DLL isn't on a network drive, is it? You can get "glitches" where the drive is momentarily unavailable, which means COM can't do what it needs, and it can then fail to notice that the drive is available again.
I used Process Monitor to debug a similar problem when accessing the ADO/OLEDB stack. It turned out the environment got corrupted at some point; the ADO classes are registered with an InprocServer32 value of type REG_EXPAND_SZ pointing to %CommonProgramFiles%\System\ado\msado15.dll (or similar on x64 OSes).
Also, when you register an application with Restart Manager, on failure the process gets restarted by the winlogon process, whose environment is different from explorer's and, unfortunately, is missing %CommonProgramFiles% -- ouch!
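A sketch of the check that catches this: read the class's InprocServer32 value unexpanded, expand it yourself, and verify the file resolves. The CLSID below is the well-known one for ADODB.Recordset and is only an example; substitute the CLSID of the failing object, and run the check at the same bitness as the failing process so you see the same registry view.

using System;
using System.IO;
using Microsoft.Win32;

class InprocServerCheck
{
    static void Main()
    {
        string clsid = "{00000535-0000-0010-8000-00AA006D2EA4}"; // example: ADODB.Recordset

        using (var key = Registry.ClassesRoot.OpenSubKey(@"CLSID\" + clsid + @"\InprocServer32"))
        {
            // Read the raw REG_EXPAND_SZ value without letting the registry API expand it.
            string raw = (string)key.GetValue(null, null,
                RegistryValueOptions.DoNotExpandEnvironmentNames);
            string expanded = Environment.ExpandEnvironmentVariables(raw);

            Console.WriteLine("Raw:      " + raw);
            Console.WriteLine("Expanded: " + expanded);

            // If %CommonProgramFiles% is missing from this process's environment,
            // 'expanded' still contains the %...% token and the file check fails -
            // exactly the symptom described above.
            Console.WriteLine("Exists:   " + File.Exists(expanded));
        }
    }
}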
This seems like a random failure; some race condition.
Try VMware to record the state of the machine you run this DLL on. When the error happens you can then replay the recording and inspect the memory contents. That way you won't have to play try-and-catch with the error. At the very least you will have a solid record of it.
While I can't provide a solution, try catching the error and retrying the DLL load when this happens after a refresh of the environment.
We have an application generated using the Sculpture software package. That means the project is roughly equivalent to the code in a Prism application.
Part of their model is that all WCF Service calls are performed synchronously, but on background threads (actually they are async calls as well, but the Sculpture background thread methods wait around for the response before executing any following code).
When we deployed the application, we found that around 50% of all machines tested would not get past the first service call. We cannot see any pattern in the machines that fail: both Debug and Release Silverlight runtimes, on Windows 7, appear among the machines that work as well as among those that fail. It fails the same way in different browsers, so it is machine-specific. The only clue is that they all seem to be older PCs.
Ideas anyone?
Found the cause. There is a schoolboy error in their generated service calls.
What's wrong with this picture?:
while (true == userState.IsBusy)
{}
Ignoring the old-school use of true == (not needed in C#), their while loop basically locks up so tight on some machines that the IsBusy state is never seen to change. It also means the application was running at 100% processor use whenever a service call was made.
We fixed the problem by adding Thread.Sleep(100) to all the service-call while loops, e.g.:
while (userState.IsBusy)
{
Thread.Sleep(100);
}
Our app is now working on all Silverlight-capable machines (as it should) and is using a lot less processor to boot.
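An even cleaner fix, if you can edit the generated code, is to wait on an event instead of polling. A sketch, where MyServiceClient and its GetDataAsync/GetDataCompleted members are stand-ins for the real generated proxy:

using System.Threading;

public object CallServiceBlocking(MyServiceClient client)
{
    object result = null;
    var done = new ManualResetEvent(false);

    // The completion callback captures the result and signals the event.
    client.GetDataCompleted += (s, e) => { result = e.Result; done.Set(); };
    client.GetDataAsync();

    // Parks the worker thread instead of spinning at 100% CPU. Must not be
    // called on the UI thread: Silverlight dispatches the Completed event
    // there, so blocking it would deadlock.
    done.WaitOne();
    return result;
}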
To be fair, we are not using the very latest release of Sculpture, but it was quite surprising to see such a silly mistake in a commercial package.
IIS 6.0 hangs, then the app pool resets after approximately 3 minutes. This is a classic ASP site; upon reset it functions fine for a few seconds, then hangs. All other app pools on this instance of IIS 6 function correctly. There do not appear to be any performance issues with this machine. I took a memory dump using IIS Debug Diagnostics; below is the rendered analysis. Can anyone please lend some support?
Analysis Summary

Warning: Detected possible blocking or leaked critical section at ntdll!LdrpLoaderLock owned by thread 24 in w3wp.exe__SupportSiteAppPool__PID__3960__Date__07_23_2009__Time_02_22_36PM__551__ManualDump.dmp

Impact of this lock:
66.67% of executing ASP requests blocked
22.58% of threads blocked (threads 6, 22, 23, 27, 28, 29, 30)

The following functions are trying to enter this critical section:
ntdll!LdrLockLoaderLock+133
ntdll!LdrpGetProcedureAddress+128
ntdll!LdrpInitializeThread+68

The following module(s) are involved with this critical section:
C:\WINDOWS\system32\ntdll.dll from Microsoft Corporation

The entry-point function for a dynamic link library (DLL) should perform only simple initialization or termination tasks; however, this thread (24) is loading a DLL using the LoadLibrary API. Follow the guidance in the MSDN documentation for DllMain to avoid access violations and deadlocks while loading and unloading libraries. Please follow up with the vendor Microsoft Corporation for C:\WINDOWS\system32\mscoree.dll.

Warning: Detected possible blocking or leaked critical section at asp!g_ViperReqMgr+2c owned by thread 8 in the same dump.

Impact of this lock:
6.45% of threads blocked (threads 7, 9)

The following functions are trying to enter this critical section:
asp!CViperActivity::PostAsyncRequest+72

The following module(s) are involved with this critical section:
\\?\C:\WINDOWS\system32\inetsrv\asp.dll from Microsoft Corporation

The following vendors were identified for follow-up based on root cause analysis: Microsoft Corporation. Please follow up with the vendors identified above.

Consider the following approach to determine root cause for this critical section problem: enable 'lock checks' in Application Verifier. Download Application Verifier from the following URL: Microsoft Application Verifier. Enable 'lock checks' for this process by running the following command:

Appverif.exe -enable locks -for w3wp.exe

See the following document for more information on Application Verifier: Testing Applications with AppVerifier. Use a DebugDiag crash rule to monitor the application for exceptions.
Your classic ASP app is failing because all threads are blocked. I suggest running Process Monitor on the web server to see which handles are taken up and where. I don't see a lot of repetition in your stack trace that would indicate a problem with a particular DLL.
Given the information provided, it sounds like a problem with the application itself rather than with IIS. Have you made sure there aren't any crazy tight loops, excessive or extremely heavy DB loads, possibly some P/Invoke calls, or just something out of the ordinary for a web app, that is killing the application/runtime and causing the pool to die?
I think you should try tools like Fiddler. With those you can get an exact idea of what is taking so long when loading your site. From the log it seems there is a problem with the application itself, so don't use excessive loops, cache data from the database where you can, and don't store large objects in Session or Application.