We have a web application running on Tomcat 8.
Here is the system info:
Tomcat Version: Apache Tomcat/8.0.32 (Ubuntu)
JVM Version: 1.8.0_121-b13
JVM Vendor: Oracle Corporation
OS Name: Linux
OS Version: 4.4.0-104-generic
OS Architecture: amd64
Every time (after a REST controller call, redeploy, restart, etc.) errors appear in the log, all of them related to unstopped threads:
WARNING: The web application [newlps-1.0] appears to have started a thread named [CleanCursors-1-thread-1] but has failed to stop it. This is very likely to create a memory leak.
INFO: Illegal access: this web application instance has been stopped already. Could not load [com.mongodb.connection.ClusterDescription]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
The warning appears for several thread names, not just one...
I have searched a lot on the internet about this, but have not found anything that really answers what to do...
I suppose that:
1. A thread needs to "exit" after it has finished.
2. The GC should clean up old, unreferenced objects.
Why doesn't it work that way?
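For reference, the usual way to let driver threads like CleanCursors-* exit is to shut down the client that created them when the webapp is stopped or redeployed. Below is a minimal sketch, assuming the application creates its own MongoClient; the listener class and names are hypothetical, not the actual application code.

    import com.mongodb.MongoClient;

    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;

    // Hypothetical listener: closes the MongoClient when the webapp stops,
    // so the driver's background threads can terminate with the webapp.
    @WebListener
    public class MongoShutdownListener implements ServletContextListener {

        // In a real app this would be the same MongoClient instance the
        // application uses (e.g. held by a DI container or a static holder).
        private MongoClient mongoClient;

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            mongoClient = new MongoClient("localhost", 27017); // assumption: local MongoDB
            sce.getServletContext().setAttribute("mongoClient", mongoClient);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) {
            if (mongoClient != null) {
                mongoClient.close(); // stops the driver's internal threads
            }
        }
    }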
Related
We are experiencing intermittent catastrophic failures of the COM runtime in a large server application.
Here's what we have:
A server process running as a Windows service hosts numerous free-threaded COM components written in C++/ATL. Multiple client processes written in C++/MFC and .NET use these components via cross-process COM calls (including .NET interop) on the same machine. The OS is Windows Server 2008 Terminal Server (32-bit).
The entire software suite was developed in-house; we have the source code for all components. A tracing toolkit writes out errors and exceptions generated during operation.
What is happening:
After some random period of smooth sailing (5 days to 3 weeks) the server's COM runtime appears to fall apart with any combination of these symptoms:
RPC_E_INVALID_HEADER (0x80010111) - "OLE received a packet with an invalid header" returned to the caller on cross-process calls to server component methods
Calls to CoCreateInstance (CCI) fail for the CLSCTX_LOCAL_SERVER context
CoInitializeEx(COINIT_MULTITHREADED) calls fail with CO_E_INIT_TLS (0x80004006)
All in-process COM activity continues to run, CCI works for CLSCTX_INPROC_SERVER.
The overall system remains responsive, SQL Server works, no signs of problems outside of our service process.
System resources are OK, no memory leaks, no abnormal CPU usage, no thrashing
The only remedy is to restart the broken service.
Other (related) observations:
The number of cores on the CPU has an adverse effect - a six-core Xeon box fails after roughly 5 days, smaller boxes take 3 weeks or longer.
.NET interop might be involved, as running a lot of calls across interop from .NET clients to unmanaged COM server components also adversely affects the system.
Switching on the tracing code inside the server process prolongs the working time to the next failure.
Tracing does introduce some partial synchronization and thus can hide multithreaded race condition effects. On the other hand, running on more cores with hyperthreading runs more threads in parallel and increases the failure rate.
Has anybody experienced similar behaviour or even actually come across the RPC_E_INVALID_HEADER HRESULT? There is virtually no useful information to be found on that specific error and its potential causes.
Are there ways to peek inside the COM Runtime to obtain more useful information about COM's private resource pool usage like memory, handles, synchronization primitives? Can a process' TLS slot status be monitored (CO_E_INIT_TLS)?
We are confident that we have pinned down the cause of this defect to a resource leak in the .NET framework 4.0.
Installations of our server application running on .NET 4.0 (clr.dll: 4.0.30319.1) show the intermittent COM runtime breakdown and are easily fixed by updating the .NET framework to version 4.5.1 (clr.dll: 4.0.30319.18444).
Here's how we identified the cause:
Searches on the web turned up an entry in an MSDN forum: http://social.msdn.microsoft.com/Forums/pt-BR/f928f3cc-8a06-48be-9ed6-e3772bcc32e8/windows-7-x64-com-server-ole32dll-threads-are-not-cleaned-up-after-they-end-causing-com-client?forum=vcmfcatl
The OP there described receiving the HRESULT RPC_X_BAD_STUB_DATA (0x800706f7) from CoCreateInstanceEx(CLSCTX_LOCAL_SERVER) after running a COM server with an interop app for some length of time (a month or so). He tracked the issue down to a thread resource leak that was observable indirectly via an incrementing variable inside ole32.dll : EventPoolEntry::s_initState that causes CCI to fail once its value becomes 0xbfff...
An inspection of EventPoolEntry::s_initState in our faulty installations revealed that its value started out at approx. 0x8000 after a restart and then constantly gained between 100 and 200+ per hour with the app running under normal load. As soon as s_initState hit 0xbfff, the app failed with all the symptoms described in our original question. The OP in the MSDN forum suspected a COM thread-local resource leak as he observed asymmetrical calls to thread initialization and thread cleanup - 5 x init vs. 3 x cleanup.
By automatically tracing the value of s_initState over the course of several days we were able to demonstrate that updating the .NET framework to 4.5.1 from the original 4.0 completely eliminates the leak.
I have a strange problem with my multi-threaded server. It is a Windows service and works similarly to an FTP server, managing socket connections to many clients. It was created with Delphi 2006 (Turbo Delphi) and works well on most machines. Unfortunately, on some machines it sometimes crashes without any trace from itself (exceptions should be saved to a log, but they are not). Sometimes the system shows a MessageBox (it is not a MessageBox from my service; I think it is a system MessageBox), but most often I see the following information in the System EventLog:
Application popup: ht_switch.exe - Application Error : The exception unknown software exception (0x0eedfade) occurred in the application at location 0x77e4bef7.
In the Application EventLog I can see:
Faulting application ht_switch.exe, version 1.2.0.2, faulting module kernel32.dll, version 5.2.3790.5069, fault address 0x0000bef7.
Sometimes such entries appear in the Application or System EventLog and nothing happens -- my server keeps working as usual -- but sometimes it simply disappears. Then the Service Manager reports in the EventLog that my service stopped unexpectedly.
I see no "common" scenario for this problem. It appears on some WinXP, Win2003 and Win2008 machines. All test machines have all MS patches applied.
I have read the answers to 0x0eedfade kernelbase.dll faulting module in d7 windows service, but I do not use the Dialog unit.
What can I do to fix it? How can I trace such a 0x0eedfade exception?
EDIT
For several days I tested my server with both EurekaLog and madExcept.
EurekaLog:
The server works without problems. No exception is reported in the EventLog. No exception is reported in %AppData%\EurekaLab s.a.s\EurekaLog\Bug Reports\ (there should be a directory for my program, but it was not created -- I don't know whether it should have been created or whether this is a EurekaLog error).
EurekaLog 7 has a problem with setting the "Application Type" to Windows Service. It is a known problem and the authors are working on it. My service compiled with it works on WinXP but would not work on Win2003 -- it simply does not start.
madExcept:
The server worked for 4 hours and then crashed. I caught this exception in my thread:
EAccessViolation: Access violation at address 7C90100B in module 'ntdll.dll'. Read of address 00000018!!!
I did not notice any madExcept report for this exception. After this exception, one thread was lost with its socket in the CLOSE_WAIT state (the other side closed the connection). Then I restarted my service, and for the next few hours it worked without problems.
EurekaLog and madExcept disabled:
After 10-30 minutes I see a MessageBox with the error. But the 0x0eedfade error is cryptic and gives me no hint about the source of the problem. It is also very strange that after displaying such a message the service keeps working without problems (most of the time).
Summary of the exception interceptors:
EurekaLog and madExcept are probably good at catching exceptions raised by Delphi, but it seems that they change the behavior of my service: either the error magically disappears or they report the exception to some place I cannot find.
EDIT: Problem solved
After some debugging that led me nowhere (the call stack pointed to very strange places), I gave up on it and started to inspect the most recently committed changes. One change was a string operation where a string (AnsiString) can be of length 64 or 128 (it is used as a kind of bit mask). I set the 70th character of a string that had earlier been allocated with SetLength(buffer, 64). That was the problem. I think I would have saved time by enabling range checking.
How can I trace such a 0x0eedfade exception?
This is the code for a Delphi exception. Clearly you are raising a Delphi exception that is not being handled and that is bringing your process down.
You should add madExcept, EurekaLog, JCLDebug or similar to your process. These tools will produce diagnostics reports when your process fails. The most useful part of those reports will be the stack trace at the point of failure. You should be able then to work out where the failure occurs, at the very least, and this usually is enough to work out what is wrong with your code.
When I use Java VisualVM to monitor my JBoss application, it shows:
Live Threads: 155
Daemon Threads: 135
When I use the JMX Web Console of JBoss, it shows:
Current Busy Threads: 40
Current Thread Count: 60
Why is there such a discrepancy between what Java VisualVM reports and what the JMX Web Console shows? (How are live threads different from busy threads?)
A live thread is one that exists and is not Terminated. (See Thread.State)
A busy thread is one that is actually working or, more precisely, Runnable.
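A rough in-process sketch of that distinction, using the standard ThreadMXBean API (the class name and output labels are arbitrary):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadCounts {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();

            // "Live" threads: everything that exists and has not terminated,
            // regardless of whether it is doing any work right now.
            int live = mx.getThreadCount();
            int daemon = mx.getDaemonThreadCount();

            // "Busy" threads in the RUNNABLE sense: currently eligible to run on a CPU.
            int runnable = 0;
            for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
                if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                    runnable++;
                }
            }

            System.out.println("Live threads:     " + live);
            System.out.println("Daemon threads:   " + daemon);
            System.out.println("Runnable threads: " + runnable);
        }
    }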
JBoss's Web Console tends to report fewer threads because it is very non-invasive. In other words, it does not have to spawn additional threads just to render you a web page. It's already running a web server and it already allocated threads to handle web requests before you went into JMX Console.
VisualVM, on the other hand, starts up several threads to support JMX remoting (usually RMI), which comes with a little extra baggage. You might see extra threads like:
RMI TCP Connection(867)
RMI TCP Connection(868)
RMI TCP Connection(869)
JMX server connection timeout
Having said that, the discrepancy you are reporting is way out of line and makes me think that you're not looking at the same JVM.
The JMX Console is obvious :), so I would guess your VisualVM is connected elsewhere. See if you can correlate similar thread names (using the listThreadDump operation of the jboss.system:type=ServerInfo MBean), or browse the MBeans in VisualVM and inspect the JBoss MBeans. MBeans like the following are good ones to look at because they indicate a binding to a socket, so they could not have the same values if they were not the same JVM process:
jboss.web:name=HttpRequest1,type=RequestProcessor,worker=http-0.0.0.0-18080
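For example, here is a sketch of invoking that listThreadDump operation from plain Java; the JMX service URL is an assumption and depends on how your JBoss instance exposes its connector:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ListThreadDump {
        public static void main(String[] args) throws Exception {
            // Assumption: the target JVM exposes a standard JMX/RMI connector on
            // port 9999; adjust the URL (and add credentials) to match your setup.
            JMXServiceURL url =
                    new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Same operation the JMX Console exposes on jboss.system:type=ServerInfo.
                Object dump = conn.invoke(
                        new ObjectName("jboss.system:type=ServerInfo"),
                        "listThreadDump",
                        new Object[0],
                        new String[0]);
                System.out.println(dump);
            } finally {
                connector.close();
            }
        }
    }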
Of course, the other thing would be that if you start VisualVM first, leave it running, and then go to the JMX Console and don't see as many threads, you're definitely in a different VM.
Cheers.
//Nicholas
We have a CherryPy service that integrates with several backend web services. During load testing, the CherryPy process regularly crashes after a while (about 45 minutes). We know the bottleneck is the backend web services we are using. Before the crash we see 500 and 503 errors when accessing the backend services, but I can't figure out why CherryPy itself crashes (the whole process gets killed). Can you give me ideas on how to investigate where the problem is? Is it possible that the thread_pool (50) is queueing up too many requests?
In my early CherryPy days I had it crashing once. I mean a Python process crash caused by a segfault. When I investigated it, I found that I had messed with MySQLdb connections, caching them in objects that were accessed by CherryPy threads interchangeably. Because a MySQLdb connection is not thread-safe, it should be accessed only from the thread it was created in. Also, because of the concurrency involved, the crashes seemed nondeterministic and only appeared in load testing. So load testing can work as a debugging tool here -- try Apache JMeter or Locust (Pythonic).
When a process crashes you can instruct Linux to write a core dump, which will contain a stack trace (e.g. on the MySQLdb C-code side in my example). However alien a low-level C environment may be to you (it is to me), the stack trace can help you find which library is causing the crash, or at least narrow the circle of suspects. Here is an article about it.
Also, I want to note that the problem is unlikely to be in CherryPy itself. It is actually very stable.
IIS 6.0 hangs, then the app pool resets after approximately 3 minutes. This is an ASP site; upon reset it works fine for a few seconds, then hangs. All other app pools on this instance of IIS 6 function correctly. There do not appear to be any performance issues with this machine. I took a memory dump using IIS Debug Diagnostics, and this is the rendered analysis. Can anyone please lend some support?
Analysis Summary

Warning: Detected possible blocking or leaked critical section at ntdll!LdrpLoaderLock owned by thread 24 in w3wp.exe__SupportSiteAppPool__PID__3960__Date__07_23_2009__Time_02_22_36PM__551__Manual Dump.dmp

Impact of this lock:
66.67% of executing ASP requests blocked
22.58% of threads blocked (threads 6 22 23 27 28 29 30)

The following functions are trying to enter this critical section:
ntdll!LdrLockLoaderLock+133
ntdll!LdrpGetProcedureAddress+128
ntdll!LdrpInitializeThread+68

The following module(s) are involved with this critical section:
C:\WINDOWS\system32\ntdll.dll from Microsoft Corporation

The entry-point function for a dynamic link library (DLL) should perform only simple initialization or termination tasks; however, this thread (24) is loading a DLL using the LoadLibrary API. Follow the guidance in the MSDN documentation for DllMain to avoid access violations and deadlocks while loading and unloading libraries. Please follow up with the vendor Microsoft Corporation for C:\WINDOWS\system32\mscoree.dll.

Warning: Detected possible blocking or leaked critical section at asp!g_ViperReqMgr+2c owned by thread 8 in w3wp.exe__SupportSiteAppPool__PID__3960__Date__07_23_2009__Time_02_22_36PM__551__Manual Dump.dmp

Impact of this lock:
6.45% of threads blocked (threads 7 9)

The following functions are trying to enter this critical section:
asp!CViperActivity::PostAsyncRequest+72

The following module(s) are involved with this critical section:
\?\C:\WINDOWS\system32\inetsrv\asp.dll from Microsoft Corporation

The following vendors were identified for follow up based on root cause analysis: Microsoft Corporation. Please follow up with the vendors identified above.

Consider the following approach to determine the root cause of this critical section problem: enable 'lock checks' in Application Verifier. Download Application Verifier from the following URL: Microsoft Application Verifier. Enable 'lock checks' for this process by running the following command:

Appverif.exe -enable locks -for w3wp.exe

See the following document for more information on Application Verifier: Testing Applications with AppVerifier. Use a DebugDiag crash rule to monitor the application for exceptions.
Your ASP Classic App is failing because all threads are blocked. I suggest running Process Monitor on the web server to see what handles are taken up where. I don't see a lot of repetition in your stack trace that would indicate a problem with a particular dll.
Given the information provided, it sounds like a problem with the application itself rather than with IIS. Have you made sure there aren't any crazy tight loops, excessive or extremely heavy DB loads, perhaps some PInvoke calls, or just something out of the ordinary for a web app that is killing the application/runtime and causing the pool to die?
I think you should try tools like Fiddler. With those you can get an exact idea of what is taking so long to load your site. From the log it seems that there is a problem with the application itself, so don't use excessive loops, cache data from the DB and reuse it, and don't store large objects in Session or Application.