Apparent classloader leak in Play! 2.3.4 - multithreading

I have walked several heap snapshots of a Play 2.3.4 application in JProfiler and VisualVM (ironically, VisualVM seems to be more helpful) and found that on a Play! reload the old classloaders are not properly replaced: several copies of obsolete classloaders, each retaining instances of old classes, stay in memory. After several application reloads the application crashes with an out-of-memory error (the heap is consistently exhausted before the permanent generation, probably due to the memory-intensive nature of the application).
While tracking down GC roots, I found the following incriminating evidence: there are 4 live instances of PlayRun$$anonfun$10$$anon$2, and as I understand it there should only ever be 1. Notably, each of these classloader instances contains a duplicate copy of my application's classes:
GC roots of the 4 PlayRun$$anonfun$10$$anon$2 instances, retained via references held by threads:
contextClassLoader of BoneCP-keep-alive-scheduler
inheritedAccessControlContext of play-akka.actor.default-dispatcher-27
contextClassLoader of play-akka.actor.default-dispatcher-32
contextClassLoader of pool-16-thread-3 (java.util.concurrent executor service?)
Why are these other threads retaining references to obsolete Play! application classloaders? Doesn't Play! shut down dependent threads like these to safeguard against this? Is it possible that some phase of the reload process failed to execute properly, leaving objects retained in this bad state?
The application is built on top of Play! 2.3.4 and SBT 0.13.6. This problem did not occur prior to upgrading from Play! 2.2.2 / SBT 0.13.1.
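For context on the contextClassLoader roots listed above: any thread started while the application classloader is the current context classloader inherits a reference to it, so a pool or scheduler that is not shut down on reload keeps the old loader, and every class it loaded, strongly reachable. A minimal sketch of the pattern (hypothetical names, not from the application in question):

    // Hypothetical sketch: a pool created at application start. Its worker threads
    // inherit the application classloader as their context classloader, so unless
    // the pool is shut down on Play's stop/reload hook, the old classloader stays
    // reachable across dev-mode reloads.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AppScheduler {
        private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

        public static void submit(Runnable task) {
            POOL.submit(task);
        }

        // Call this from the application's stop hook (e.g. GlobalSettings.onStop in
        // Play 2.3) so the worker threads, and the classloader they pin, can be collected.
        public static void shutdown() {
            POOL.shutdownNow();
        }
    }

The same reasoning would apply to third-party pools such as the BoneCP keep-alive scheduler listed above: if the library is not cleanly stopped on reload, its threads keep the previous application classloader alive.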

Related

What is starting all of these threads in my Spring application

I just started working on a legacy Spring Boot application and noticed it was not shutting down cleanly; it required a hard kill to close it.
I did a thread dump and I see Spring is launching lots of threads that just won't die.
Ok, says I, it must be the @Async's and @EnableAsync; we have a number of those to handle initialization. I removed all of them, but no change.
Then I thought it might be Micrometer (we do a lot of instrumentation using @Timed), but I removed those and there was no change to the thread dump.
I searched for any instance of any kind of executor, but nothing turned up.
What could be starting all of these threads and QuartzScheduler_Workers?
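One way to narrow this down is to look at which threads can actually block shutdown: only live non-daemon threads keep the JVM alive. A minimal diagnostic sketch (not from the original post) that lists them:

    // Print every live non-daemon thread with its state and top stack frame;
    // daemon threads are skipped because they cannot prevent JVM exit.
    import java.util.Map;

    public class ShutdownBlockers {
        public static void dump() {
            for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
                Thread thread = entry.getKey();
                if (thread.isDaemon() || !thread.isAlive()) {
                    continue;
                }
                StackTraceElement[] stack = entry.getValue();
                System.out.printf("%-45s state=%-13s top=%s%n",
                        thread.getName(),
                        thread.getState(),
                        stack.length > 0 ? stack[0] : "<no frames>");
            }
        }
    }

Quartz worker threads, for example, are typically non-daemon unless the scheduler's thread pool is explicitly configured to use daemon threads.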

JVM full GC can't unload classes even when permgen is full

Our production server went OOM because permgen was full. Using jmap -permstat to inspect the permgen area, we found many classes loaded by com.sun.xml.ws.client.WSServiceDelegate$DelegatingLoader. The loaded classes are com.sun.proxy.$ProxyXXX, where XXX is an increasing integer sequence.
The stack trace for this class loading is as follows:
Eventually the JVM went OOM; full GC could not reclaim any permgen memory.
What is strange is that if I click System GC in VisualVM, the classes are unloaded and the usage of permgen goes down.
Our JDK version is 1.7.0_80, and we have -XX:+CMSClassUnloadingEnabled set. Our full set of flags:
-XX:+ExplicitGCInvokesConcurrent
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSClassUnloadingEnabled
-XX:MaxTenuringThreshold=18
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=4
-XX:ParallelGCThreads=16
Our code has been running unchanged for a long time; the most recent change was a WebLogic patch. This really confuses me. Could someone give me some help with this issue? Many thanks!
This is a known bug https://github.com/javaee/metro-jax-ws/issues/1161
Every time a JAX-WS client is created (for instance with the JAX-WS RI 2.2 library bundled in WebLogic Server 12.1.3), client proxy classes are loaded into a new classloader instance such as com.sun.xml.ws.client.WSServiceDelegate$DelegatingLoader#1:
[Loaded com.sun.proxy.$Proxy979 from com.sun.xml.ws.client.WSServiceDelegate$DelegatingLoader]
Solution/Workaround:
Replace the JAX-WS client library with a version in which this bug is fixed.
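To make the failure mode concrete, here is a hedged sketch (placeholder WSDL URL, QName, and port interface, none of them from the original question) of the usage pattern that accumulates $Proxy classes on the affected RI; until the library is replaced, creating the Service and port once and reusing them at least bounds the growth:

    // Hypothetical reproduction sketch: on the affected JAX-WS RI, each Service/port
    // creation loads another com.sun.proxy.$ProxyNNN class into a delegating loader
    // that is never released, so permgen usage grows with every iteration.
    import javax.jws.WebService;
    import javax.xml.namespace.QName;
    import javax.xml.ws.Service;
    import java.net.URL;

    public class ProxyLeakSketch {

        @WebService
        public interface ExamplePort {       // placeholder SEI standing in for the real port type
            String ping(String message);
        }

        public static void main(String[] args) throws Exception {
            URL wsdl = new URL("http://example.com/service?wsdl");                   // placeholder WSDL
            QName serviceName = new QName("http://example.com/", "ExampleService");  // placeholder QName

            for (int i = 0; i < 10_000; i++) {
                Service service = Service.create(wsdl, serviceName);
                ExamplePort port = service.getPort(ExamplePort.class); // new $Proxy class per call on the buggy RI
                // ... use port, then drop the reference; the generated proxy class itself stays loaded
            }
        }
    }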

Jboss-6.1 Application running very slow

My application runs on JBoss 6.1, and after a few days it becomes very slow. This is the situation I face every week. To work around it, I kill the Java process, clear the temp and work folders, and restart JBoss. Are there other ways to clean up memory / manage the application? Kindly give me suggestions for both Linux and Windows platforms.
Any help is appreciated.
Thanks & Regards,
Sharath
Based on the amount of RAM in your system, you can increase the following parameters in run.conf (for Linux) or run.conf.bat (for Windows):
-Xms, -Xmx, -XX:MaxPermSize. For example:
-Xms512M -Xmx1024M -XX:MaxPermSize=128M
The -Xmx flag specifies the maximum memory allocation pool for the Java Virtual Machine (JVM), while -Xms specifies the initial memory allocation pool.
-XX:MaxPermSize sets the size of the Permanent Generation.
The Permanent Generation is where class files are kept; these are the result of compiled classes and JSP pages. If this space is full, it triggers a Full Garbage Collection. If the Full Garbage Collection cannot clean out old unreferenced classes and there is no room left to expand the Permanent Space, an Out-of-Memory Error (OOME) is thrown and the JVM will crash.
Hope you are aware of these three flags.
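If it is unclear whether the heap or the permanent generation is the space that actually fills up between restarts, the standard management API can report per-pool usage at runtime (jstat or VisualVM will show the same numbers); a small sketch:

    // Print usage for each JVM memory pool; on a permgen-era JVM (Java 7 and earlier)
    // one pool will be named something like "PS Perm Gen" or "CMS Perm Gen".
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class MemoryPoolReport {
        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                MemoryUsage usage = pool.getUsage();
                System.out.printf("%-20s used=%,12d max=%,12d%n",
                        pool.getName(), usage.getUsed(), usage.getMax());
            }
        }
    }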

Is there a way to fail an automated test upon Netty leak detection?

I'm using Netty 4.0.x on a project where a separate core project creates ByteBuf buffers and passes them to the client code layer, which should be responsible for releasing them.
I've found leaks in some cases and I'd like to cover the codepath leading to those leaks with an automated test, but Netty's ResourceLeakDetector seems to only report leaks inside logs.
Is there a way to fail an automated JUnit test in the event of such a leak (e.g. by plugging some behavior into the ResourceLeakDetector)?
Thanks!
PS: Keep in mind that my test wouldn't create the buffers itself; the core code (which is a dependency) does.
Netty's own CI has a leak build that marks the build as unstable if leaks are detected. I'm not sure of the exact mechanism, but automated detection is possible (probably by scanning the build logs for leak messages).
Keep in mind that leaks are only detected when the leaked ByteBuf objects are garbage collected, so when/where a leak is reported may be far removed from when/where the leak actually occurred. Netty does include a trail of the places where the ByteBuf was accessed in meaningful ways, to help you trace back where the buffer was originally allocated and, potentially, which objects may be responsible for releasing it.
If you are OK with the above limitations, you could patch ResourceLeakDetector to throw an exception for your private use.
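A hedged sketch of that idea, assuming a Netty release that exposes ResourceLeakDetectorFactory and the protected reportTracedLeak/reportUntracedLeak hooks (later 4.0.x / 4.1 lines); on older 4.0.x versions you would need the privately patched detector described above:

    // Record every leak Netty reports so a test can fail on it, in addition to the log message.
    import io.netty.util.ResourceLeakDetector;
    import io.netty.util.ResourceLeakDetectorFactory;

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class RecordingLeakDetector<T> extends ResourceLeakDetector<T> {
        // shared across all detector instances created during the test run
        public static final Queue<String> LEAKS = new ConcurrentLinkedQueue<String>();

        public RecordingLeakDetector(Class<?> resourceType, int samplingInterval, long maxActive) {
            super(resourceType, samplingInterval, maxActive);
        }

        @Override
        protected void reportTracedLeak(String resourceType, String records) {
            LEAKS.add(resourceType + records);
            super.reportTracedLeak(resourceType, records);
        }

        @Override
        protected void reportUntracedLeak(String resourceType) {
            LEAKS.add(resourceType);
            super.reportUntracedLeak(resourceType);
        }

        // Install before any ByteBufAllocator is used, e.g. in a JUnit @BeforeClass method.
        public static void install() {
            ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
            ResourceLeakDetectorFactory.setResourceLeakDetectorFactory(new ResourceLeakDetectorFactory() {
                @Override
                public <T> ResourceLeakDetector<T> newResourceLeakDetector(
                        Class<T> resource, int samplingInterval, long maxActive) {
                    return new RecordingLeakDetector<T>(resource, samplingInterval, maxActive);
                }
            });
        }
    }

A test tear-down would then call System.gc(), allocate and release a throwaway buffer so the detector gets a chance to drain its reference queue, and assert that RecordingLeakDetector.LEAKS is empty. As noted above, this only proves a leak exists somewhere in the code paths the test exercised, not exactly where it happened.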

COM Runtime Breakdown in Multithreaded Server Application

We are experiencing intermittent catastrophic failures of the COM runtime in a large server application.
Here's what we have:
A server process running as a Windows service hosts numerous free-threaded COM components written in C++/ATL. Multiple client processes written in C++/MFC and .NET use these components via cross-process COM calls (including .NET interop) on the same machine. The OS is Windows Server 2008 Terminal Server (32-bit).
The entire software suite was developed in-house, we have the source code for all components. A tracing toolkit writes out errors and exceptions generated during operation.
What is happening:
After some random period of smooth sailing (5 days to 3 weeks) the server's COM runtime appears to fall apart with any combination of these symptoms:
RPC_E_INVALID_HEADER (0x80010111) - "OLE received a packet with an invalid header" returned to the caller on cross-process calls to server component methods
Calls to CoCreateInstance (CCI) fail for the CLSCTX_LOCAL_SERVER context
CoInitializeEx(COINIT_MULTITHREADED) calls fail with CO_E_INIT_TLS (0x80004006)
All in-process COM activity continues to run, CCI works for CLSCTX_INPROC_SERVER.
The overall system remains responsive, SQL Server works, no signs of problems outside of our service process.
System resources are OK, no memory leaks, no abnormal CPU usage, no thrashing
The only remedy is to restart the broken service.
Other (related) observations:
The number of cores on the CPU has an adverse effect - a six core Xeon box fails after roughly 5 days, smaller boxes take 3 weeks or longer.
.NET interop might be involved, as running a lot of calls across interop from .NET clients to unmanaged COM server components also adversely affects the system.
Switching on the tracing code inside the server process prolongs the working time to the next failure.
Tracing does introduce some partial synchronization and thus can hide multithreaded race condition effects. On the other hand, running on more cores with hyperthreading runs more threads in parallel and increases the failure rate.
Has anybody experienced similar behaviour or actually come across the RPC_E_INVALID_HEADER HRESULT? There is virtually no useful information to be found on that specific error and its potential causes.
Are there ways to peek inside the COM Runtime to obtain more useful information about COM's private resource pool usage like memory, handles, synchronization primitives? Can a process' TLS slot status be monitored (CO_E_INIT_TLS)?
We are confident that we have pinned down the cause of this defect to a resource leak in the .NET Framework 4.0.
Installations of our server application running on .NET 4.0 (clr.dll: 4.0.30319.1) show the intermittent COM runtime breakdown; they are easily fixed by updating the .NET Framework to version 4.5.1 (clr.dll: 4.0.30319.18444).
Here's how we identified the cause:
Searches on the web turned up an entry in an MSDN forum: http://social.msdn.microsoft.com/Forums/pt-BR/f928f3cc-8a06-48be-9ed6-e3772bcc32e8/windows-7-x64-com-server-ole32dll-threads-are-not-cleaned-up-after-they-end-causing-com-client?forum=vcmfcatl
The OP there described receiving the HRESULT RPC_X_BAD_STUB_DATA (0x800706f7) from CoCreateInstanceEx(CLSCTX_LOCAL_SERVER) after running a COM server with an interop app for some length of time (a month or so). He tracked the issue down to a thread resource leak that is observable indirectly via an incrementing variable inside ole32.dll, EventPoolEntry::s_initState, which causes CCI to fail once its value reaches 0xbfff...
An inspection of EventPoolEntry::s_initState in our faulty installations revealed that its value started out at approx. 0x8000 after a restart and then constantly gained between 100 and 200+ per hour with the app running under normal load. As soon as s_initState hit 0xbfff, the app failed with all the symptoms described in our original question. The OP in the MSDN forum suspected a COM thread-local resource leak as he observed asymmetrical calls to thread initialization and thread cleanup - 5 x init vs. 3 x cleanup.
By automatically tracing the value of s_initState over the course of several days we were able to demonstrate that updating the .NET framework to 4.5.1 from the original 4.0 completely eliminates the leak.
