How can I find out who calls System.gc() in a Spark Streaming program? - garbage-collection

The GC time is too long in my Spark Streaming program. In the GC log I found that someone had called System.gc(). I do not call System.gc() in my own code, so the caller must be one of the APIs I use.
I added -XX:+DisableExplicitGC to the JVM options and that fixed the problem. However, I still want to know who calls System.gc().
I have tried a couple of approaches:
Using jstack: but the explicit GC is infrequent, so it is hard to catch a thread dump at the moment the method is called.
Adding a trigger in JProfiler that takes a thread dump whenever java.lang.System.gc() is invoked: but it does not seem to work.
How can I find out who calls System.gc() in a Spark Streaming program?

You will not catch System.gc with jstack, because during stop-the-world pauses the JVM does not accept connections from Dynamic Attach tools, including jstack, jmap, jcmd and similar.
It's possible to trace System.gc callers with async-profiler:
Start profiling beforehand:
$ profiler.sh start -e java.lang.System.gc <pid>
After one or more System.gc calls have happened, stop profiling and print the stack traces:
$ profiler.sh stop -o traces <pid>
Example output:
--- Execution profile ---
Total samples : 6
Frame buffer usage : 0.0007%
--- 4 calls (66.67%), 4 samples
[ 0] java.lang.System.gc
[ 1] java.nio.Bits.reserveMemory
[ 2] java.nio.DirectByteBuffer.<init>
[ 3] java.nio.ByteBuffer.allocateDirect
[ 4] Allocate.main
--- 2 calls (33.33%), 2 samples
[ 0] java.lang.System.gc
[ 1] sun.misc.GC$Daemon.run
In the example above, System.gc was called 6 times from two places. Both are typical situations in which the JDK internally forces a garbage collection.
The first is java.nio.Bits.reserveMemory. When there is not enough free memory to allocate a new direct ByteBuffer (because of the -XX:MaxDirectMemorySize limit), the JDK forces a full GC to reclaim unreachable direct ByteBuffers.
The second is the GC Daemon thread, which is run periodically by the Java RMI runtime. For example, if you use remote JMX, a periodic GC is automatically enabled once per hour. The interval can be tuned with the -Dsun.rmi.dgc.client.gcInterval system property.
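As a minimal sketch of the first case (a hypothetical demo class, not code from the question): every ByteBuffer.allocateDirect call reserves off-heap memory through java.nio.Bits.reserveMemory, and it is only when that reservation would exceed -XX:MaxDirectMemorySize that reserveMemory falls back to System.gc() before retrying.

```java
import java.nio.ByteBuffer;

public class DirectAllocDemo {
    public static void main(String[] args) {
        // This allocation goes through java.nio.Bits.reserveMemory.
        // Under direct-memory pressure (the -XX:MaxDirectMemorySize budget
        // is exhausted), reserveMemory calls System.gc() to reclaim
        // unreachable direct buffers before retrying -- producing exactly
        // the first stack trace shown above.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
        System.out.println("capacity=" + buf.capacity());
    }
}
```

Allocating many such buffers under a small -XX:MaxDirectMemorySize is what pushes reserveMemory into the System.gc() branch; a single allocation like this one normally succeeds without forcing a collection.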

Related

windbg output from !thread?

Running windbg on a full memory dump. The !process command generates thread information (see below). Frequently the THREAD line is followed by multiple event-like entries, such as "fffffa800a0c0060 SynchronizationTimer". What do they signify? Are they objects the thread owns, or objects it is waiting on?
THREAD fffffa8005718b50 Cid 16c0.1660 Teb: 00000000fffd8000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Alertable
fffffa800a0c0060 SynchronizationTimer
fffffa800a7c1060 SynchronizationTimer
<etc...>
fffffa8007a9f4e0 SynchronizationEvent
fffffa800ae48b20 SynchronizationTimer
Not impersonating
DeviceMap fffff8a01480f1e0
A thread doesn't really own objects, so it has to be the latter: they are the objects the thread is waiting on.
The documentation doesn't say this, but it is mentioned, for example, here: How can I work out what events are being waited for with WinDBG in a kernel debug session

MPI_REDUCE causing memory leak

I have recently encountered a weird behavior. If I run the following code on my machine (using the most recent version of Cygwin, Open MPI version 1.8.6), I get linearly growing memory usage that quickly overwhelms my PC.
program memoryTest
  use mpi
  implicit none
  integer :: ierror, errorStatus ! error codes
  integer :: my_rank             ! rank of process
  integer :: p                   ! number of processes
  integer :: i, a, b

  call MPI_Init(ierror)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
  call MPI_Comm_size(MPI_COMM_WORLD, p, ierror)
  b = 0
  do i = 1, 10000000
    a = 1*my_rank
    call MPI_REDUCE(a, b, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, errorStatus)
  end do
  call MPI_Finalize(ierror)
  stop
end program memoryTest
Any idea what the problem might be? The code looks fine to my beginner's eyes. The compilation line is
mpif90 -O2 -o memoryTest.exe memoryTest.f90
This has been discussed in a related thread here.
The problem is that the root process needs to receive data from the other processes and perform the reduction, while the other processes only need to send their data to the root. So the root runs slower, and it can be overwhelmed by the number of incoming messages it has to buffer. If you insert an MPI_BARRIER call after the MPI_REDUCE call, the code should run without a problem.
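In terms of the question's own code, the fix is a one-line addition inside the loop (same variable names as above; a sketch, not tested on the asker's setup):

```fortran
do i = 1, 10000000
  a = 1*my_rank
  call MPI_REDUCE(a, b, 1, MPI_INTEGER, MPI_MAX, 0, MPI_COMM_WORLD, errorStatus)
  ! The barrier keeps the non-root ranks from racing ahead of the root,
  ! bounding the number of unexpected messages the root must buffer.
  call MPI_BARRIER(MPI_COMM_WORLD, ierror)
end do
```

A barrier on every iteration is the simplest throttle; a barrier every few thousand iterations may be enough and costs less in synchronization overhead.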
The relevant part of the MPI specification says: "Collective operations can (but are not required to) complete as soon as the caller's participation in the collective communication is finished. A blocking operation is complete as soon as the call returns. A nonblocking (immediate) call requires a separate completion call (cf. Section 3.7). The completion of a collective operation indicates that the caller is free to modify locations in the communication buffer. It does not indicate that other processes in the group have completed or even started the operation (unless otherwise implied by the description of the operation). Thus, a collective communication operation may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier operation."
To add a bit more support for macelee's answer: if you run this program to completion under MPICH with MPICH's internal memory leak tracing/reporting turned on, you see no leaks. Furthermore, valgrind's leak check reports:
==12866== HEAP SUMMARY:
==12866== in use at exit: 0 bytes in 0 blocks
==12866== total heap usage: 20,001,601 allocs, 20,000,496 frees, 3,369,410,210 bytes allocated
==12866==
==12866== All heap blocks were freed -- no leaks are possible
==12866==

Linux process execution history

I have a multi-threaded (three threads) application on Linux 3.4.0 with the RT7 (realtime) patch. The application needs realtime execution with ~20 ms tolerance. The application runs with realtime behavior for a while (1 min to 50 min); then I find that while one of the threads is doing some processing, a context switch happens and execution comes back to the thread about 80 to 500 ms later. I need to find out what process takes away the time slice. All my threads together consume ~5% CPU time. Is there any tool to see process execution history with timestamps?
Thanks,
Hakim
Consider using SystemTap. It is a dynamic instrumentation engine inspired by DTrace. It dynamically patches the kernel (so it needs the kernel's debug information).
For example, your task may be achieved with this script:
probe scheduler.cpu_on, scheduler.cpu_off {
  if (pid() == target()) {
    printf("%ld %s\n", gettimeofday_us(), pn());
  }
}
Use the -c option to attach this script to a command, or -x to attach to a running PID:
root@lkdevel:~# stap -c 'dd if=/dev/zero of=/dev/null count=1' ./schedtrace.stp
...
1423701880670656 scheduler.cpu_on
1423701880673498 scheduler.cpu_off
1423701880674208 scheduler.cpu_on
1423701880689407 scheduler.cpu_off
1423701880689829 scheduler.cpu_on
...

JRuby - How to start the garbage collector?

I fired up my JRuby irb console and typed:
irb(main):037:0* GC.enable
(irb):37 warning: GC.enable does nothing on JRuby
=> true
irb(main):038:0> GC.start
=> nil
irb(main):039:0>
How can I manually enable or start the JVM garbage collector during a program?
I ask because I have a program which needs to generate about 500 MB of test data and save it in MySQL. The program uses about 5 levels of nested loops, and it crashes with a JVM heap-memory exception after generating about 100 MB of test data, because there is no more heap memory. I would like to let the garbage collector run after every iteration of the outer loop so that all the orphaned objects created in the inner loops can be cleaned up.
The exact answer to your question would be:
require 'java'
java_import 'java.lang.System'
# ...
System.gc()
though, bear in mind that even though the JVM usually does run the GC when asked, it may or may not actually do so; this is very dependent on the JVM implementation. It can also be quite a hit on performance.
A better answer is obviously to ensure that at the end of the nested loop, no reference is held to the test data you are generating, so that it can indeed be reclaimed by the GC later on. Example:
class Foo; end
sleep(5)
ary = []
100_000.times { 100_000.times{ ary << Foo.new }; puts 'Done'; ary = [] }
If you run this with jruby -J-verbose:gc foo.rb, you should see the GC regularly reclaiming the objects; this is also quite clear in JVisualVM (the sleep in the example is to give you some time to connect to the JRuby process in JVisualVM).
Lastly, you can increase the heap memory by adding the following flag: -J-Xmx256m; see the JRuby wiki for more details.
Edit: Coincidentally, here is a mindmap on GC tuning recently presented by Mario Camou at Madrid DevOps re-posted by Nick Sieger.
It's not possible, because GC is run automatically by the JVM. Make sure you create objects only when they are required. Avoid creating class-level objects, try to find out which objects take the most memory, and create them only when needed.

640 enterprise library caching threads - how?

We have an application that is undergoing performance testing. Today I decided to take a dump of w3wp and load it in windbg to see what is going on under the covers. Imagine my surprise when I ran !threads and saw that there are 640 background threads, almost all of which show the following:
OS Thread Id: 0x1c38 (651)
Child-SP RetAddr Call Site
0000000023a9d290 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()
0000000023a9d2d0 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()
0000000023a9d330 000007fef727c978 Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()
0000000023a9d380 000007fef9001552 System.Threading.ExecutionContext.runTryCode(System.Object)
0000000023a9dc30 000007fef72f95fd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0000000023a9dc80 000007fef9001552 System.Threading.ThreadHelper.ThreadStart()
If I had to guess, I'd think that one of these threads is being spawned for each run of our app: we have 2 app servers, 20 concurrent users, and we ran the test approximately 30 times, so it's in the neighborhood.
Is this 'expected behavior', or have we perhaps implemented something improperly? The test ran hours ago, so I would have expected any timeouts to have occurred already.
Edit: Thank you all for your replies. It has been requested that more detail be shown about the call stack; here is the output of !mk from sosex.dll.
ESP RetAddr
00:U 0000000023a9cb38 00000000775f72ca ntdll!ZwWaitForMultipleObjects+0xa
01:U 0000000023a9cb40 00000000773cbc03 kernel32!WaitForMultipleObjectsEx+0x10b
02:U 0000000023a9cc50 000007fef8f5f595 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
03:U 0000000023a9ccf0 000007fef8f59f49 mscorwks!Thread::DoAppropriateAptStateWait+0x41
04:U 0000000023a9cd50 000007fef8e55b99 mscorwks!Thread::DoAppropriateWaitWorker+0x191
05:U 0000000023a9ce50 000007fef8e2efe8 mscorwks!Thread::DoAppropriateWait+0x5c
06:U 0000000023a9cec0 000007fef8f0dc7a mscorwks!CLREvent::WaitEx+0xbe
07:U 0000000023a9cf70 000007fef8fba72e mscorwks!Thread::Block+0x1e
08:U 0000000023a9cfa0 000007fef8e1996d mscorwks!SyncBlock::Wait+0x195
09:U 0000000023a9d0c0 000007fef9463d3f mscorwks!ObjectNative::WaitTimeout+0x12f
0a:M 0000000023a9d290 000007ff002321b3 *** ERROR: Module load completed but symbols could not be loaded for Microsoft.Practices.EnterpriseLibrary.Caching.DLL
Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()(+0x0 IL)(+0x11 Native)
0b:M 0000000023a9d2d0 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()(+0xf IL)(+0x18 Native)
0c:M 0000000023a9d330 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()(+0x9 IL)(+0x12 Native)
0d:M 0000000023a9d380 000007fef727c978 System.Threading.ExecutionContext.runTryCode(System.Object)(+0x18 IL)(+0x106 Native)
0e:U 0000000023a9d440 000007fef9001552 mscorwks!CallDescrWorker+0x82
0f:U 0000000023a9d490 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
10:U 0000000023a9d530 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
11:U 0000000023a9d790 000007fef8f0cbd2 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x12a
12:U 0000000023a9da20 000007fef945e572 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0x172
13:M 0000000023a9dc30 000007fef7261722 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)(+0x60 IL)(+0x51 Native)
14:M 0000000023a9dc80 000007fef72f95fd System.Threading.ThreadHelper.ThreadStart()(+0x8 IL)(+0x2a Native)
15:U 0000000023a9dcd0 000007fef9001552 mscorwks!CallDescrWorker+0x82
16:U 0000000023a9dd20 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
17:U 0000000023a9ddc0 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
18:U 0000000023a9e010 000007fef8f9ae8d mscorwks!ThreadNative::KickOffThread_Worker+0x191
19:U 0000000023a9e330 000007fef8f59374 mscorwks!TypeHandle::GetParent+0x5c
1a:U 0000000023a9e380 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
1b:U 0000000023a9e450 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
1c:U 0000000023a9e490 000007fef8e1c985 mscorwks!ILCodeStream::GetToken+0x25
1d:U 0000000023a9e4c0 000007fef8f594e1 mscorwks!Thread::DoADCallBack+0x145
1e:U 0000000023a9e630 000007fef8f59399 mscorwks!TypeHandle::GetParent+0x81
1f:U 0000000023a9e680 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
20:U 0000000023a9e750 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
21:U 0000000023a9e790 000007fef8e20e15 mscorwks!ThreadNative::KickOffThread+0x401
22:U 0000000023a9e7f0 000007fef8e20ae7 mscorwks!ThreadNative::KickOffThread+0xd3
23:U 0000000023a9e8d0 000007fef8f814fc mscorwks!Thread::intermediateThreadProc+0x78
24:U 0000000023a9f7a0 00000000773cbe3d kernel32!BaseThreadInitThunk+0xd
25:U 0000000023a9f7d0 00000000775d6a51 ntdll!RtlUserThreadStart+0x1d
Yes, the caching block has some issues with regard to the scavenger threads in older versions of Entlib, particularly if items are coming in faster than the scavenging settings let them go out.
This was completely rewritten in Entlib 5, so that now you'll never have more than two threads sitting in the caching block, regardless of the load, and usually it'll only be one.
Unfortunately, there's no easy tweak to change the behavior in earlier versions. The best you can do is change the cache settings so that each scavenge cleans out more items at a time, so that fewer scavenge requests need to be scheduled.
640 threads is very bad for performance. If they are all waiting for something, then I'd say it's a fair bet that you have a deadlock and they will never exit. If they are all running (not waiting), well, with 600+ threads on a 2- or 4-core processor, none of them will get enough time slices to get very far! ;>
If your app is set up with a main thread that waits on the thread handles to find out when the threads exit, and the background threads get caught up in a loop or in a wait state and never exit the thread proc, then the process and all of its threads will never exit.
Check your thread code to make sure that every threadproc has a clear path to exit the threadproc. It's bad form to write an infinite loop in a background thread on the assumption that the thread will be forcibly terminated when the process shuts down.
If the background thread code spins in a loop waiting for an event handle to signal, make sure that you have some way to signal that event so that the thread can perform a normal orderly exit. Otherwise, you need to write the background thread to wait on multiple events and unblock when any one of the events signals. One of those events can be the activity that the background thread is primarily interested in and the other can be a shutdown event.
From the names of things in the stack dump you posted, it would appear that the thread is waiting for something to appear in the ProducerConsumerQueue. Investigate how that queue object is supposed to be shut down, probably on the producer side, and whether shutting down the queue will automatically release all consumers that are waiting on that queue.
My guess is that either the queue is not being shut down correctly or shutting it down does not implicitly release the consumers that are waiting on it. If the latter case, you may need to pump a terminate message through the queue to wake up all the consumers waiting on that queue and tell them to break out of their wait loop and exit.
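As an illustration of that terminate-message (poison pill) pattern, here is a hypothetical Java analogue; the EnterpriseLibrary ProducerConsumerQueue is .NET code, so this sketches the idea rather than the actual API:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonPillDemo {
    private static final String POISON = "__shutdown__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String item = queue.take();   // blocks, like Dequeue() above
                    if (POISON.equals(item)) {
                        break;                    // orderly exit from the thread proc
                    }
                    System.out.println("got " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        queue.put("work-1");
        queue.put(POISON);  // wakes the blocked consumer and tells it to exit
        consumer.join();    // returns because the consumer left its loop
        System.out.println("consumer exited");
    }
}
```

With one pill per consumer thread (or a pill that each consumer re-enqueues before exiting), the same idea cleanly shuts down a whole pool of waiting consumers.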
You have a major issue. Every thread occupies 1 MB of stack, and there is a significant cost to context-switching every thread in and out. It gets even worse with managed code, because every time the GC runs it has to walk each thread's stack looking for roots, and when those stacks have been paged out to disk, reading them back in is expensive, which adds to the performance problem.
Creating threads by hand is bad unless you know what you are doing; Jeffrey Richter has written about this in detail.
To solve the above issue, I would look at what these threads are blocked on, and also put a breakpoint on thread creation (for example, sxe ct in windbg).
Then I would rearchitect to avoid creating threads directly and use the thread pool instead.
It would have been nice to see some call stacks of these threads.
In Microsoft Enterprise Library 4.1, the BackgroundScheduler class creates a new thread each time an object is instantiated. This will be fixed in version 5.0. I do not know enough about this Microsoft library to advise you on how to avoid that behavior, but you may try the beta version: http://entlib.codeplex.com/wikipage?title=EntLib5%20Beta2
