How to get Java stacks when the JVM can't reach a safepoint

We recently had a situation where one of our production JVMs would randomly freeze. The Java process was burning CPU, but all visible activity would cease: no log output, nothing written to the GC log, no response to any network request, etc. The process would persist in this state until restarted.
It turned out that the org.mozilla.javascript.DToA class, when invoked on certain inputs, will get confused and call BigInteger.pow with enormous values (e.g. 5^2147483647), which triggers the JVM freeze. My guess is that some large loop, perhaps in java.math.BigInteger.multiplyToLen, was JIT'ed without a safepoint check inside the loop. The next time the JVM needed to pause for garbage collection, it would freeze, because the thread running the BigInteger code wouldn't be reaching a safepoint for a very long time.
My question: in the future, how can I diagnose a safepoint problem like this? kill -3 didn't produce any output; I presume it relies on safepoints to generate accurate stacks. Is there any production-safe tool which can extract stacks from a running JVM without waiting for a safepoint? (In this case, I got lucky and managed to grab a set of stack traces just after BigInteger.pow was invoked, but before it worked its way up to a sufficiently large input to completely wedge the JVM. Without that stroke of luck, I'm not sure how we would ever have diagnosed the problem.)
Edit: the following code illustrates the problem.
// Spawn a background thread to compute an enormous number.
new Thread() { @Override public void run() {
    try {
        Thread.sleep(5000);
    } catch (InterruptedException ex) {
    }
    BigInteger.valueOf(5).pow(100000000);
}}.start();

// Loop, allocating memory and periodically logging progress, to illustrate GC pause times.
byte[] b;
for (int outer = 0; ; outer++) {
    long startMs = System.currentTimeMillis();
    for (int inner = 0; inner < 100000; inner++) {
        b = new byte[1000];
    }
    System.out.println("Iteration " + outer + " took " + (System.currentTimeMillis() - startMs) + " ms");
}
This launches a background thread which waits 5 seconds and then starts an enormous BigInteger computation. In the foreground, it then repeatedly allocates a series of 100,000 1K blocks, logging the elapsed time for each 100MB series. During the 5 second period, each 100MB series runs in about 20 milliseconds on my MacBook Pro. Once the BigInteger computation begins, we begin to see long pauses interleaved. In one test, the pauses were successively 175ms, 997ms, 2927ms, 4222ms, and 22617ms (at which point I aborted the test). This is consistent with BigInteger.pow() invoking a series of ever-larger multiply operations, each taking successively longer to reach a safepoint.
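(A side note on observing this: HotSpot has safepoint diagnostics that should make such a stall visible; the flag names below are from the JDK 7/8 era and are offered only as an untested suggestion, with Tests standing in for the class above.)
java -XX:+PrintGCApplicationStoppedTime -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 Tests
PrintGCApplicationStoppedTime reports how long the application was stopped, including time spent waiting for threads to reach a safepoint, and PrintSafepointStatistics breaks each safepoint down into spin/block/sync phases, so a huge "sync" value points at a thread that is failing to reach its safepoint.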

Your problem interested me very much. You were right about the JIT. First I tried playing with different GC types, but that had no effect. Then I disabled the JIT entirely, and everything worked fine:
java -Djava.compiler=NONE Tests
Then I printed out the JIT compilations:
java -XX:+PrintCompilation Tests
I noticed that the problem starts after certain methods in the BigInteger class are compiled, so I excluded methods one by one from compilation and finally found the culprit:
java -XX:CompileCommand=exclude,java/math/BigInteger,multiplyToLen -XX:+PrintCompilation Tests
For large arrays this method can run for a long time, and the problem may indeed be with safepoints: for some reason they are not inserted into this loop, even though they should be present even in compiled code. It looks like a bug. The next step would be to analyze the generated assembly code, which I have not done yet.
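For anyone who wants to take that next step, HotSpot can print the JIT-generated assembly for just this method. This is only a sketch: it assumes the hsdis disassembler plugin is installed, and reuses the same Tests class as above.
java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,java/math/BigInteger,multiplyToLen Tests
In the disassembly you can then check whether the compiled loop body contains a safepoint poll (on x86 it typically shows up as a read of the safepoint polling page) or whether the poll only appears after the loop exits.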

It's not a bug; it's a performance feature. The JVM eliminates the safepoint check from counted loops, making them run faster. It assumes that either
you care about stop-the-world pauses and don't have extremely long loops,
or you have extremely long loops but are fine with safepoints being reached only eventually.
If neither fits your case, the optimization can be switched off with this flag: -XX:+UseCountedLoopSafepoints
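For the test program above that would look something like the following (the flag is available in HotSpot from around JDK 8; treat this as a sketch rather than a verified fix):
java -XX:+UseCountedLoopSafepoints Tests
With the flag enabled, the ever-growing pauses in the reproduction should disappear, at the cost of slightly slower counted loops.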
As for the title question: you can still stop and explore the program with gdb, but the stack traces won't be as nice.
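A rough sketch of the gdb approach (pid is a placeholder; only native frames are shown, so Java method names won't be symbolized, but you can usually tell that a thread is stuck inside JIT-compiled code or a runtime stub):
gdb -p <pid>
(gdb) thread apply all bt
(gdb) detach
Note that attaching with gdb stops the whole process while you inspect it.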

Perhaps that's what the "-F" option of jstack is good for:
OPTIONS
-F
Force a stack dump when 'jstack [-l] pid' does not respond.
I always wondered when and why that could help.
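A minimal example of the option, for reference (pid is a placeholder):
jstack -F <pid>
As far as I understand, the -F path goes through the HotSpot serviceability agent, which attaches to the process from the outside and reconstructs stacks by reading its memory, so it does not need the threads to reach a safepoint; the trade-offs are that the output can be less detailed and the target is suspended while the dump is taken.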

Related

FPS drops when using .NET's ThreadPool

I have asked about this before, but didn't provide code because I didn't have an easy way to do so. However now I've started a new project in Unity and tried to replicate the behaviour without all the unnecessary baggage attached.
So this is my current setup:
public class Main : MonoBehaviour
{
    public GameObject calculatorPrefab;

    void Start()
    {
        for (int i = 0; i < 10000; i++)
        {
            Instantiate(calculatorPrefab);
        }
    }
}

public class Calculator : MonoBehaviour
{
    void Start()
    {
        ThreadPool.QueueUserWorkItem(DoCalculations);
    }

    void DoCalculations(object o)
    {
        // Just doing some pointless calculations so the thread actually has something to do.
        float result = 0;
        for (int i = 0; i < 1000; i++)
        {
            // Note that the loop count doesn't seem to matter at all, other than taking longer.
            for (int i2 = 0; i2 < 1000; i2++)
            {
                result = i * i2 * Mathf.Sqrt(i * i2 + 59);
            }
        }
    }
}
Both scripts are attached to GameObjects. The 'Main' script is on a GameObject that's placed in the scene and is supposed to create a bunch of other GameObjects at start up, which then in turn queue some random calculations for the ThreadPool. Obviously this produces a fairly big CPU spike at start up, but that's not the problem. The problem is that the main thread seems to be blocked by this. In other words, it produces horrible fps. Why is that? Isn't it supposed to run in the background? Isn't the whole point of this not to make the main thread unresponsive?
I'm really struggling to figure out what I'm doing wrong, because as far as I see it, it doesn't get much simpler than this.
On the first frame you instantiate 10,000 prefabs. That is quite a load for a single frame. Then each of those objects queues a work item on the ThreadPool. That is quite a number of tasks, and I am sure you are running into some upfront initialization costs.
The background task is not that complex. I use background tasks for really long-running operations, for instance web calls and long-running calculations. I don't think your task really fits; in other words, the upfront cost exceeds the cost of running your calculation.
Try using a coroutine instead to break up your calculations and instantiations. I think that is a better solution for this particular background task.
Edit: Ran some tests per the comments below.
10k instantiates took on average (median) 104 milliseconds. The editor had poor framerate and used about 15% of my i7's CPU capacity.
10k QueueUserWorkItem calls took on average (median) 23 milliseconds. The editor locked up for multiple seconds. My CPU sat at a wonderful 99% capacity.
Conclusion
Queuing the work items has some cost, but not a lot. The problems are mainly with your Instantiate calls. That, and why are we queuing so many work items for such a simple calculation?
I see the following problems with your code.
You are creating far too many background jobs at once, 10,000 to be precise. .NET won't run them all concurrently, but queueing them all at once is still perhaps not the best way to go. On my machine (8 logical cores) the initial max worker count reported by ThreadPool.GetMaxThreads() was 1023.
Each job is rather expensive. The Sqrt calculation is not cheap, so it's no wonder the work takes so long.
Unity has methods for updating and methods for drawing. The problem here is that your jobs are ongoing and thus drag down everything, including updating, drawing, and everything in between, rather than the computation just happening during Update().
Taking your code and just running it in a stand-alone .NET app, it took 15 seconds to complete, maxing out all 8 of my cores.
However, changing
result = i * i2 * Mathf.Sqrt(i * i2 + 59)
...to:
result = i * i2 * i * i2 + 59;
...also maxed out all 8 of my cores as before, but this time took 6 seconds.
You might ask, "well, you took away the Sqrt, what is your point?" My point is that I don't believe you realise how intensive a Sqrt call is, particularly given this statement:
And it's still terrible. I even reduced the amount of objects being created from 10000 to 100, while increasing the loop count so it still takes a while. No real difference
Furthermore, scheduling so many jobs, regardless of tech, purely to update game objects won't scale. Game designers update in batches.
My suggestion:
Design tip
Generally, when there are a lot of calculations to be performed for many objects, instead of doing them all in one frame, group them and spread them out over time. So for 10,000 objects, maybe use a batch size of 1,000 or 100? (Source: Cities: Skylines.)
To read more, see Game Engine Architecture.

Process vs Thread: looking for the best explanation, with a C# example

Apologies for posting this question here; I have read a few similar threads, but things are still not clear to me.
As we know, both processes and threads are independent sequences of execution. The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces. (quoted from this answer)
The explanation above is not enough for me to visualize what is actually happening. It would be better if someone could explain, with examples, what a process is and how it differs from a thread.
Suppose I start MS Paint or some accounting program. Can we say that the accounting program is a process? I guess not; an accounting app may have multiple processes, and each process can start multiple threads.
I want to visualize which part can be called a process when we run an application. So please explain and guide me with an example for better visualization, and also explain how a process and a thread are not the same. Thanks.
Suppose I start MS Paint or some accounting program. Can we say that the accounting program is a process?
Yes. Or rather, the currently running instance of it is.
I guess not; an accounting app may have multiple processes, and each process can start multiple threads.
It is possible for a process to start another process, but it is relatively unusual with windowed software.
The process is a given executable; a windowed application, a console application, and a background application would each involve a running process.
Process refers to the space in which the application runs. With a simple process like NotePad if you open it twice so that you have two NotePad windows open, you have two NotePad processes. (This is also true of more complicated processes, but note that some do their own work to keep things down to one, so e.g. if you have Firefox open and you run Firefox again there will briefly be two Firefox processes but the second one will tell the first to open a new window before exiting and the number of processes returns to one; having a single process makes communication within that application simpler for reasons we'll get to now).
Now each process will have to have at least one thread. This thread contains information about just what it is trying to do (typically in a stack, though that is certainly not the only possible approach). Consider this simple C# program:
using System;

class Program
{
    static int DoAdd(int a, int b)
    {
        return a + b;
    }

    static void Main()
    {
        int x = 2;
        int y = 3;
        int z = DoAdd(x, y);
        Console.WriteLine(z);
    }
}
With this simple program, first 2 and 3 are stored in places on the stack (corresponding to the labels x and y). Then they are pushed onto the stack again and the thread moves into DoAdd. In DoAdd they are popped and added, and the result is pushed onto the stack. Then that result is stored on the stack (corresponding to the label z). Then it is pushed again and the thread moves into Console.WriteLine. That does its thing and the thread moves back to Main. Then it leaves and the thread dies. As it is the only foreground thread running, its death causes the process to end as well.
(I'm simplifying here, and I don't think there's a need to nitpick all of those simplifications right now; I'm just presenting a reasonable mental model).
There can be more than one thread. For example:
using System;
using System.Threading;

class Program
{
    static int DoAdd(int a, int b)
    {
        return a + b;
    }

    static void PrintTwoMore(object num)
    {
        Thread.Sleep(new Random().Next(0, 500));
        Console.WriteLine(DoAdd(2, (int)num));
    }

    static void Main()
    {
        for (int i = 0; i != 10; ++i)
            new Thread(PrintTwoMore).Start(i);
    }
}
Here the first thread creates ten more threads. Each of these pauses for a different length of time (just to demonstrate that they are independent) and then does a task similar to the one in the first example's only thread.
The first thread dies upon creating the 10th new thread and setting it going. The last of these 10 threads to be running will be the last foreground thread and so when it dies so does the process.
Each of these threads can "see" the same methods and can "see" any data that is stored in the application, though there are limits on how likely they are to stamp over each other that I won't get into now.
A process can also start a new process and communicate with it. This is very common in command-line programs, but less so in windowed programs. In the case of windowed programs it's also more common on *nix than on Windows.
One example of this would be when Geany does a find-in-directory operation. Geany doesn't have its own find-in-directory functionality, but rather runs the program grep and then interprets the results. So we start with one process (Geany) with its own threads running, then one of those threads causes the grep program to run, which means we've also got a grep process running with its threads. Geany's threads and grep's threads cannot communicate with each other as easily as threads in the same process can, but when grep outputs results the thread in Geany can read that output and use it to display those results.

Why would multiple executions of IronPython code cause OutOfMemoryException?

I have some IronPython code in a C# class that has the following two lines in a function:
void test(MyType request) {
    compiledCode.DefaultScope.SetVariable("target", request);
    compiledCode.Execute();
}
I was writing some code to stress-test this/performance test it. The request object that I'm passing in is a List<MyType> with one million MyType's. So I iterate over this list:
foreach (MyType thing in myThings) {
    test(thing);
}
And that works fine - runs in about 6 seconds. When I run
for (int x = 0; x < 1000; x++) {
    foreach (MyType thing in myThings) {
        test(thing);
    }
}
On the 19th iteration of the loop I get an OutOfMemoryException.
I tried doing Python.CreateEngine(AppDomain.CurrentDomain) instead, which didn't seem to have any effect, and I tried
...
compiledCode.Execute();
compiledCode.DefaultScope.RemoveVariable("target");
Hoping that somehow IronPython was keeping the reference around. That roughly doubled the amount of time it took, but it still blew up at the same point.
Why is my code causing a memory leak, and how do I fix it?
IronPython has a memory leak somewhere that prevents objects from getting GC'd. (let this be a lesson, kids: managed languages are not immune from memory issues. They just have different ones.) I haven't been able to figure out where, though.
You could create an issue and attach a self-contained reproduction, but even better would be running your app against a memory profiler and see what is holding onto your MyType objects.
So, it turns out I wasn't thinking at all.
The target object contains a StringBuilder that I'm adding lines to. About 10 lines per iteration, so apparently I can hold about 190 million lines of text in a StringBuilder before I run out of memory ;)

context switch measure time

I wonder if any of you know how to use the function get_timer() to measure the time of a context switch.
How do I find the average?
When should I display it?
Could someone help me out with this?
Is there any expert who knows this?
One fairly straightforward way would be to have two threads communicating through a pipe. One thread would do (pseudo-code):
for (n = 1000; n--;) {
    now = clock_gettime(CLOCK_MONOTONIC_RAW);
    write(pipe, now);
    sleep(1msec); // to make sure that the other thread blocks again on pipe read
}
Another thread would do:
context_switch_times[1000];
for (n = 1000; n--;) {
    time = read(pipe);
    now = clock_gettime(CLOCK_MONOTONIC_RAW);
    context_switch_times[n] = now - time;
}
That is, it would measure the time duration between when the data was written into the pipe by one thread and the time when the other thread woke up and read that data. A histogram of context_switch_times array would show the distribution of context switch times.
The times would include the overhead of the pipe read and write and of getting the time; however, it gives a good sense of how big the context switch times are.
In the past I did a similar test using stock Fedora 13 kernel and real-time FIFO threads. The minimum context switch times I got were around 4-5 usec.
I don't think we can accurately measure this time from user space, since in the kernel you never know when your process will be picked up after its time slice expires. So whatever you get in user space includes scheduling delays as well. You can get a close measurement from user space, but not always an exact one. Even a jiffy of delay matters.
I believe LTTng can be used to capture detailed traces of context switch timings, among other things.

What can make a program run slower when using more threads?

This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1++)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                histogram[bin_index(i1, i2, i3, i4)] += 1;
bin_index is a completely deterministic function of its arguments which, for purposes of this question, does not use or change any shared state - in other words, it is manifestly reentrant.
I first wrote this program to use a single thread. Then I converted it to use multiple threads, such that thread n runs all iterations of the outer loop where i1 % nthreads == n. So the function that runs in each thread looks like
for (int i1 = n; i1 < N; i1 += nthreads)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                thread_local_histogram[bin_index(i1, i2, i3, i4)] += 1;
and all the thread_local_histograms are added up in the main thread at the end.
Here's the strange thing: when I run the program with just 1 thread for some particular size of the calculation, it takes about 6 seconds. When I run it with 2 or 3 threads, doing exactly the same calculation, it takes about 9 seconds. Why is that? I would expect that using 2 threads would be faster than 1 thread since I have a dual-core CPU. The program does not use any mutexes or other synchronization primitives so two threads should be able to run in parallel.
For reference: typical output from time (this is on Linux) for one thread:
real 0m5.968s
user 0m5.856s
sys 0m0.064s
and two threads:
real 0m9.128s
user 0m10.129s
sys 0m6.576s
The code is at http://static.ellipsix.net/ext-tmp/distintegral.ccs
P.S. I know there are libraries designed for exactly this kind of thing that probably could have better performance, but that's what my last question was about so I don't need to hear those suggestions again. (Plus I wanted to use pthreads as a learning experience.)
To avoid further comments on this: when I wrote my reply, the questioner hadn't yet posted a link to his source, so I could not tailor my reply to his specific issues. I was only answering the general question of what "can" cause such an issue; I never said that this would necessarily apply to his case. When he posted a link to his source, I wrote another reply that focuses only on his specific issue (which is caused by the use of the random() function, as I explain in my other reply). However, since the question of this post is still "What can make a program run slower when using more threads?" and not "What makes my very specific application run slower?", I see no need to change my rather general reply either (general question -> general response, specific question -> specific response).
1) Cache Poisoning
All threads access the same array, which is a block of memory. Each core has its own cache to speed up memory access. Since they don't just read from the array but also change the content, the content is changed actually in the cache only, not in real memory (at least not immediately). The problem is that the other thread on the other core may have overlapping parts of memory cached. If now core 1 changes the value in the cache, it must tell core 2 that this value has just changed. It does so by invalidating the cache content on core 2 and core 2 needs to re-read the data from memory, which slows processing down. Cache poisoning can only happen on multi-core or multi-CPU machines. If you just have one CPU with one core this is no problem. So to find out if that is your issue or not, just disable one core (most OSes will allow you to do that) and repeat the test. If it is now almost equally fast, that was your problem.
2) Preventing Memory Bursts
Memory is read fastest if read sequentially in bursts, just like when files are read from HD. Addressing a certain point in memory is actually awfully slow (just like the "seek time" on a HD), even if your PC has the best memory on the market. However, once this point has been addressed, sequential reads are fast. The first addressing goes by sending a row index and a column index and always having waiting times in between before the first data can be accessed. Once this data is there, the CPU starts bursting. While the data is still on the way it sends already the request for the next burst. As long as it is keeping up the burst (by always sending "Next line please" requests), the RAM will continue to pump out data as fast as it can (and this is actually quite fast!). Bursting only works if data is read sequentially and only if the memory addresses grow upwards (AFAIK you cannot burst from high to low addresses). If now two threads run at the same time and both keep reading/writing memory, however both from completely different memory addresses, each time thread 2 needs to read/write data, it must interrupt a possible burst of thread 1 and the other way round. This issue gets worse if you have even more threads and this issue is also an issue on a system that has only one single-core CPU.
BTW, running more threads than you have cores will never make your process any faster (as you mentioned 3 threads); it will rather slow it down (thread context switches have side effects that reduce processing throughput) - that is, unless you run more threads because some threads are sleeping or blocking on certain events and thus cannot actively process any data. In that case it may make sense to run more threads than you have cores.
Everything I said so far in my other reply still holds true in general, as your question was what "can"... However, now that I've seen your actual code, my first bet would be that your usage of the random() function slows everything down.
See, random keeps a global variable in memory that stores the last random value calculated there. Each time you call random() (and you are calling it twice within a single function) it reads the value of this global variable, performs a calculation (that is not so fast; random() alone is a slow function) and writes the result back there before returning it. This global variable is not per thread, it is shared among all threads. So what I wrote regarding cache poisoning applies here all the time (even if you avoided it for the array by having separated arrays per thread; this was very clever of you!). This value is constantly invalidated in the cache of either core and must be re-fetched from memory. However if you only have a single thread, nothing like that happens, this variable never leaves cache after it has been initially read, since it's permanently accessed again and again and again.
Further to make things even worse, glibc has a thread-safe version of random() - I just verified that by looking at the source. While this seems to be a good idea in practice, it means that each random() call will cause a mutex to be locked, memory to be accessed, and a mutex to be unlocked. Thus two threads calling random exactly the same moment will cause one thread to be blocked for a couple of CPU cycles. This is implementation specific, though, as AFAIK it is not required that random() is thread safe. Most standard lib functions are not required to be thread-safe, since the C standard is not even aware of the concept of threads in the first place. When they are not calling it the same moment, the mutex will have no influence on speed (as even a single threaded app must lock/unlock the mutex), but then cache poisoning will apply again.
You could pre-build an array of random numbers for every thread, containing as many random numbers as each thread needs. Create it in the main thread before spawning the threads and add a reference to it to the structure pointer you hand over to every thread. Then get the random numbers from there.
Or just implement your own random number generator, one that works with per-thread memory for holding its state, if you don't need the "best" random numbers on the planet - it might even be faster than the system's built-in generator.
If a Linux-only solution works for you, you can use random_r. It allows you to pass the state with every call; just use a unique state object per thread. However, this function is a glibc extension and is most likely not supported by other platforms (it is part of neither the C standard nor the POSIX standard AFAIK - the function does not exist on Mac OS X, for example, and it may not exist on Solaris or FreeBSD either).
Writing your own random number generator is actually not that hard. If you need real random numbers, you shouldn't use random() in the first place; random() only creates pseudo-random numbers (numbers that look random, but are predictable if you know the generator's internal state). Here's the code for one that produces good uint32 random numbers:
static uint32_t getRandom(uint32_t * m_z, uint32_t * m_w)
{
    *m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
    *m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
    return (*m_z << 16) + *m_w;
}
It's important to "seed" m_z and m_w in a proper way somehow, otherwise the results are not random at all. The seed value itself should already be random, but here you could use the system random number generator.
uint32_t m_z = random();
uint32_t m_w = random();
uint32_t nextRandom;

for (...) {
    nextRandom = getRandom(&m_z, &m_w);
    // ...
}
This way every thread only needs to call random() twice and then uses your own generator. BTW, if you need double randoms (that are between 0 and 1), the function above can be easily wrapped for that:
static double getRandomDouble(uint32_t * m_z, uint32_t * m_w)
{
    // The magic number below is 1/(2^32 + 2).
    // The result is strictly between 0 and 1.
    return (getRandom(m_z, m_w) + 1) * 2.328306435454494e-10;
}
Try to make this change in your code and let me know how the benchmark results are :-)
You are seeing cache line bouncing. I'm really surprised that you don't get wrong results, due to race conditions on the histogram buckets.
One possibility is that the time taken to create the threads exceeds the savings gained by using threads. I would think that N is not very large, if the elapsed time is only 6 seconds for an O(N^4) operation.
There's also no guarantee that multiple threads will run on different cores or CPUs. I'm not sure what the default thread affinity is with Linux - it may be that both threads run on a single core which would negate the benefits of a CPU-intensive piece of code such as this.
This article details default thread affinity and how to change your code to ensure threads run on specific cores.
Even though threads don't access the same elements of the array at the same time, the whole array may sit in a few memory pages. When one core/processor writes to that page, the corresponding cache line has to be invalidated on all the other processors.
Avoid having many threads working over the same memory space. Allocate separate data for each thread to work upon, then join them together when the calculation finishes.
Off the top of my head:
Context switches
Resource contention
CPU contention (if they aren't getting split to multiple CPUs).
Cache thrashing
David,
Are you sure you run a kernel that supports multiple processors? If only one processor is utilized in your system, spawning additional CPU-intensive threads will slow down your program.
And are you sure that the threading support on your system actually utilizes multiple processors? Does top, for example, show that both cores of your processor are utilized when you run your program?
