I have a Java app running on Linux Mint. Every minute, the program shows a very noticeable slowdown: a pause of a consistent 3 to 4 seconds. When we run further instances of the same program, they also pause for 3 to 4 seconds each minute, with each program stopping on a different second of the minute.
Latest update:
After the last update (below), increasing the thread pool's thread count made the GUI problem go away. After running for around 40 hours we observed a thread leak in the Jetty HttpClient blocking GET (Request.send()) call. To explain the mechanics: a main thread runs every few minutes and uses an Executor to run an independent thread that calls the host with an HTTP GET, i.e. Jetty's HttpClient Request.send().
After about 40 hours of operation, there was a spike in the number of threads running in the HttpClient pool. So for 40 hours, the same threads ran fine. The working hypothesis is that around that time, one or more send() calls did not complete or time out and have not returned to the calling thread. Essentially, these threads are hung inside the Jetty client.
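For context, the pattern is roughly the sketch below. This is illustrative only: the class name, URL, pool sizes and schedule interval are made up, and it assumes the Jetty 9.x client API; the real code differs.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class PollerSketch {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        ExecutorService workers = Executors.newFixedThreadPool(2);
        ScheduledExecutorService manager = Executors.newSingleThreadScheduledExecutor();

        // The manager fires every few minutes and hands the blocking GET to an
        // independent worker thread, as described above.
        manager.scheduleAtFixedRate(() -> workers.submit(() -> {
            try {
                // Blocking call: this is the Request.send() that appears to hang.
                ContentResponse response = httpClient.newRequest("http://example.com/status").send();
                System.out.println("GET returned " + response.getStatus());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }), 0, 5, TimeUnit.MINUTES);
    }
}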
When watching each regular cycle in VisualVM we see the normal behaviour each cycle: some HttpClient threads fire up for the host GET, execute, and go away within a few seconds. Also on the monitor are about 10 threads belonging to the Jetty HttpClient thread pool that have now been 'present' for 10 hours.
The expectation is that there was some error in the underlying client or network processing. I am surprised there was no time-out exception or programming exception. There are some clear questions I can ask now:
What can happen inside HttpClient that could just hang a Request.send()?
What is the time-out on the call return? I would think there would still be absolute time-outs or checks for locking, etc. (no?)
Can the I/O system hang and leave the caller thread hanging, while Java obediently...
fires the manager thread at the scheduled time, then
the next Request.send() happens, and
new thread(s) from the pool run up for the next send (as appears to have happened),
while the earlier send() is stuck in limbo?
Can I limit or otherwise put a clean-up on these stuck threads? (One possible bounding approach is sketched below.)
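For illustration, the kind of bounding I mean is an explicit per-request timeout plus a caller-side wait limit. This is only a sketch assuming the Jetty 9.x API; the class name, pool and timeout values are made up, not the actual application code.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class BoundedGet {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    public void fetch(HttpClient httpClient, String url) throws Exception {
        Future<ContentResponse> future = workers.submit(() ->
                httpClient.newRequest(url)
                          .timeout(30, TimeUnit.SECONDS)      // absolute bound on the whole exchange
                          .idleTimeout(10, TimeUnit.SECONDS)  // bound on a connection that goes quiet
                          .send());                           // blocking call runs on the worker
        try {
            // Caller-side bound: never wait on the worker for longer than this.
            ContentResponse response = future.get(60, TimeUnit.SECONDS);
            System.out.println("GET returned " + response.getStatus());
        } catch (TimeoutException stuck) {
            // Interrupt the worker; whether this actually frees a send() that is
            // hung deep in I/O depends on where it is blocked, so treat it as best-effort.
            future.cancel(true);
        }
    }
}

Whether cancel(true) really frees a send() that is hung inside the client is exactly the part I am unsure about.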
This was happening before we increased the thread pool size; what's happened is that the 'blame' has become more focused on the problem area. We are also suspicious of the underlying system because we had lock-ups with Apache HttpClient too, again around the same (non-specific) time of day.
(prior update) ...
The pause behaviour observed is that the JavaFX GUI does not update/refresh. The display's clock (a text view) is driven by a setText() call that was logged during the freeze at two updates per second (that's new information): the clock does not update on Mint Linux, but it continues to update when running on Windows. To forestall repeated questions about GC, logs, probes, etc., the answer will be the same: we have run extensive diagnostics over weeks now. The issue is unmistakably a mix of the Linux JVM / Linux Mint / threads (per JavaFX). The other piece of new data is that increasing the thread-pool count by 2 appears to remove the freeze; further testing is needed to confirm that and tune the numbers. The question, though, is: what are the parameters that make the difference between the two platforms?
We have run several instances of the program on Windows for days with no pauses. When we run on a Mint Linux platform we see the freeze, and it is very consistent.
The program has several threads running on a schedule. One thread opens an HTTP socket to the internet. When we comment out that area, the pause vanishes. However, we don't see that behaviour on Windows. Experiments point to something specific to the Mint networking I/O subsystem, Linux scheduling, the Linux Java 8 JVM, or some interaction between them.
As you may guess, we are tearing our hair out on this one. For example, we turned off logging and the pause remained. We resumed logging and made just one call to the HTTP server: the pause still occurred every 60 seconds, on the same second count. This happens even when we do no other processing. We tried different HTTP libraries, etc. It seems very clear it is in the JVM or Linux.
Does anyone know of a way to resolve this?
Related
My main thread creates 8 worker threads (on a machine with a 4 core, 8 thread CPU), and then waits for them to complete with pthread_join(). The threads all exit successfully, and the pthread_join() successfully completes. However, I log the times that the threads exit and the time that pthread_join() completes for the last thread; the threads all exit essentially simultaneously (not surprising -- they are servicing a queue of work to be done), and the pthread_join() sometimes takes quite a long time to complete -- I have seen times in excess of 15 minutes after the last worker thread has exited!
More information: The worker threads are all set at the highest allowable round-robin scheduling priority (SCHED_RR); I have tried setting the main thread (waiting on the pthread_join()s) to the same thing and have also tried setting it to the highest SCHED_FIFO priority (where so far I have only seen it take as long as 27 seconds to complete; more testing is needed). My test is very CPU and memory intensive and takes about 90 -- 100 minutes to complete; during that time it is generally using all 8 threads at close to 100% capacity, and fairly quickly gets to where it is using about 90% of the 256 GB of RAM. This is running on a Linux (Fedora) OS at run level 3 (so no graphics or Window Manager -- essentially just a terminal -- because at the usual run level 5, a process using that much memory gets killed by the system).
An earlier version that took closer to 4 hours to complete (I have since made some performance improvements...) and in which I did not bother explicitly setting the priority of the main thread once took over an hour and 20 minutes for the pthread_join() to complete. I mention it because I don't really think that the main thread priority should be much of an issue -- there is essentially nothing else happening on the machine, it is not even on the network.
As I mentioned, all the threads complete with EXIT_SUCCESS. And in lighter weight tests, where the processing is over in seconds, I see no such delay. And so I am left suspecting that this is a scheduler issue. I know very little about the scheduler, but informally the impression I have is that here is this thread that has been waiting on a pthread_join() for well over an hour; perhaps the scheduler eventually shuffles it off to a queue of "very unlikely to require any processing time" tasks, and only checks it rarely.
Okay, eventually it completes. But ultimately, to get my work done, I have to run about 1000 of these, and some are likely to take a great deal longer than the 90 minutes or so that the case I have been testing takes. So I have to worry that the pthread_join() in those cases might delay even longer, and with 1000 iterations, those delays are going to add up to real time...
Thanks in advance for any suggestions.
In response to Nate's excellent questions and suggestions:
I have used top to spy on the process when it is in this state; all I can report is that it is using minimal CPU (maybe an occasional 2%, compared to the usual 700 - 800% that top reports for 8 threads running flat out, modulo some contention for locked resources). I am aware that top has all kinds of options I haven't investigated, and will look into how to run it to display information about the state of the main thread. (I see: I can use the -H option, and look in the S column... will do.) It is definitely not a matter of all the memory being swapped out -- my code is very careful to stay below the limit of physical memory, and does some disk I/O of its own to save and restore information that can't fit in memory. As a result little to no virtual memory is in use at any time.
I don't like my theory about the scheduler either... It's just the best I have been able to come up with so far...
As far as how I am determining when things happen: The exiting code does:
time_t now;
time(&now);
printf("Thread exiting, %s", ctime(&now));
pthread_exit(EXIT_SUCCESS);
and then the main thread does:
for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));
I like the idea of printing something each time pthread_join() returns, to see if we're waiting for the first thread to complete, the last thread to complete, or one in the middle, and will make that change.
A couple of other potentially relevant facts that have occurred to me since my original posting: I am using the GMP (GNU Multiple Precision Arithmetic) library, which I can't really imagine matters; and I am also using a third-party (open source) library to create "canonical graphs," and that library, in order to be used in a multithreaded environment, does use some thread_local storage. I will have to dig into the particulars; still, it doesn't seem like cleaning that up should take any appreciable amount of time, especially without also using an appreciable amount of CPU.
I have a Go-based reverse proxy server. When I monitor metrics on the app (using Prometheus), I've noticed that when the load on the app goes up, the number of threads (go_threads) goes from around 20 to about 55. Then after the load goes away, these threads are still around, even after many hours.
However, I can see go_goroutines and the memory usage go down, but not the threads.
I have a couple of questions:
What is the default size of the thread pool in go?
How long do idle threads stick around?
What is the default size of the thread pool in go?
GOMAXPROCS
How long do idle threads stick around?
Until the process terminates
Notice:
There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
This means your best chance to keep thread count low is to have few blocking system calls.
For your first question, I couldn't find a default size of the thread pool; however, the documentation for runtime/debug.SetMaxThreads seems to indicate that a new thread is created whenever an existing OS thread is blocked. It could be the case that the Go runtime starts with only a single OS thread and creates more as needed, but that could vary depending on the version of Go.
As for your second question, idle OS threads currently stick around forever. There's an open issue (https://github.com/golang/go/issues/14592) that deals with closing idle threads; however, as of the latest Go release (1.14) there has been little to no work towards doing so.
I use a WorkManager to do database synchronization from several universities to a core banking system:
the sync starts every 5 minutes and runs until completed.
But I've got an error:
ThreadMonitor W WSVR0605W: Thread "WorkManager.DefaultWorkManager : 1250" (00001891) has been active for 1009570 milliseconds and may be hung. There is/are 2 thread(s) in total in the server that may be hung.
This error causes the database sync to roll back automatically.
I found some documentation here: http://publib.boulder.ibm.com/infocenter/wasinfo/v6r0/index.jsp?topic=/com.ibm.websphere.express.doc/info/exp/ae/ttrb_confighangdet.html
ThreadMonitor monitors the active threads, and once a thread has been active for more than the N milliseconds set in the alarm threshold, ThreadMonitor gives the above error message. However, I notice all my sync operations take longer than N to complete.
My question is: does ThreadMonitor just report a warning when an active thread runs for more than N milliseconds (i.e., it flags it as a potentially hung thread), or does ThreadMonitor also kill hung threads?
ThreadMonitor simply monitors the threads which are active beyond a threshold time.
This should serve as a warning to the WAS administrators that some thread is taking a long time to process (which might be genuine or otherwise).
The ThreadMonitor will not kill the thread.
In many cases, a thread might genuinely take a long time to process (depending on what it does), so the ThreadMonitor restricts itself to identifying potentially hung threads and leaves the actual job of finding out what the thread is doing (based on thread dumps and locating the specific thread ID) to the administrator.
The threshold time can be configured for your servers if you desire to have a different value from the default.
@Muky,
com.ibm.websphere.threadmonitor.threshold is the property that you need to configure.
Look at this URL: http://pic.dhe.ibm.com/infocenter/wasinfo/v7r0/index.jsp?topic=%2Fcom.ibm.websphere.express.doc%2Finfo%2Fexp%2Fae%2Fttrb_confighangdet.html for more details.
HTH
Manglu
Reading CLR via C# 2.0 (I don't have 3.0 with me at the moment).
Is this still the case:
If there is only one CPU in a computer, only one thread can run at any one time. Windows has to keep track of the thread objects, and every so often, Windows has to decide which thread to schedule next to go to the CPU. This is additional code that has to execute once every 20 milliseconds or so. When Windows makes a CPU stop executing one thread's code and start executing another thread's code, we call this a context switch. A context switch is fairly expensive because the operating system has to:
So, circa CLR via C# 2.0, let's say we are on a Pentium 4 2.4 GHz, single core, non-HT, running XP. Every 20 milliseconds? Given that a CLR thread or Java thread is mapped to an OS thread, would only a maximum of 50 threads per second get a chance to run?
I've read here on SO that the context switch itself is very fast (microseconds), but how often, roughly (order-of-magnitude guesses are fine), will a modest five-year-old server, say a single-core Pentium Xeon running Windows 2003, give the OS the opportunity to context switch? Is 20 ms in the right area?
I don't need exact figures; I just want to be sure that's in the right area, as it seems rather long to me.
The quantum, as it's called, is dependent on a few things, including performance tweaks the operating system makes as it goes along; for instance, the foreground process is given a higher priority and can be given a quantum up to three times longer than the default. There is also a difference between server and client SKUs: typically a client would have a default quantum of 30 ms where a server would be 180 ms.
So a foreground process that wants as much CPU as it can get may get a quantum of 90 ms before a context switch, and then the OS may decide it doesn't need to switch and let the quantum continue.
Your "50 threads at a time" math is wrong. You assume that each of those threads is in a 100% CPU state. Most threads are in fact asleep, waiting for I/O or other events. Even then, most threads don't use their entire 20 ms before going into I/O mode or otherwise giving up their slice.
Try this: write an app with an infinite loop (it eats its entire CPU window). Run 50 instances of it. See how Windows reacts.
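No particular code was given for that test; a minimal sketch of the idea in Java (the thread count, loop and print interval are arbitrary) would be:

// Rough sketch of the busy-loop experiment: N threads that never yield voluntarily,
// so the OS scheduler has to time-slice them.
public class BusyLoops {
    public static void main(String[] args) throws InterruptedException {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 50;
        for (int i = 0; i < n; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                long counter = 0;
                while (true) {                          // eat the whole time slice, every time
                    if (++counter % 2_000_000_000L == 0) {
                        System.out.println("thread " + id + " is still getting CPU");
                    }
                }
            });
            t.setDaemon(true);                          // let the JVM exit when main finishes
            t.start();
        }
        Thread.sleep(60_000);                           // watch Task Manager / top for a minute
    }
}

Running 50 separate single-threaded instances instead of 50 threads in one process exercises the scheduler in much the same way.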
I just did a test: I got 43 threads each seeing its share within a second (after warming up), which makes Richter's statement pretty accurate (with overhead), I'd say. Quad-core/Win7/64-bit. Yes, these were 100% CPU threads, so obviously they weren't giving the CPU back before their 20 ms. Interesting.
Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is: which one's throughput, in terms of transactions processed per minute, might be better, and why?
Note that the above is not a real environment; I just made up the data to make the question clear. My current thinking is that there should be no difference, because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given the fact that there are other applications running on the machine and one application might not always be given 100% of the CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads are generally better. Also, if the threads aren't executing the same part of the code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single-proc VM, since I couldn't find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one-proc PC, with lots of threads that are doing processor-intensive work, Windows is smart enough not to rapidly switch them back and forth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
The result: Windows' scheduler is fair on a thread level, but not on a process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
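If you just want the shape of the experiment without downloading the zip, here is a minimal sketch of the same idea (written in Java here; the original program's language, work unit and exact timing details may differ):

// Sketch of the experiment: divide a fixed number of CPU-bound "work units"
// evenly across a chosen number of threads and time how long the batch takes.
// Usage: java ThreadWorkSketch <threads> <totalUnits>
public class ThreadWorkSketch {
    private static void doWorkUnit() {
        double x = 0;
        for (int i = 0; i < 5_000_000; i++) {    // arbitrary CPU-bound busywork
            x += Math.sqrt(i);
        }
        if (x < 0) System.out.println(x);        // keep the JIT from discarding the loop
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = Integer.parseInt(args[0]);     // e.g. 1 or 10
        int totalUnits = Integer.parseInt(args[1]);  // e.g. 5000
        int unitsPerThread = totalUnits / threads;

        long start = System.nanoTime();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int u = 0; u < unitsPerThread; u++) {
                    doWorkUnit();
                }
            });
            pool[t].start();
        }
        for (Thread t : pool) {
            t.join();                                // wait for every worker to finish
        }
        System.out.printf("%d threads, %d units: %.4f sec%n",
                threads, totalUnits, (System.nanoTime() - start) / 1e9);
    }
}

Run it once with 1 thread and once with 10 threads (and then both at the same time) to reproduce the comparison described above.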
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much CPU time to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on another. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand, you must consider the overhead involved in dealing with multiple threads. The details of the application are an important part of the consideration here.