Excessive number of system calls when using `threadDelay` - multithreading

I have a couple of Haskell processes running in production on a system with 12 cores. All processes are compiled with -threaded and run with 12 capabilities. One library they all use is resource-pool, which keeps a pool of database connections.
What's interesting is that even though all the processes are practically idle, they consume around 2% CPU time. Inspecting one of these processes with strace -p $(pgrep processname) -f reveals that it issues an unreasonable number of system calls even though it should not really be doing anything. To put things into perspective:
Running strace on a process with -N2 for 5 seconds produces a 66K log file.
Running it with an (unreasonable) -N64 yields a 60 MB log.
So the number of capabilities drastically increases the number of system calls being issued.
Digging deeper, we find that resource-pool runs a reaper thread which fires every second to check whether it can clean up some resources. We can simulate the same behavior with this trivial program:
module Main where

import Control.Concurrent
import Control.Monad (forever)

main :: IO ()
main = forever $ do
  threadDelay (10 ^ 6)  -- sleep for one second, like resource-pool's reaper thread
If I pass -B to the runtime system I get audible feedback whenever a GC occurs, which in this case is every 60 seconds.
When I suppress these GC cycles by passing -I0 to the RTS, running the strace command on the process yields log files of only around 70K. Since the process also runs a scotty server, GC is triggered when requests come in, so collections seem to happen when I actually need them.
Since we are going to increase the number of Haskell processes on this machine considerably over the course of the next year, I was wondering how to keep their idle CPU usage at a reasonable level. Passing -I0 apparently seems to be a rather bad idea (?). Another idea would be to decrease the number of capabilities from 12 to something like 4. Is there any other way to tweak the RTS so that I can keep the processes from burning too many CPU cycles while idling?

The way GHC's memory management is structured, a 'major GC' is periodically needed while the program runs in order to keep memory usage under control. This is a relatively expensive operation, and it 'stops the world': the program makes no progress while it is occurring.
Obviously, it is undesirable for this to happen at any crucial point of program execution. Therefore by default, whenever a GHC-compiled program goes idle, a major GC is performed. This is usually an unobtrusive way of keeping the garbage level down and overall memory efficiency and performance up, without interrupting program interaction. This is known as 'idle GC'.
However, this can become a problem in a scenario like this one: many concurrent processes, each woken frequently, running for a short time, and then going back to idle. This is a common pattern for server processes. In this case, when idle GC kicks in it doesn't obstruct the process it runs in, which has completed its work, but it does steal resources from the other processes running on the system. Since the program idles frequently, it is not necessary to incur the overhead of a major GC on every single idle period.
The 'brute force' approach would be to pass the RTS option -I0 to the program, disabling idle GC entirely. This solves the problem in the short run, but it misses opportunities to collect garbage; garbage could then accumulate and cause a GC to kick in at an inopportune moment.
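(As an aside, and not part of the original answer: if idle GC is disabled with -I0, one option is to trigger major collections explicitly at moments the application knows are quiet, using performMajorGC from System.Mem. The sketch below is only an illustration of that idea; the five-minute interval is arbitrary, and the final loop is a placeholder for whatever server the process actually runs.)

module Main where

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)
import System.Mem (performMajorGC)

main :: IO ()
main = do
  -- Background thread: force a major collection every five minutes.
  _ <- forkIO . forever $ do
    threadDelay (300 * 10 ^ 6)
    performMajorGC
  -- The real application (e.g. the scotty server from the question)
  -- would run here; this placeholder just idles.
  forever $ threadDelay (10 ^ 6)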
Partly in response to this question, the flag -Iw was added to the GHC runtime system. It establishes a minimum interval between idle GCs: for example, with -Iw5 the RTS will not run an idle GC until 5 seconds have elapsed since the last GC, even if the program idles several times in between. This should solve the problem.
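To see what the idle GC is doing, here is a minimal sketch (an illustration, not from the original question or answer) that extends the reaper-style loop above to print the cumulative major GC count once per second. It assumes the program is built with -threaded -rtsopts and run with something like +RTS -N12 -T -Iw10 -RTS; the -T flag enables the statistics that getRTSStats reads. Comparing runs with the default settings, with -I0 and with -Iw10 should make the difference in idle GC frequency visible.

module Main where

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import GHC.Stats (getRTSStats, getRTSStatsEnabled, major_gcs)

main :: IO ()
main = do
  statsEnabled <- getRTSStatsEnabled
  forever $ do
    threadDelay (10 ^ 6)  -- idle for one second, like the reaper thread
    if statsEnabled
      then do
        stats <- getRTSStats
        putStrLn ("major GCs so far: " ++ show (major_gcs stats))
      else putStrLn "run with +RTS -T to enable GC statistics"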
Just keep in mind the caveat in the GHC User's Guide:
This is an experimental feature, please let us know if it causes problems and/or could benefit from further tuning.
Happy Haskelling!

Related

What could delay pthread_join() after threads have exited successfully?

My main thread creates 8 worker threads (on a machine with a 4 core, 8 thread CPU), and then waits for them to complete with pthread_join(). The threads all exit successfully, and the pthread_join() successfully completes. However, I log the times that the threads exit and the time that pthread_join() completes for the last thread; the threads all exit essentially simultaneously (not surprising -- they are servicing a queue of work to be done), and the pthread_join() sometimes takes quite a long time to complete -- I have seen times in excess of 15 minutes after the last worker thread has exited!
More information: The worker threads are all set at the highest allowable round-robin scheduling priority (SCHED_RR); I have tried setting the main thread (waiting on the pthread_join()s) to the same thing and have also tried setting it to the highest SCHED_FIFO priority (where so far I have only seen it take as long as 27 seconds to complete; more testing is needed). My test is very CPU and memory intensive and takes about 90 -- 100 minutes to complete; during that time it is generally using all 8 threads at close to 100% capacity, and fairly quickly gets to where it is using about 90% of the 256 GB of RAM. This is running on a Linux (Fedora) OS at run level 3 (so no graphics or Window Manager -- essentially just a terminal -- because at the usual run level 5, a process using that much memory gets killed by the system).
An earlier version that took closer to 4 hours to complete (I have since made some performance improvements...) and in which I did not bother explicitly setting the priority of the main thread once took over an hour and 20 minutes for the pthread_join() to complete. I mention it because I don't really think that the main thread priority should be much of an issue -- there is essentially nothing else happening on the machine, it is not even on the network.
As I mentioned, all the threads complete with EXIT_SUCCESS. And in lighter weight tests, where the processing is over in seconds, I see no such delay. And so I am left suspecting that this is a scheduler issue. I know very little about the scheduler, but informally the impression I have is that here is this thread that has been waiting on a pthread_join() for well over an hour; perhaps the scheduler eventually shuffles it off to a queue of "very unlikely to require any processing time" tasks, and only checks it rarely.
Okay, eventually it completes. But ultimately, to get my work done, I have to run about 1000 of these, and some are likely to take a great deal longer than the 90 minutes or so that the case I have been testing takes. So I have to worry that the pthread_join() in those cases might delay even longer, and with 1000 iterations, those delays are going to add up to real time...
Thanks in advance for any suggestions.
In response to Nate's excellent questions and suggestions:
I have used top to spy on the process when it is in this state; all I can report is that it is using minimal CPU (maybe an occasional 2%, compared to the usual 700 - 800% that top reports for 8 threads running flat out, modulo some contention for locked resources). I am aware that top has all kinds of options I haven't investigated, and will look into how to run it to display information about the state of the main thread. (I see: I can use the -H option, and look in the S column... will do.) It is definitely not a matter of all the memory being swapped out -- my code is very careful to stay below the limit of physical memory, and does some disk I/O of its own to save and restore information that can't fit in memory. As a result little to no virtual memory is in use at any time.
I don't like my theory about the scheduler either... It's just the best I have been able to come up with so far...
As far as how I am determining when things happen: The exiting code does:
time_t now;
time(&now);
printf("Thread exiting, %s", ctime(&now));
pthread_exit(EXIT_SUCCESS);
and then the main thread does:
for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));
I like the idea of printing something each time pthread_join() returns, to see if we're waiting for the first thread to complete, the last thread to complete, or one in the middle, and will make that change.
A couple of other potentially relevant facts have occurred to me since my original posting: I am using the GMP (GNU Multiple Precision Arithmetic) library, which I can't really imagine matters; and I am also using a third-party (open source) library to create "canonical graphs," and that library, in order to be usable in a multithreaded environment, does use some thread_local storage. I will have to dig into the particulars; still, it doesn't seem like cleaning that up should take any appreciable amount of time, especially without also using an appreciable amount of CPU.

100% cpu usage profile output, what could cause this based on our profile log?

We have a massively scaled Node.js project (~1M+ users) that is suddenly taking a massive beating on our CPU (an EPYC, 24 cores at 2 GHz).
We've been trying to debug what's using all our CPU with a profiler (I can show you the output down below), and whatever it is, it's behaving really weirdly.
We have a master process that spawns 48 cluster workers; after they're all loaded, CPU usage slowly grows to the maximum. After killing a cluster worker, the load average doesn't dip at all, yet after killing the master process everything goes back to normal.
The master process obviously isn't maxing out all the threads itself, and killing a cluster worker really should make a difference, shouldn't it?
We even stopped the application's user input and an entire cluster worker, and it didn't reduce CPU usage at all.
We've got plenty of log files we could send if you want them.
Based on the profile, it looks like the code is spending a lot of time getting the current time from the system. Do you maybe have Date.now() (or old-school, extra-inefficient +new Date()) calls around a bunch of frequently used, relatively quick operations? Try removing those; you should see a speedup (or a drop in CPU utilization, respectively).
As for stopping user input not reducing CPU load: are you maybe scheduling callbacks? Or promises, or other async requests? It's not difficult to write a program that only needs to be kicked off and then keeps the CPU busy forever on its own.
Beyond these rough guesses, there isn't enough information here to dig deeper. Is there anything other than time-related stuff on the profile, further down? In particular, any of your own code? What does the bottom-up profile say?

Determining latency in threads that use sleep_for using SCHED_FIFO

I have an embedded Linux system built with PREEMPT_RT (the real-time patch) that creates multiple SCHED_FIFO threads, each with a priority of 90. The goal is that they execute without being preempted, while having the same priority as one another.
Each thread does a little bit of work, then goes to sleep using std::this_thread::sleep_for() for a few milliseconds, then gets scheduled back and executes the same amount of work.
Most of the time, each thread's latency is impeccable, but once every minute or so (not at an exact regular interval) all threads get held up at the same time for one second or more, instead of the few milliseconds they usually run at.
I have made sure power management is disabled in the kernel Kconfig, and I have called mlockall() to avoid memory getting paged out, all to no avail.
I have tried to use ftrace with wakeup_rt as the tracer, but the highest latency recorded was around 5ms, not nearly enough time to be the cause of the issue.
I am not sure what tool would be best to identify where the latency is coming from. Does anyone have ideas please?

Does a Tickless Linux Kernel Introduce Benchmark Timing Variations?

I'm running some benchmarks and I'm wondering whether using a "tickless" (a.k.a. CONFIG_NO_HZ_FULL_ALL) Linux kernel would be useful or detrimental to benchmarking.
The benchmarks I am running will be repeated many times using a new process each time. I want to control as many sources of variation as possible.
I did some reading on the internet:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
https://lwn.net/Articles/549580/
From these sources I have learned that:
In the default configuration (CONFIG_NO_HZ=y), only non-idle CPUs receive ticks. Therefore under this mode my benchmarks would always receive ticks.
In "tickless" mode (CONFIG_NO_HZ_FULL_ALL), all CPUs but one (the boot processor) are in "adaptive-tick" mode. When a CPU is in adaptive-tick mode, ticks are only received if there is more than a single job in the schedule queue for the CPU. The idea being that if there is a sole process in the queue, a context switch cannot happen, so sending ticks is not necessary.
On one hand, not having benchmarks receive ticks seems like a great idea, since we rule out the tick routine as a source of variation (we do not know how long the tick routines take). On the other hand, I think tickless mode could introduce variations in benchmark timings.
Consider my benchmarking scenario running on a tickless kernel. Suppose we repeat a benchmark twice.
Suppose the first run is lucky, and gets scheduled onto an adaptive-tick CPU which was previously idle. This benchmark will therefore not be interrupted by ticks.
When the benchmark is run a second time, perhaps it is not so lucky and gets put on a CPU which already has some processes scheduled. This run will be interrupted by ticks at regular intervals in order to decide if one of the other processes should be switched in.
We know that ticks impose a performance hit (context switch plus the time taken to run the routine). Therefore the first benchmark run had an unfair advantage, and would appear to run faster.
Note also that a benchmark which initially has an adaptive-tick CPU to itself may find that mid-benchmark another process gets thrown on to the same CPU. In this case the benchmark is initially not receiving ticks, then later starts receiving them. This means benchmark performance can change over time.
So I think tickless mode (under my benchmarking scenario at least) introduces timing variations. Is my reasoning correct?
One solution would be to use an isolated adaptive-tick CPU for benchmarking (isolcpus + taskset), however we have already ruled out isolated CPUs since this introduces artificial slowdowns in our multi-threaded benchmarks.
Thanks
For your "unlucky" scenario above, there has to be an active job scheduled on the same processor. This is not likely to be the case on an otherwise generally idle system, assuming that you have multiple processors. Even if this happens on one or two occasions, that means your benchmark might see the effect of one or two ticks - which hardly seems problematic.
On the other hand if it happens on many more occasions, this would be a general indication of high processor load - not an ideal scenario for running benchmarks anyway.
I would suggest, though, that "ticks" are not likely to be a significant source of variation in your benchmark timings. The scheduler is supposed to be O(1). I doubt you will see much difference in variation between tickless and non-tickless mode.

Question about an app with multiple threads on a machine with few CPUs

Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is: which one's throughput, in terms of transactions processed per minute, might be better, and why?
Note that the above is not a real environment; I just made up the data to make the question clear. My current thinking is that there should be no difference, because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm happy to be proven wrong, given that there are other applications running on the machine and one application might not always be given 100% of the CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads are generally better. Also, if the threads aren't executing the same part of the code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor-intensive work, then return. I specify the total number of threads I want to run and the total number of times I want the work to run. The program then divides the work equally between all the threads (if there's only one thread, it just gets all of it) and starts them all up.
I ran this on a single-processor VM, since I couldn't find a real computer with only one processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one-processor PC with lots of threads doing processor-intensive work, Windows is smart enough not to rapidly switch them back and forth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took about 1/10th longer (because there was an 11th thread running), while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56 sec, then did the other 9/10ths in the next 43 sec... which is about right).
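As a rough check of that arithmetic (a worked illustration added here, assuming perfectly fair per-thread scheduling and the roughly 50-second single-run baseline from the numbers above):

\[
t_{10\,\text{threads}} \approx 50\,\mathrm{s} \times \tfrac{11}{10} = 55\,\mathrm{s} \quad (\text{observed } 56.9\,\mathrm{s})
\]
\[
t_{1\,\text{thread}} \approx 56.9\,\mathrm{s} + \bigl(50\,\mathrm{s} - \tfrac{56.9\,\mathrm{s}}{11}\bigr) \approx 102\,\mathrm{s} \quad (\text{observed } 99.5\,\mathrm{s})
\]

That is, during the first ~57 seconds the lone thread gets 1/11 of the CPU and completes only about 5 of its 50 seconds of work, then finishes the rest on its own once the other process has exited.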
The result: Windows' scheduler is fair on a thread level, but not on a process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in the single-threaded days the scheduler knew only about processes, and had measures like "niceness" to figure out how much CPU time to allocate to each.
With multithreading, there is probably a mechanism by which one process that has 100 threads doesn't get 99% of the CPU time when there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded, I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads allows you to wait for a response on one transaction while preparing the next one on another thread. At least that's how I understand it, so I think a few threads will turn out better than one.
On the other hand, you must consider the overhead involved in dealing with multiple threads. The details of the application are an important part of the consideration here.
