What could delay pthread_join() after threads have exited successfully? - multithreading

My main thread creates 8 worker threads (on a machine with a 4 core, 8 thread CPU), and then waits for them to complete with pthread_join(). The threads all exit successfully, and the pthread_join() successfully completes. However, I log the times that the threads exit and the time that pthread_join() completes for the last thread; the threads all exit essentially simultaneously (not surprising -- they are servicing a queue of work to be done), and the pthread_join() sometimes takes quite a long time to complete -- I have seen times in excess of 15 minutes after the last worker thread has exited!
More information: The worker threads are all set at the highest allowable round-robin scheduling priority (SCHED_RR); I have tried setting the main thread (waiting on the pthread_join()s) to the same thing and have also tried setting it to the highest SCHED_FIFO priority (where so far I have only seen it take as long as 27 seconds to complete; more testing is needed). My test is very CPU and memory intensive and takes about 90 -- 100 minutes to complete; during that time it is generally using all 8 threads at close to 100% capacity, and fairly quickly gets to where it is using about 90% of the 256 GB of RAM. This is running on a Linux (Fedora) OS at run level 3 (so no graphics or Window Manager -- essentially just a terminal -- because at the usual run level 5, a process using that much memory gets killed by the system).
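(For anyone wanting to reproduce the setup: priorities like these are typically set with pthread_setschedparam(). The following is only a minimal sketch of that call, not the exact code used here; the thread handles and the choice of maximum priority are illustrative.)

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Illustrative helper: give a thread the maximum priority for a policy.
   Needs appropriate privileges (e.g. root or CAP_SYS_NICE), otherwise
   pthread_setschedparam() fails with EPERM. */
static int set_thread_policy(pthread_t thread, int policy)
{
    struct sched_param param;
    param.sched_priority = sched_get_priority_max(policy);
    int err = pthread_setschedparam(thread, policy, &param);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
    return err;
}

/* e.g. set_thread_policy(pthread_self(), SCHED_FIFO);   main thread
        set_thread_policy(threads[i], SCHED_RR);         each worker  */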
An earlier version that took closer to 4 hours to complete (I have since made some performance improvements...) and in which I did not bother explicitly setting the priority of the main thread once took over an hour and 20 minutes for the pthread_join() to complete. I mention it because I don't really think that the main thread priority should be much of an issue -- there is essentially nothing else happening on the machine, it is not even on the network.
As I mentioned, all the threads complete with EXIT_SUCCESS. And in lighter weight tests, where the processing is over in seconds, I see no such delay. And so I am left suspecting that this is a scheduler issue. I know very little about the scheduler, but informally the impression I have is that here is this thread that has been waiting on a pthread_join() for well over an hour; perhaps the scheduler eventually shuffles it off to a queue of "very unlikely to require any processing time" tasks, and only checks it rarely.
Okay, eventually it completes. But ultimately, to get my work done, I have to run about 1000 of these, and some are likely to take a great deal longer than the 90 minutes or so that the case I have been testing takes. So I have to worry that the pthread_join() in those cases might delay even longer, and with 1000 iterations, those delays are going to add up to real time...
Thanks in advance for any suggestions.
In response to Nate's excellent questions and suggestions:
I have used top to spy on the process when it is in this state; all I can report is that it is using minimal CPU (maybe an occasional 2%, compared to the usual 700 - 800% that top reports for 8 threads running flat out, modulo some contention for locked resources). I am aware that top has all kinds of options I haven't investigated, and will look into how to run it to display information about the state of the main thread. (I see: I can use the -H option, and look in the S column... will do.) It is definitely not a matter of all the memory being swapped out -- my code is very careful to stay below the limit of physical memory, and does some disk I/O of its own to save and restore information that can't fit in memory. As a result little to no virtual memory is in use at any time.
I don't like my theory about the scheduler either... It's just the best I have been able to come up with so far...
As far as how I am determining when things happen: The exiting code does:
time_t now;
time(&now);
printf("Thread exiting, %s", ctime(&now));   /* ctime() supplies the newline */
pthread_exit((void *)EXIT_SUCCESS);          /* pthread_exit() takes a void * */
and then the main thread does:
for (int i = 0; i < WORKER_THREADS; i++)
{
    pthread_join(threads[i], NULL);
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));
I like the idea of printing something each time pthread_join() returns, to see if we're waiting for the first thread to complete, the last thread to complete, or one in the middle, and will make that change.
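A minimal sketch of that change might look like the following (threads[] and WORKER_THREADS are from the code above; the per-join timestamp and status printout are the only additions):

time_t now;
for (int i = 0; i < WORKER_THREADS; i++)
{
    void *status = NULL;
    pthread_join(threads[i], &status);
    time(&now);
    /* Log which join returned and when, so a slow join can be pinned to a
       specific thread rather than to the loop as a whole. */
    printf("pthread_join() returned for thread %d (status %p), %s",
           i, status, ctime(&now));
}
time(&now);
printf("Last worker thread has exited, %s", ctime(&now));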
A couple of other potentially relevant facts that have occurred to me since my original posting: I am using the GMP (GNU Multiple Precision Arithmetic) library, which I can't really imagine matters; and I am also using a 3rd party (open source) library to create "canonical graphs," and that library, in order to be used in a multithreaded environment, does use some thread_local storage. I will have to dig into the particulars; still, it doesn't seem like cleaning that up should take any appreciable amount of time, especially without also using an appreciable amount of CPU.

Related

Why is multi threading not faster on single core?

This question is not a duplicate of the other questions about why multithreading is not faster on a single core; please read the rest to see what I actually want to know.
As far as I know, multithreading is only faster on a CPU with multiple cores, since each thread can run in parallel. However, given my understanding of how preemption and multithreading work on a single core, it seems like it should be faster there too. The image below describes what I mean. Consider that our app is a simple loop that takes exactly 4 seconds to execute. In this example the time slice is constant, but I don't think that makes any difference, because in the end all threads with the same priority get equal time from the scheduler. The first timeline is single-threaded, while the second one has 4 threads. A "cycle" here means the point at which preemption ends and the scheduler goes back to the start of the queue of threads. I/O has been left out since it just adds complexity; even if it would change the results, let's assume I'm talking about code that does not require any I/O.
The red threads are threads related to my process, and others (black) are the ones for other processes and apps
There are a couple of questions here:
Why isn't it faster? What's wrong with my timeline?
What's that cycle point called?
Since the time slice is not fixed, does that mean the cycle time is fixed, or is the time slice calculated so that the cycle lasts however long it takes to spend the calculated slice on each thread?
Is the time slice based on time or on instructions? I mean, is it something like 0.1 seconds for each thread, or more like 10 instructions for each thread?
CPU utilization is based on CPU time, so why isn't it always at 100%? When a thread's time slice expires, the scheduler moves on to the next thread, and if a thread blocks on I/O the CPU does not wait but executes the next one, so the CPU always tries to find a runnable thread and minimize idle time. Is the time spent on I/O so significant that more than 50% of CPU time is spent doing nothing, because all threads are waiting for something (mostly I/O) and the CPU time elapses waiting for a thread to become ready?
Note: This timeline is simplified; the time spent on I/O, thread creation, etc. is not counted, and it is assumed that the other threads do not finish before the end of the timeline and have the same priority/nice value as our process. (A rough way to measure this on a real machine is sketched below.)
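As a sanity check of the reasoning above, here is a minimal illustrative C sketch (not from the question; NUM_THREADS and TOTAL_ITERS are arbitrary) that runs a fixed amount of pure CPU-bound work either in one thread or split across N threads and prints the wall-clock time for each:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_THREADS 4
#define TOTAL_ITERS 400000000UL   /* arbitrary amount of CPU-bound work */

static void *spin(void *arg)
{
    unsigned long iters = *(unsigned long *)arg;
    volatile unsigned long sink = 0;   /* volatile so the loop isn't optimized away */
    for (unsigned long i = 0; i < iters; i++)
        sink += i;
    return NULL;
}

static double run(int nthreads)
{
    pthread_t tids[NUM_THREADS];
    unsigned long per_thread = TOTAL_ITERS / nthreads;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, spin, &per_thread);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("1 thread : %.2f s\n", run(1));
    printf("%d threads: %.2f s\n", NUM_THREADS, run(NUM_THREADS));
    return 0;
}

Compile with cc -pthread; on a multi-core box you can pin the process to one core with taskset -c 0 to mimic the single-core case. The total work is the same either way, so on one core the two runs should take roughly the same wall-clock time, with the multithreaded run paying a little extra for switching.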

How is fairness of thread scheduling ensured across processes?

Every process has at least one thread of execution, and I read somewhere that modern operating systems only schedule threads, not processes.
So if there are two processes running in the system -- P1 with 1 thread and P2 with 100 threads -- how will the OS scheduling algorithm ensure that both P1 and P2 get approximately the same amount of CPU time? If the OS blindly schedules threads, P2 will get 100 times more CPU time than P1.
Does the scheduler also take into account which process a particular thread belongs to? Otherwise, it seems too easy for a process to hog all the CPU by creating more threads.
Wrong question. Consider two jobs that are trying to solve the exact same problem by doing the same work and are perfectly identical except for one thing -- one uses dozens of threads, the other uses dozens of processes. Why should the one that uses dozens of processes get more CPU time than the one that uses dozens of threads?
Your notion of fairness is not really a sensible one.
Instead, scheduling is designed more around trying to get as much work done as possible per unit time. The assumption is that everything the computer is doing is useful, and that it benefits each task to have the tasks competing with it finish as quickly as possible too.
This is actually all you need the vast majority of the time. But occasionally you have special situations where this doesn't work. One is ultra-high-priority tasks like keeping video or audio flowing or keeping a user interface responsive. Another is ultra-low-priority tasks where there's an enormous amount of work you want done and you don't want the system to be slow for a long time while you're working on it. Priorities are used for this, and generally the system allows higher-priority threads to interrupt lower-priority ones to keep responsiveness.
In general, "fair thread scheduling" attempts to give each thread an equal amount of CPU time (regardless of how much CPU time all threads in a process get); and "fair process scheduling" attempts to give each process the same amount of CPU time (e.g. by giving threads belonging to different processes unequal amounts of CPU time). These are mutually exclusive - you can't have both (unless each process has the same number of threads).
Note that it's all a broken joke anyway. For example, if one thread gets 10 ms of time on a CPU that is running slow due to thermal throttling (and/or because another logical CPU in the same core is busy) and another thread gets 10 ms of time on a CPU that is running faster than normal (e.g. due to "turbo-boost" and/or because the other logical CPU in the core is not being used), then these threads have received an equal amount of CPU time but have not received anything that could be considered "fair" (because one thread might be able to get 20 times as much work done as the other).
Note that it's all unwanted anyway. For example, for a good OS threads would be given a priority to indicate how important the work they do is, and you don't want a high priority thread (doing very important work) to get the same "fair share" of CPU time as a low priority thread (doing irrelevant/unimportant work). For cases where two threads have equal priority you might (in theory) want them to get an "equal" amount of CPU time; but in practice this isn't common, and threads block and unblock so often that it isn't worth caring about; and in practice it can lead to "two half finished jobs instead of one completed job and one unstarted job" scenarios that increase the average amount of time a job (e.g. a request for work) takes to complete.
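To put numbers on the distinction between per-thread and per-process fairness, here is a small illustrative C program (not from any of the answers; RUN_SECONDS and the thread count are arbitrary) that forks two otherwise identical processes -- P1 spinning in a single thread and P2 spinning in four -- for a few seconds and reports how much work each completed. Under a purely per-thread-fair scheduler P2 should report roughly four times P1's count; under per-process fairness (e.g. via cgroups or, depending on how the processes are grouped, Linux autogrouping) the totals come out much closer.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define RUN_SECONDS 5   /* arbitrary measurement interval */
#define MAX_THREADS 4

static void *spin(void *arg)
{
    unsigned long *count = arg;
    time_t stop = time(NULL) + RUN_SECONDS;
    while (time(NULL) < stop)
        (*count)++;          /* each thread has its own counter, no sharing */
    return NULL;
}

static void run_child(const char *name, int nthreads)
{
    pthread_t tids[MAX_THREADS];
    unsigned long counts[MAX_THREADS] = {0};
    unsigned long total = 0;

    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, spin, &counts[i]);
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
        total += counts[i];
    }
    printf("%s (%d thread%s): %lu iterations\n",
           name, nthreads, nthreads == 1 ? "" : "s", total);
    exit(0);
}

int main(void)
{
    if (fork() == 0) run_child("P1", 1);            /* 1-thread process  */
    if (fork() == 0) run_child("P2", MAX_THREADS);  /* 4-thread process  */
    wait(NULL);
    wait(NULL);
    return 0;
}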
If the thread is the basic unit of scheduling (a generally safe assumption these days) then it is the scheduler that decides which threads get the CPUs. How (and whether) it takes the owning process into account is entirely system specific, and the behavior may depend upon the type of process. For example, in VMS (and adopted in Windows), realtime processes are treated differently from other types of processes.
In VMS-type scheduling, a process with more threads gets more CPU by design, so it is better for an application to use more threads than to use more processes.
Keep in mind that a system may impose limits on the number of threads in a process.

Why are processes deprived of CPU for TOO long while busy looping in the Linux kernel?

At first glance, my question might look bit trivial. Please bear with me and read completely.
I have identified a busy loop in my Linux kernel module. Because of it, other processes (e.g. sshd) are not getting CPU time for long spans of time (like 20 seconds). This is understandable, as my machine has only a single CPU and the busy loop gives the scheduler no chance to run other processes.
Just to experiment, I added a schedule() call after each iteration of the busy loop. Even though this keeps the CPU busy, it should still let other processes run, since I am calling schedule(). But this doesn't seem to be happening. My user-level processes are still hanging for long spans of time (20 seconds).
In this case, the kernel thread has a nice value of -5 and the user-level threads have a nice value of 0. Even with the lower priority of the user-level threads, I think 20 seconds is too long to go without the CPU.
Can someone please explain why this could be happening?
Note: I know how to remove busy loop completely. But, I want to understand the behaviour of kernel here. Kernel version is 2.6.18 and kernel pre-emption is disabled.
The schedule() function simply invokes the scheduler - it doesn't take any special measures to arrange that the calling thread will be replaced by a different one. If the current thread is still the highest priority one on the run queue then it will be selected by the scheduler once again.
It sounds as if your kernel thread is doing very little work in its busy loop and it's calling schedule() every time round. Therefore, it's probably not using much CPU time itself and hence doesn't have its priority reduced much. Negative nice values carry heavier weight than positives, so the difference between a -5 and a 0 is quite pronounced. The combination of these two effects means I'm not too surprised that user space processes miss out.
As an experiment you could try calling the scheduler every Nth iteration of the loop (you'll have to experiment to find a good value of N for your platform) and see if the situation is better - calling schedule() too often will just waste lots of CPU time in the scheduler. Of course, this is just an experiment - as you have already pointed out, avoiding busy loops is the correct option in production code, and if you want to be sure your thread is replaced by another then set it to TASK_INTERRUPTIBLE before calling schedule() to remove it from the run queue (as has already been mentioned in comments).
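A minimal sketch of that experiment (illustrative kernel-module code only, not the asker's actual module; YIELD_EVERY is a placeholder value to tune):

#include <linux/sched.h>
#include <linux/jiffies.h>

#define YIELD_EVERY 1000   /* placeholder: tune N for your platform */

static void busy_work(void)
{
    unsigned long i = 0;

    for (;;) {
        /* ... the real per-iteration work goes here ... */

        if (++i % YIELD_EVERY == 0) {
            /* Option 1: just offer the CPU. We stay runnable, so we may
               be picked again immediately if we're still highest priority. */
            schedule();

            /* Option 2 (commented out): actually leave the run queue for a
               while, guaranteeing other tasks get the CPU.
            set_current_state(TASK_INTERRUPTIBLE);
            schedule_timeout(msecs_to_jiffies(10));
            */
        }
    }
}

On more recent kernels, cond_resched() is the idiomatic way to express option 1.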
Note that your kernel (2.6.18) is using the O(1) scheduler which existed until the Completely Fair Scheduler was added in 2.6.23 (the O(1) scheduler having been added in 2.6 to replace the even older O(n) scheduler). The CFS doesn't use run queues and works in a different way, so you might well see different behaviour - I'm less familiar with it, however, so I wouldn't like to predict exactly what differences you'd see. I've seen enough of it to know that "completely fair" isn't the term I'd use on heavily loaded SMP systems with a large number of both cores and processes, but I also accept that writing a scheduler is a very tricky task and it's far from the worst I've seen, and I've never had a significant problem with it on a 4-8 core desktop machine.

Multi-threading in Fedora

I've written a multi-threaded program with pthreads. My CPU is dual core, but the program does not run in parallel. I have attached the system monitoring output below.
My question is: does Fedora 13 support multi-threading?
Your question is incomplete, so this answer may not be very effective; I will revise it if you add more information.
However, here are a few things you should be able to work out:
Is any thread waiting on another?
Is there a deadlock amongst the threads, where both threads are effectively sleeping?
Is there too much I/O involved? (waiting on sockets, reads and writes to disk; even heavy use of printf counts here)
Do any of the threads have long sleeps (usleep, nanosleep, etc.)?
If any of the above conditions is true, the threads will not run in parallel even when a CPU is available, because the active thread has to wait until its blocking work is done.
The second limitation of your question is the measurement. Your chart shows system-wide throughput. Even with one CPU, thread switching can be effectively invisible in such a chart, because a switch happens within a few tens or hundreds of milliseconds, and if each of your threads is running on the same CPU you can never see from the graph when those threads switched. In fact, the graph you are seeing is shared not only by your 2 threads but by all the processes running on the system.
But as I said, I can only be more helpful if you give complete details.

Question about an app with multiple threads on a machine with few CPUs

Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing, although one uses 10 threads and the other uses 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is: whose throughput, in terms of transactions processed per minute, might be better? Why?
Note that the above is not a real environment; I just made up the data to make the question clear. My current thinking is that there should be no difference, because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given the fact that there are other applications running on the machine and one application might not always be given 100% of the CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads is generally better. Also, if the threads aren't executing the same part of the code, you'll get some cache flushing each time you switch.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single-proc VM since I couldn't find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one-proc PC with lots of threads doing processor-intensive work, Windows is smart enough not to switch them back and forth rapidly, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
The result: Windows' scheduler is fair at the thread level, but not at the process level. If you make a lot of threads, you can leave the other processes that weren't smart enough to make lots of threads high and dry. Or just do it right and use a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on the next. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand you must consider the overhead involved with dealing on multiple threads. The details of the application are important part of the consideration here.
