Get run time of parallel Prolog program - multithreading

I'm using SWI Prolog and have a program that goes like this:
main :-
    statistics(runtime, [T0|_]),
    thread_create(...),
    thread_create(...),
    thread_join(...),
    thread_join(...),
    statistics(runtime, [T1|_]),
    T is T1 - T0,
    print(T).
The problem is that for some reason T is always 0. However, if the thread_create / thread_join part is replaced with its equivalent serial code, I get a non-zero time.
A 'workaround' (though I think it's not 100% correct) I've found is to use walltime instead of runtime as the first argument to statistics/2, but I read that wall time is the actual time I could measure on, say, a real wall clock, and that it should not be used to measure program execution time.
EDIT: Also, if I add a similar timing mechanism to each thread's goal, the timings are non-zero. I assume that runtime only measures CPU time for the thread running it and evaluates to 0 in the first thread (the one running main) because that thread does very little other than delegate the real work to the newly created threads.

You are using the compatibility keys, which mostly originate from Quintus Prolog, from a time when there were no threads and milliseconds were considered very accurate. Use the native keys instead. One is process_cputime, returning the CPU time of the entire process (all threads). There is also thread_cputime, returning the CPU time of all finished threads, and plain cputime, returning the time of the calling thread. All values are floats expressing time in seconds. Resolution depends on the OS, but it is typically quite accurate on modern OSes.
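For what it's worth, the distinction between process-wide and per-thread CPU time is the same one POSIX exposes through clock_gettime(). The following C sketch only illustrates those underlying OS clocks; it is not SWI-Prolog's implementation, and the loop is just an arbitrary way to burn CPU:

/* Illustration only: the POSIX per-process and per-thread CPU clocks. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static double cpu_seconds(clockid_t clk)
{
    struct timespec ts;
    clock_gettime(clk, &ts);              /* read the requested CPU clock */
    return ts.tv_sec + ts.tv_nsec / 1e9;  /* convert to seconds as a double */
}

int main(void)
{
    volatile unsigned long x = 0;
    unsigned long i;
    for (i = 0; i < 100000000UL; i++)     /* burn some CPU in this thread */
        x += i;

    printf("calling thread CPU time: %.3f s\n", cpu_seconds(CLOCK_THREAD_CPUTIME_ID));
    printf("whole process CPU time:  %.3f s\n", cpu_seconds(CLOCK_PROCESS_CPUTIME_ID));
    return 0;
}

Compile it as an ordinary C program; on older glibc versions you may need to link with -lrt.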

Related

Why is multithreading not faster on a single core?

This question is not a duplicate of any question about why multithreading is not faster on a single core; read the rest to see what I actually want to know.
As far as I know, multithreading is only faster on a CPU with multiple cores, since each thread can run in parallel. However, based on my understanding of how preemption and multithreading work on a single core, it should be faster there too. The image below describes what I mean. Consider that our app is a simple loop that takes exactly 4 seconds to execute. In this example the time slice is constant, but I don't think that makes any difference because, in the end, all threads with the same priority will get equal time from the scheduler. The first timeline is single-threaded, but the second one has 4 threads. A "cycle" marks the point where preemption ends and the scheduler goes back to the start of the thread queue. I/O has also been removed since it just adds complexity; even if it would change the results, let's assume I'm talking about code that does not require any sort of I/O.
The red threads belong to my process, and the others (black) belong to other processes and apps.
There are a couple of questions here:
Why isn't it faster? What's wrong with my timeline?
What's that cycle point called?
Since the time slice is not fixed, does that mean the cycle time is fixed, or is the time slice calculated so that the cycle takes however long is needed to spend the calculated slice on each thread?
Is the slice time based on time or instruction? I mean, is it like 0.1 sec for each thread or like 10 instructions for each thread?
CPU utilization is based on CPU time, so why isn't it always at 100%? When a thread's time slice expires, the scheduler moves on to the next thread, and if a thread blocks on I/O it doesn't wait for it but runs the next one instead, so the CPU always tries to find a runnable thread and minimize idle time. Is the time spent on I/O so significant that more than 50% of CPU time is spent doing nothing because all threads are waiting for something, mostly I/O, and the remaining CPU time elapses waiting for a thread to become ready?
Note: this timeline is simplified; the time spent on I/O, thread creation, etc. is not counted, and it's assumed that the other threads do not finish before the end of the timeline and have the same priority/nice value as our process.

Thread scheduling behavior on Linux between steps with GDB compared to non-stepped execution with GDB

The GDB manual states that when using all-stop mode for debugging a multithreaded application, it is not possible to advance every thread in lock-step by exactly one statement. This makes sense since a step in GDB essentially allows all threads to be scheduled by the OS (however the OS decides to do this) until the next statement is reached by the thread for which the step was called.
My question is this: Is it reasonable to assume that the average scheduling behavior of the OS in between GDB steps is comparable to the average scheduling behavior of the OS when not stepping (while still using GDB to keep as many variables constant as possible), or does the stepping muck with the scheduling enough that the advancement of threads is not (on average) the same as without stepping?
If the stepping does affect the behavior, how can I get an accurate representation of multithreaded program flow and program state at discrete points in my program? Will recording and playing back be viable?
Is it reasonable to assume that the average scheduling behavior of the OS in between GDB steps is comparable to the average scheduling behavior of the OS when not stepping
Not really. The "not stepping" average behavior will have threads either running out of their time quanta or blocking on system calls. In the "stepping" case, the threads are unlikely to ever run out of their time quanta (because the time between steps is likely to be very short). So the average behavior is likely to be very different.
how can I get an accurate representation of multithreaded program flow and program state at discrete points in my program?
In general, you shouldn't care about multithreaded program flow. It is impossible to debug multithreaded programs that way.
When doing multithreaded programming, you must care about preserving invariants (every resource that can be accessed by multiple threads is protected against data races, etc.). If you do, your program will just work (TM). If you don't, you are unlikely to find all ways that the program misbehaves anyway.
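As a concrete, hypothetical illustration of what preserving an invariant looks like in practice, here is a minimal C sketch in which a shared counter is only ever modified while holding a mutex, so the result is the same no matter how the scheduler interleaves the threads:

/* Hypothetical example: a shared counter whose invariant ("value equals
 * the number of completed increments") is protected by a mutex. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;                 /* shared resource */

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* enter critical section */
        counter++;                       /* invariant holds while locked */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 200000, in any interleaving */
    return 0;
}

Compile with -pthread. Without the mutex the final value would depend on scheduling, which is exactly the kind of behavior you cannot debug by stepping.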

Thread Quantum: How to compute it

I have been reading a few posts and articles regarding thread quanta (here, here and here). Apparently Windows allocates a fixed number of CPU ticks per thread quantum depending on the Windows "mode" (server or something else). However, from the last link we can read:
(A thread quantum) between 10-200 clock ticks (i.e. 10-200 ms) under Linux, though some granularity is introduced in the calculation
Is there any way to compute the quantum length on Linux?
Does it make any sense to compute it anyway? (Since, from my understanding, threads can still be preempted, and nothing forces a thread to run for the full duration of its quantum.)
From a developer's perspective, I could see the interest in writing a program that could predict the running time of a program given its number of threads, and "what they do" (possibly removing all the testing to find the optimal number of threads would be kind of neat, although I am not sure it is the right approach)
On Linux, the default realtime quantum length is declared as the constant RR_TIMESLICE, at least in 4.x kernels; it is expressed in ticks, so its value depends on HZ, which is set when the kernel is configured.
The interval between pausing the thread whose quantum has expired and resuming it may depend on a lot of things like, say, load average.
To be able to predict the running time at least with some degree of accuracy, give the target process realtime priority under SCHED_RR; such processes are scheduled with a round-robin algorithm, which is generally simpler and more predictable than the default Linux scheduling algorithm.
To get the realtime quantum length, call sched_rr_get_interval().
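A minimal C sketch of both steps, switching the calling process to SCHED_RR and then asking the kernel for its round-robin interval (error handling is trimmed, and running under SCHED_RR typically requires root or CAP_SYS_NICE):

/* Sketch: put ourselves under the round-robin realtime policy and query
 * the quantum the kernel will use for us. Needs sufficient privileges. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct sched_param param = { .sched_priority = 1 };
    struct timespec quantum;

    if (sched_setscheduler(0, SCHED_RR, &param) != 0)
        perror("sched_setscheduler");            /* likely needs root */

    if (sched_rr_get_interval(0, &quantum) == 0)
        printf("SCHED_RR quantum: %ld.%09ld s\n",
               (long)quantum.tv_sec, quantum.tv_nsec);
    return 0;
}

Note that the reported interval only applies while the process actually runs under SCHED_RR; it says nothing about the time slices used by the normal scheduler.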

Ensuring even CPU time distribution among threads in Haskell

I have a planning algorithm written in Haskell which is tasked with evaluating a set of possible plans in a given amount of time, where the evaluation process is one which may be run for arbitrary amounts of time to produce increasingly accurate results. The natural and purportedly most efficient way to do this is to give each evaluation task its own lightweight Haskell thread, and have the main thread harvest the results after sleeping for the specified amount of time.
But in practice, invariably one or two threads will be CPU-starved for the entire available time. My own experimentation with semaphores/etc to control execution has shown this to be surprisingly difficult to fix, as I can't seem to force a given thread to stop executing (including using "yield" from Control.Concurrent.)
Is there a good known way to ensure that an arbitrary number of Haskell threads (not OS threads) each receive a roughly even amount of CPU-time over a (fairly short) span of wall-clock-time? Failing that, a good way to ensure that a number of threads executing an identical iteration fairly "take turns" on a given number of cores such that all cores are being used?
AFAIK, Haskell threads should all receive roughly equal amounts of CPU power as long as they are all actively trying to do work. The only reason that wouldn't happen is if they start making blocking I/O calls, or if each thread runs only for a few milliseconds or something.
Perhaps the problem you are seeing is actually that each thread just runs for a split second, yielding an unevaluated expression as its result, which the main thread then evaluates itself? If that were the case, it would look like the main thread is getting all the CPU time.

Why are processes deprived of CPU for TOO long while busy looping in the Linux kernel?

At first glance, my question might look bit trivial. Please bear with me and read completely.
I have identified a busy loop in my Linux kernel module. Because of it, other processes (e.g. sshd) are not getting CPU time for long spans of time (like 20 seconds). This is understandable, as my machine has only a single CPU and the busy loop gives the scheduler no chance to run other processes.
Just to experiment, I added schedule() after each iteration of the busy loop. Even though this keeps the CPU busy, it should still let other processes run, since I am calling schedule(). But this doesn't seem to be happening. My user-level processes are still hanging for long spans of time (20 seconds).
In this case, the kernel thread got nice value -5 and the user-level threads got nice value 0. Even with the lower priority of the user-level threads, I think 20 seconds is too long to go without CPU.
Can someone please explain why this could be happening?
Note: I know how to remove the busy loop completely, but I want to understand the kernel's behaviour here. The kernel version is 2.6.18 and kernel preemption is disabled.
The schedule() function simply invokes the scheduler - it doesn't take any special measures to arrange that the calling thread will be replaced by a different one. If the current thread is still the highest priority one on the run queue then it will be selected by the scheduler once again.
It sounds as if your kernel thread is doing very little work in its busy loop and it's calling schedule() every time round. Therefore, it's probably not using much CPU time itself and hence doesn't have its priority reduced much. Negative nice values carry heavier weight than positives, so the difference between a -5 and a 0 is quite pronounced. The combination of these two effects means I'm not too surprised that user space processes miss out.
As an experiment you could try calling the scheduler every Nth iteration of the loop (you'll have to experiment to find a good value of N for your platform) and see if the situation is better - calling schedule() too often will just waste lots of CPU time in the scheduler. Of course, this is just an experiment - as you have already pointed out, avoiding busy loops is the correct option in production code, and if you want to be sure your thread is replaced by another then set it to be TASK_INTERRUPTIBLE before calling schedule() to remove itself from the run queue (as has already been mentioned in comments). A rough sketch of that experiment follows.
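In the sketch below, do_work() and stop_requested() are hypothetical stand-ins for the module's real loop body and exit condition, and YIELD_EVERY is a guess you would have to tune per platform:

/* Hypothetical kernel-side sketch: yield to the scheduler only every
 * Nth iteration of the busy loop instead of on every iteration. */
#include <linux/sched.h>

#define YIELD_EVERY 1000                  /* hypothetical value; tune per platform */

extern int stop_requested(void);          /* stand-in for the module's exit condition */
extern void do_work(void);                /* stand-in for the real loop body */

static void busy_loop(void)
{
    unsigned long iter = 0;

    while (!stop_requested()) {
        do_work();

        if (++iter % YIELD_EVERY == 0) {
            /* Variant 1: offer a voluntary scheduling point. We may be
             * picked again immediately if we are still highest priority. */
            schedule();

            /* Variant 2 (as suggested above): leave the run queue for at
             * least one tick, guaranteeing someone else can run:
             *   set_current_state(TASK_INTERRUPTIBLE);
             *   schedule_timeout(1);
             */
        }
    }
}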
Note that your kernel (2.6.18) is using the O(1) scheduler which existed until the Completely Fair Scheduler was added in 2.6.23 (the O(1) scheduler having been added in 2.6 to replace the even older O(n) scheduler). The CFS doesn't use the per-priority run queues of the O(1) scheduler and works in a different way, so you might well see different behaviour - I'm less familiar with it, however, so I wouldn't like to predict exactly what differences you'd see. I've seen enough of it to know that "completely fair" isn't the term I'd use on heavily loaded SMP systems with a large number of both cores and processes, but I also accept that writing a scheduler is a very tricky task and it's far from the worst I've seen, and I've never had a significant problem with it on a 4-8 core desktop machine.
