How is 999µs too short but 1000µs just right? - linux

When I run the following code, I get some output:
use std::thread::Thread;

static DELAY: i64 = 1000;

fn main() {
    Thread::spawn(move || {
        println!("some output");
    });
    std::io::timer::sleep(std::time::duration::Duration::microseconds(DELAY));
}
But if I set DELAY to 999, I get nothing. 999 and 1000 seem close enough that they shouldn't cause such a difference, so there must be something else going on here. I've also tried Duration::nanoseconds (999_999 and 1_000_000), and I see the same behavior.
My platform is Linux and I can reproduce this behavior nearly all the time: using 999 results in some output in way less than 1% of runs.
As a sidenote, I am aware that this approach is wrong.

The sleep function sleeps in increments of 1 millisecond, and if the number of milliseconds is less than 1 it does not sleep at all. Here is the relevant excerpt from the code:
pub fn sleep(&mut self, duration: Duration) {
    // Short-circuit the timer backend for 0 duration
    let ms = in_ms_u64(duration);
    if ms == 0 { return }
    self.inner.sleep(ms);
}
In your code, 999 microseconds is truncated to 0 milliseconds, so the call returned without sleeping at all, and the main thread ended before the spawned thread could print its output. With 1000 microseconds, i.e. 1 millisecond, the main thread actually slept, giving the spawned thread a chance to run.
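Since the real problem is that main returns before the spawned thread gets a chance to run, the usual fix is to keep the handle returned by spawn and join it instead of guessing a sleep duration. A minimal sketch, using the current std::thread API rather than the pre-1.0 API from the question:

use std::thread;

fn main() {
    let handle = thread::spawn(|| {
        println!("some output");
    });
    // Block until the spawned thread has finished; no sleep needed.
    handle.join().unwrap();
}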

Most probably your kernel is configured with a tick of 1000 Hz (one clock interrupt per millisecond). You might be able to improve this by recompiling the kernel with a finer-grained tick, or as a tickless kernel, to allow finer clock resolution. A 1000 Hz tick is nowadays standard in Linux kernels running on PCs (and on most ARM and embedded Linux systems).
This is not a newbie task, so you may need to ask for local help to reconfigure and recompile your kernel for higher timer resolution.
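One quick way to see the effective sleep granularity on your machine, without touching the kernel configuration, is to time a deliberately tiny sleep and see how long it actually takes. A minimal sketch in current Rust:

use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Ask for a 1 ns sleep; the measured duration reveals the effective
    // granularity (timer tick plus timer slack) on this system.
    let start = Instant::now();
    thread::sleep(Duration::from_nanos(1));
    println!("asked for 1 ns, slept for {:?}", start.elapsed());
}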

Related

Analyzing Context Switch in Multithread [duplicate]

I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
The context switch time is then (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop and average, the smaller the results I get.
Edit
Here is a snippet of the code that I have:
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;

...

// each thread function looks similar to this
void* thread_1_func(void* time)
{
    thread_time* tt = (thread_time*) time;   // renamed so the variable doesn't shadow the typedef
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(tt->start));
    for (x = 0; x < loop; ++x)
    {
        // where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(tt->end));
    return NULL;
}

void* thread_2_func(void* time)
{
    // similar as above
}
int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamps the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create two threads with the time structs as the arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // waits for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamps the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time and
    // the combined execution time of the two threads...
}
First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong; this clock gives the time spent in that thread, in user mode. However, the context switch does not happen in user mode, so you'd want to use another clock. Also, on multiprocessor systems the clocks can give different values from one processor to another! Thus I suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. However, be warned that even if you read either of these twice in rapid succession, the two timestamps will usually already be tens of nanoseconds apart.
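To get a feel for that baseline, you can time back-to-back reads of the monotonic clock and look at the smallest gap, which puts a floor on anything you can resolve with this method. A minimal sketch in Rust (whose Instant is backed by clock_gettime(CLOCK_MONOTONIC) on Linux):

use std::time::Instant;

fn main() {
    // The smallest gap between two consecutive readings approximates the
    // overhead of a single clock read.
    let mut min_gap_ns = u128::MAX;
    for _ in 0..1_000_000 {
        let a = Instant::now();
        let b = Instant::now();
        min_gap_ns = min_gap_ns.min((b - a).as_nanos());
    }
    println!("minimum gap between two clock reads: {min_gap_ns} ns");
}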
As for context switches - there are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that the SSE/FP registers will be saved lazily, save the stack pointer, load the new stack pointer and return from that function - since the other thread had done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower: the user-space page tables must be switched by loading the CR3 register, which flushes TLB entries and causes misses in the TLB, the cache that maps virtual addresses to physical ones.
However, the <1 ns context-switch/system-call overhead does not really seem plausible - it is very probable that there is either hyperthreading or 2 CPU cores here, so I suggest that you set the CPU affinity on that process so that Linux only ever runs it on, say, the first CPU core:
#define _GNU_SOURCE   /* needed for CPU_ZERO/CPU_SET and sched_setaffinity() */
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);                                    /* pin to the first CPU core */
int result = sched_setaffinity(0, sizeof(mask), &mask);
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching floating point / SSE stacks (this happens lazily), you should have some floating point variables and do calculations on them prior to context switch, then add say .1 to some volatile floating point variable after the context switch to see if it has an effect on the switching time.
This is not straightforward, but as usual someone has already done a lot of work on this. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note: while the user was running the test in the link above, they were also hammering the machine with games and compilation, which is why the context switches were taking so long. Some more info here...
how can you measure the time spent in a context switch under java platform

Waking up a sleeping thread on sigint in Rust?

Consider the following simple Rust program:
use std::time::Duration;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use ctrlc;

static running: AtomicBool = AtomicBool::new(true);

fn main() {
    // Set up a thread that registers the sigint signal.
    ctrlc::set_handler(|| {
        running.store(false, Ordering::SeqCst);
    });
    // Loop as long as the signal has not been registered.
    while running.load(Ordering::SeqCst) {
        println!("Hello!");
        thread::sleep(Duration::from_secs(10));
    }
    println!("Goodbye!");
}
It prints "Hello!" every ten seconds until someone presses Ctrl+C, upon which it prints "Goodbye!" and exits. The problem is when Ctrl+C is pressed right after the thread goes to sleep: the user then has to wait almost ten seconds before the program exits.
Is there any way to get around this and somehow wake up the thread when the SIGINT signal is received? I'm prepared to swap the ctrlc dependency for something else if that helps.
The only solution I have been able to come up with is to sleep in ten one-second intervals instead, checking the flag before going back to sleep at every wakeup. Is there a simpler and prettier way to do it?
As the doc says:
On Unix platforms, the underlying syscall may be interrupted by a spurious wakeup or signal handler. To ensure the sleep occurs for at least the specified duration, this function may invoke that system call multiple times. Platforms which do not support nanosecond precision for sleeping will have dur rounded up to the nearest granularity of time they can sleep for.
So I propose you use a lower-level function directly; there is a crate that wraps one, shuteye, but I don't know whether it's a good one.
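Another option that avoids a lower-level sleep entirely is to replace the flag with a channel and wait with a timeout, so the handler's send wakes the loop immediately. A minimal sketch, assuming the same ctrlc crate as in the question:

use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<()>();

    // The handler runs on ctrlc's signal-handling thread; sending on the
    // channel unblocks the main loop right away.
    ctrlc::set_handler(move || {
        let _ = tx.send(());
    })
    .expect("failed to set Ctrl-C handler");

    loop {
        println!("Hello!");
        // Wait up to ten seconds, but return as soon as the signal arrives.
        match rx.recv_timeout(Duration::from_secs(10)) {
            Err(RecvTimeoutError::Timeout) => continue,
            _ => break, // signal received (or the handler was dropped)
        }
    }
    println!("Goodbye!");
}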

How should I spawn threads for parallel computation?

Today, I got into multi-threading. Since it's a new concept, I thought I could begin to learn by translating a simple iteration to a parallelized one. But, I think I got stuck before I even began.
Initially, my loop looked something like this:
let stuff: Vec<u8> = items.into_iter().map(|item| {
    some_item_worker(&item)
}).collect();
I had put a reasonably large amount of stuff into items and it took about 0.05 seconds to finish the computation. So, I was really excited to see the time reduction once I successfully implemented multi-threading!
When I used threads, I got into trouble, probably due to my bad reasoning.
use std::thread;

let threads: Vec<_> = items.into_iter().map(|item| {
    thread::spawn(move || {
        some_item_worker(&item)
    })
}).collect(); // yeah, this is followed by another iter() that unwraps the values
I have a quad-core CPU, which means that I can run only up to 4 threads concurrently. I guessed that it worked this way: once the iterator starts, threads are spawned. Whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently.
The result was that it took (after some re-runs) ~0.2 seconds to finish the same computation. Clearly, there's no parallel computing going on here. I don't know why the time increased by 4 times, but I'm sure that I've misunderstood something.
Since this isn't the right way, how should I go about modifying the program so that the threads execute concurrently?
EDIT:
I'm sorry, I was wrong about that ~0.2 seconds. I woke up and tried it again, and noticed that even the plain iteration now ran for 2 seconds. It turned out that some process had been consuming memory wildly. When I rebooted my system and tried the threaded iteration again, it ran for about 0.07 seconds. Here are some timings for each run.
Actual iteration (first one):
0.0553760528564 seconds
0.0539519786835 seconds
0.0564560890198 seconds
Threaded one:
0.0734670162201 seconds
0.0727820396423 seconds
0.0719120502472 seconds
I agree that the threads are indeed running concurrently, but it seems to take another 20 ms to finish the job. My actual goal was to utilize my processor to run the threads in parallel and finish the job sooner. Is this going to be complicated? What should I do to make those threads run in parallel, not just concurrently?
I have a quad-core CPU, which means that I can run only up to 4 threads concurrently.
Only 4 may be running concurrently, but you can certainly create more than 4...
whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently (it was just a guess).
Whenever you have a guess, you should create an experiment to figure out if your guess is correct. Here's one:
use std::{thread, time::Duration};

fn main() {
    let threads: Vec<_> = (0..500)
        .map(|i| {
            thread::spawn(move || {
                println!("Thread #{i} started!");
                thread::sleep(Duration::from_millis(500));
                println!("Thread #{i} finished!");
            })
        })
        .collect();

    for handle in threads {
        handle.join().unwrap();
    }
}
If you run this, you will see that "Thread XX started!" is printed out 500 times, followed by 500 "Thread XX finished!"
Clearly, there's no parallel computing going on here
Unfortunately, your question isn't fleshed out enough for us to tell why your time went up. In the example I've provided, it takes a little less than 600 ms, so it's clearly not happening in serial!
Creating a thread has a cost. If the cost of the computation inside the thread is small enough, it'll be dwarfed by the cost of the threads or the inefficiencies caused by the threads.
For example, spawning 10 million threads to double 10 million u8s will probably not be worth it. Vectorizing it would probably yield better performance.
That said, you still might be able to get some improvement by parallelizing cheap tasks. But you'll want fewer threads: either a thread pool with a small number of worker threads (so only a handful of threads exist at any given point, and there is less CPU contention), or something more sophisticated under the hood but with a quite simple API, like Rayon.
use rayon::prelude::*; // brings par_iter() into scope

// Notice `.par_iter()` turns it into a `parallel iterator`
let stuff: Vec<u8> = items.par_iter().map(|item| {
    some_item_worker(&item)
}).collect();
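If you would rather not add a dependency, a hand-rolled version of the same idea splits the input into one chunk per core and uses std::thread::scope (Rust 1.63+). This is only a rough sketch; some_item_worker below is a placeholder standing in for the real work from the question:

use std::thread;

// Placeholder for the question's worker function.
fn some_item_worker(item: &u8) -> u8 {
    item.wrapping_mul(2)
}

fn main() {
    let items: Vec<u8> = (0u8..=255).cycle().take(1_000_000).collect();
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let chunk_size = (items.len() + workers - 1) / workers;

    let mut stuff: Vec<u8> = Vec::with_capacity(items.len());
    thread::scope(|s| {
        // Spawn one thread per chunk, then join in order so results stay ordered.
        let handles: Vec<_> = items
            .chunks(chunk_size.max(1))
            .map(|chunk| s.spawn(move || chunk.iter().map(some_item_worker).collect::<Vec<u8>>()))
            .collect();
        for handle in handles {
            stuff.extend(handle.join().unwrap());
        }
    });
    println!("processed {} items", stuff.len());
}

Keeping the thread count near the number of cores avoids the per-thread overhead that dominates when you spawn one thread per item.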

Accurate interval timer inside rendering loop linux

What is the best way to do the following in Linux
while(continue)
{
    render(); // this function will take a large fraction of the framerate
    wait();   // wait until the full frame period has expired
}
On Windows, waitable timers seem to work pretty well (within 1 ms). One way of proceeding is to use a separate thread that just sleeps and triggers a synchronization mechanism, but I do not know how much overhead that adds.
Note: accuracy is more important than high frequency: an accurate 1 kHz timer is preferred over an inaccurate 1 MHz one.
Assuming you're looking for an answer in the C language:
I don't remember the precision, but I recall I used to use the setitimer() function when I needed good precision.
Here's an example of how to use it: http://docs.oracle.com/cd/E23824_01/html/821-1602/chap7rt-89.html
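If you would rather stay in plain user space than set up an interval timer, the same effect can be had by computing the next frame deadline yourself and sleeping only for the remainder of the period. A rough sketch of that pattern in Rust (not the setitimer() approach above), with render() as a placeholder:

use std::thread;
use std::time::{Duration, Instant};

fn render() {
    // Placeholder for the real rendering work.
    thread::sleep(Duration::from_millis(5));
}

fn main() {
    let frame_period = Duration::from_millis(16); // ~60 Hz, pick to taste
    let mut next_deadline = Instant::now() + frame_period;

    for _frame in 0..600 {
        render();
        // Sleep only for whatever is left of this frame; skip it if we overran.
        let now = Instant::now();
        if next_deadline > now {
            thread::sleep(next_deadline - now);
        }
        // Advance by the period (not from "now") so drift does not accumulate.
        next_deadline += frame_period;
    }
}

On Linux, nanosleep-backed sleeps like this are usually accurate to within a fraction of a millisecond, which matches the roughly 1 kHz accuracy asked for here.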

Expected duration of sleep(1)

What is the expected duration of a call to sleep with one as the argument? Is it some random time that doesn't exceed 1 second? Is it some random time that is at least one second?
Scenario:
Developer A writes code that performs some steps in sequence with an output device. The code is shipped and A leaves.
Developer B is advised from the field that steps j and k need a one-second interval between them. So he inserts a call to sleep(1) between those steps. The code is shipped and Developer B leaves.
Developer C wonders if the sleep(1) should be expected to sleep long enough, or whether a higher-resolution method should be used to make sure that at least 1000 milliseconds of delay occurs.
sleep() only guarantees that the process will sleep for at least the amount of time specified (barring interruption by a signal), so, as you put it, "some random time that is at least one second."
Similar behavior is mentioned in the man page for nanosleep:
nanosleep() suspends the execution of the calling thread until either at least the time specified in *req has elapsed...
You might also find the answers in this question useful.
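If you want to convince yourself of the "at least" behaviour empirically, it is easy to time the call. A tiny sketch using Rust's thread::sleep, whose documentation (quoted elsewhere on this page) makes the same at-least-the-requested-duration guarantee:

use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let start = Instant::now();
    thread::sleep(Duration::from_secs(1));
    // Prints slightly more than one second; the sleep is re-issued if a
    // signal handler interrupts it, so it never returns early.
    println!("slept for {:?}", start.elapsed());
}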
my man-page says this:
unsigned int sleep(unsigned int seconds);
DESCRIPTION
sleep() makes the calling thread sleep until seconds seconds have
elapsed or a signal arrives which is not ignored.
...
RETURN VALUE
Zero if the requested time has elapsed, or the number of seconds left
to sleep, if the call was interrupted by a signal handler.
So sleep makes the thread sleep for as long as you tell it, but a signal wakes it up. I see no further guarantees.
If you need a better, more precise waiting time, then sleep is not good enough. There is nanosleep, and (it may sound funny, but it's true) select is the only POSIX-portable way I am aware of to sleep for sub-second (or otherwise higher-precision) intervals.
