Analyzing Context Switch in Multithread [duplicate] - multithreading

I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can just loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop it over and take the average, the smaller the results I get.
Edit
here is a snippet of the code that I have
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;

...

// each thread function looks similar to this
void* thread_1_func(void* time)
{
    thread_time* t = (thread_time*) time;   // renamed so the variable does not shadow the typedef
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for (int x = 0; x < loop; ++x)
    {
        // where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}

void* thread_2_func(void* time)
{
    // similar to the above
}
int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamp the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create two threads with the time structs as the arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // wait for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamp the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time
    // and the sum of the two threads' execution times..
}
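
For reference, the handoff hidden behind the "// where it switches to another thread" comment could be a mutex/condition-variable ping-pong along these lines (a minimal sketch; the names lock, cond and turn are illustrative and not from the original code):

// Shared between the two threads; the threads alternate on `turn`.
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn = 1;

// Loop body in thread 1; thread 2 is the mirror image (waits for turn == 2, sets turn = 1).
pthread_mutex_lock(&lock);
while (turn != 1)
    pthread_cond_wait(&cond, &lock);   // blocking here is what forces a context switch
turn = 2;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);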

First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong; this clock gives the CPU time spent in that thread itself. However, the context switch does not happen inside the thread, so its cost is not charged to that clock, and you'd want to use another clock. Also, on multiprocessor systems the clocks can give different values from one processor to another! Thus I suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. However, be warned that even if you read either of these twice in rapid succession, the timestamps will usually already be tens of nanoseconds apart.
As for context switches - there are many kinds of context switch. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that the SSE/FP registers will be saved lazily, save the stack pointer, load the new stack pointer and return from that function - since the other thread has done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower: the page tables must be switched by reloading the CR3 register, which flushes entries from the TLB, the cache that maps virtual addresses to physical ones, and the resulting TLB misses slow everything down.
However, the <1 ns context switch/system call overhead does not really seem plausible - it is very probable that there is either hyperthreading or 2 CPU cores here, so I suggest you set the CPU affinity on that process so that Linux only ever runs it on, say, the first CPU core:
#define _GNU_SOURCE      // sched_setaffinity and the CPU_* macros are GNU extensions
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
int result = sched_setaffinity(0, sizeof(mask), &mask);
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching the floating point/SSE state (which happens lazily), you should have some floating point variables and do calculations on them prior to the context switch, then add, say, .1 to some volatile floating point variable after the context switch to see whether it has an effect on the switching time.
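A sketch of that floating-point experiment, reusing the x/loop variables from the question's snippet (the constants are arbitrary; the point is only to dirty and then touch the FP/SSE state on both sides of the switch):

volatile double fp_sink = 0.0;
double acc = 1.0;

for (int x = 0; x < loop; ++x)
{
    acc = acc * 1.000001 + 0.5;   // dirty the FP/SSE registers before the switch
    // ... hand control to the other thread here (the mutex/condvar ping-pong) ...
    fp_sink += 0.1;               // forces the lazily-saved FP state to be restored
}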

This is not straightforward, but as usual someone has already done a lot of work on this. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -pthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note that while the user in the link above was running the test, they were also hammering the machine with games and compilation, which is why the context switches were taking so long. Some more info here...
how can you measure the time spent in a context switch under java platform

Related

Pinning a process to any CPU respecting affinity

Let's say I want to programmatically pin the current process to a single CPU, but I don't care which CPU that is.
One easy way is to use sched_setaffinity with a fixed CPU number, probably 0, since there should always be a "CPU 0"[1].
However, this approach fails if the affinity of the process has already been restricted to a subset of the existing CPUs that doesn't include the one you picked, e.g., by launching it with taskset.
So I want to pick "any CPU" to pin to, but only out of the CPUs that the current affinity mask allows. Here's one approach:
cpu_set_t cpu_set;
if (sched_getaffinity(0, sizeof(cpu_set), &cpu_set)) {
    err("failed while getting existing cpu affinity");
}
for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (CPU_ISSET(cpu, &cpu_set)) {
        CPU_ZERO(&cpu_set);
        CPU_SET(cpu, &cpu_set);
        break;
    }
}
int result = sched_setaffinity(0, sizeof(cpu_set), &cpu_set);
Basically we get the current affinity mask, then loop over every possible CPU looking for the first one that is allowed, then pass a mask with only this CPU set to sched_setaffinity.
However, if the current affinity mask has changed between the get and the set call, the set call will fail. Is there any way around this race condition?
[1] Although CPU zero won't always be online.
You could use getcpu() to discover the CPU that your process is currently running on, and use the result to set affinity to that CPU:

unsigned mycpu = 0;
if (-1 == getcpu(&mycpu, NULL, NULL)) {
    // handle error
}
Presumably any CPU affinity rules that are in place would be honored by the scheduler, thus the getcpu() call would return a CPU that the process is allowed to run on.
There's still the potential for the affinity set to change, but that seems like a very unlikely case, and the allowed CPUs might be changed at some point in the future anyway, outside the control of the process in question.
I suppose you could detect the error in the sched_setaffinity() call and retry until the call succeeds...
Considering that the affinity mask of the process can change at any moment, you can iteratively try to pin the process to the current CPU and stop as soon as it succeeds.

cpu_set_t cpu_set;
int cpu = 0;
int result = -1;

while (result < 0) {
    cpu = sched_getcpu();
    if (cpu >= 0) {       // sched_getcpu() returns -1 on error; CPU 0 is a valid result
        CPU_ZERO(&cpu_set);
        CPU_SET(cpu, &cpu_set);
        result = sched_setaffinity(0, sizeof(cpu_set), &cpu_set);
    }
}

printf in RT thread

I am writing a multi-threaded application in Linux.
There is no RT patch in the kernel, yet I use threads with priorities.
On checking the time it takes to execute printf, I measure different values every time, even though it is done in the highest-priority thread:

if (clock_gettime(CLOCK_MONOTONIC, &start))
{
    /* handle error */
}

for (int i = 0; i < 1000; i++)
    printf("hello world");

if (clock_gettime(CLOCK_MONOTONIC, &end))
{
    /* handle error */
}

elapsedSeconds = TimeSpecToSeconds(&end) - TimeSpecToSeconds(&start);

Why does printf change the timing, and in a non-deterministic way, i.e. why is each measurement different?
How should printf be used with RT threads?
Can it be used inside an RT thread, or should it be avoided entirely?
Should writing to disk be treated the same way as printf? Should it only be done in a separate, low-priority thread?
Under the hood, printf goes through buffered I/O, which is a non-real-time (and potentially blocking) mechanism.
It is not only non-deterministic, it also opens up the possibility of priority inversion.
You should be very careful using it from a real-time thread (I would say avoid it entirely).
Normally, in latency-bound code you would instead write log records wait-free into a chain of (pre-allocated or memory-mapped) ring buffers and flush them from a background, lower-priority thread (or even a separate process).
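A very rough sketch of that pattern, assuming a single real-time producer and a single low-priority consumer (a fixed-size, pre-allocated single-producer/single-consumer ring; all names, sizes and the use of C11 atomics are illustrative choices, not part of the answer above):

#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define RING_SLOTS 1024                      // power of two, pre-allocated
#define MSG_LEN    64

static char slots[RING_SLOTS][MSG_LEN];      // never malloc'd or written by printf in the RT path
static atomic_size_t head = 0;               // written only by the RT producer
static atomic_size_t tail = 0;               // written only by the low-priority consumer

// Called from the RT thread: copy the message in, never block; drop it if the ring is full.
int rt_log(const char *msg)
{
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == RING_SLOTS)
        return -1;                           // full: dropping is better than blocking
    strncpy(slots[h % RING_SLOTS], msg, MSG_LEN - 1);
    slots[h % RING_SLOTS][MSG_LEN - 1] = '\0';
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 0;
}

// Called from a low-priority thread: drain the ring and do the blocking printf here.
void drain_log(void)
{
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_acquire);
    while (t != h) {
        printf("%s\n", slots[t % RING_SLOTS]);
        t++;
    }
    atomic_store_explicit(&tail, t, memory_order_release);
}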

How do I block all other processes on a Linux machine for XXX milliseconds?

I'm working with a Linux embedded SMP system that does audio I/O using ALSA and an external USB audio device, using a 3.6.6 kernel. Problem: I'm getting infrequent (once every few weeks) system hiccups that are causing the audio stream to die. Although it's tough to be sure, the hiccups look like they lock up the entire system for a few dozen milliseconds.
I can write ALSA code to recover after one of these hiccups, but since it's ALSA some trial and error will be required. Add that to having to wait weeks for a reoccurrence, and I'll be up a creek with a crowbar. I really need a way to cause the problem on demand.
I'd like to write a C program that runs as root and blocks all other processes on the system for a given number of milliseconds. I imagine it would involve disabling interrupts, doing a delay loop (since the timers will probably fail), and then restoring interrupts. But, I have to do it in such a way that the whole system doesn't go belly up.
Any ideas on how I would write such a program?
You could try raising the priority of your process and then using one of the "realtime" scheduling policies (e.g. SCHED_FIFO). This will help make sure that your process gets scheduled more consistently, even when other processes are running.
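For example, something along these lines puts the calling process under SCHED_FIFO at the highest priority (a sketch, not from the answer above; it needs root or an appropriate RLIMIT_RTPRIO to succeed):

#include <sched.h>
#include <stdio.h>

struct sched_param sp = { 0 };
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);   // 99 on a typical Linux kernel
if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)         // 0 = the calling process
    perror("sched_setscheduler");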
Well, based on CL's tip, and on information from http://www.tldp.org/HOWTO/text/IO-Port-Programming, I wrote the following code:
#include <stdio.h>
#include <sys/io.h>     // for iopl()

int main(int argc, char *argv[]) {
    long i;
    volatile long j;    // volatile so the delay loop is not optimized away

    printf("About to lock system!\n");

    // Boost I/O privilege level
    iopl(3);

    // Clear interrupt flag, masking interrupts
    asm volatile ("cli");

    // Wait about a second (with some hijinks to keep
    // the loop from being optimized into oblivion)
    j = 1;
    for (i = 0; i < 250000000; i++) {
        j *= i;
    }

    // Restore interrupt flag, restoring interrupts
    asm volatile ("sti");

    // Restore I/O privilege level
    iopl(0);

    printf("Phew! Survived!\n");
    return 0;
}
When run as root, it works! Although not everything is suspended (and it's not clear to me what is and what isn't), enough locks up that my ALSA stream fails quite nicely. So, now I can stimulate the problem and ensure my code can handle it.
One note: I'd assumed that between the CLI and the STI, system timing routines would fail due to the lack of interrupts. However, when I tried usleep() just for the heck of it, the timing code worked! But the code as a whole didn't, because the call re-enabled interrupts, making the tool useless. Hence the simple delay loop.

context switch measure time

I wonder if any of you know how to use the function get_timer() to measure the time for a context switch.
How do I find the average?
When should I display it?
Could someone help me out with this?
Is there any expert who knows this?
One fairly straightforward way would be to have two threads communicating through a pipe. One thread would do (pseudo-code):
for (n = 1000; n--;) {
    now = clock_gettime(CLOCK_MONOTONIC_RAW);
    write(pipe, now);
    sleep(1msec);   // to make sure that the other thread blocks again on the pipe read
}
Another thread would do:
context_switch_times[1000];
for (n = 1000; n--;) {
    time = read(pipe);
    now = clock_gettime(CLOCK_MONOTONIC_RAW);
    context_switch_times[n] = now - time;
}
That is, it would measure the time duration between when the data was written into the pipe by one thread and the time when the other thread woke up and read that data. A histogram of context_switch_times array would show the distribution of context switch times.
The times would include the overhead of the pipe read and write and of getting the time; however, it gives a good sense of how big the context switch times are.
In the past I did a similar test using stock Fedora 13 kernel and real-time FIFO threads. The minimum context switch times I got were around 4-5 usec.
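Fleshed out, that pipe experiment could look roughly like this (a sketch, not a tuned benchmark: it uses CLOCK_MONOTONIC, ignores the return values of read/write, and the measured latency includes the pipe and clock_gettime overhead mentioned above). Compile with gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000

static int fds[2];                            // fds[0] = read end, fds[1] = write end

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *writer(void *arg)
{
    (void)arg;
    struct timespec pause = { 0, 1000000 };   // 1 ms: let the reader block on the pipe again
    for (int n = 0; n < ITERATIONS; n++) {
        long long t = now_ns();
        write(fds[1], &t, sizeof t);
        nanosleep(&pause, NULL);
    }
    return NULL;
}

int main(void)
{
    pipe(fds);
    pthread_t tid;
    pthread_create(&tid, NULL, writer, NULL);

    long long total = 0;
    for (int n = 0; n < ITERATIONS; n++) {
        long long sent;
        read(fds[0], &sent, sizeof sent);     // blocks until the writer wakes us up
        total += now_ns() - sent;
    }
    pthread_join(tid, NULL);
    printf("average wake-up latency: %lld ns\n", total / ITERATIONS);
    return 0;
}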
I don't think we can measure this time accurately from user space, since you never know when the kernel will pick your process up again after its time slice expires. So whatever you get in user space includes scheduling delays as well. From user space you can get a closer measurement, but not always an exact one. Even a jiffy of delay matters.
I believe LTTng can be used to capture detailed traces of context switch timings, among other things.

What can make a program run slower when using more threads?

This question is about the same program I previously asked about. To recap, I have a program with a loop structure like this:
for (int i1 = 0; i1 < N; i1++)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                histogram[bin_index(i1, i2, i3, i4)] += 1;
bin_index is a completely deterministic function of its arguments which, for purposes of this question, does not use or change any shared state - in other words, it is manifestly reentrant.
I first wrote this program to use a single thread. Then I converted it to use multiple threads, such that thread n runs all iterations of the outer loop where i1 % nthreads == n. So the function that runs in each thread looks like
for (int i1 = n; i1 < N; i1 += nthreads)
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            for (int i4 = 0; i4 < N; i4++)
                thread_local_histogram[bin_index(i1, i2, i3, i4)] += 1;
and all the thread_local_histograms are added up in the main thread at the end.
Here's the strange thing: when I run the program with just 1 thread for some particular size of the calculation, it takes about 6 seconds. When I run it with 2 or 3 threads, doing exactly the same calculation, it takes about 9 seconds. Why is that? I would expect that using 2 threads would be faster than 1 thread since I have a dual-core CPU. The program does not use any mutexes or other synchronization primitives so two threads should be able to run in parallel.
For reference: typical output from time (this is on Linux) for one thread:
real 0m5.968s
user 0m5.856s
sys 0m0.064s
and two threads:
real 0m9.128s
user 0m10.129s
sys 0m6.576s
The code is at http://static.ellipsix.net/ext-tmp/distintegral.ccs
P.S. I know there are libraries designed for exactly this kind of thing that probably could have better performance, but that's what my last question was about so I don't need to hear those suggestions again. (Plus I wanted to use pthreads as a learning experience.)
To avoid further comments on this: when I wrote my reply, the questioner hadn't posted a link to his source yet, so I could not tailor my reply to his specific issues. I was only answering the general question of what "can" cause such an issue; I never said that this will necessarily apply to his case. When he posted a link to his source, I wrote another reply that focuses exclusively on his specific issue (which is caused by the use of the random() function, as I explain in my other reply). However, since the question of this post is still "What can make a program run slower when using more threads?" and not "What makes my very specific application run slower?", I've seen no need to change my rather general reply either (general question -> general response, specific question -> specific response).
1) Cache Poisoning
All threads access the same array, which is a block of memory. Each core has its own cache to speed up memory access. Since the threads don't just read from the array but also change its contents, the content is actually changed in the cache only, not in real memory (at least not immediately). The problem is that the other thread on the other core may have overlapping parts of that memory cached. If core 1 now changes a value in its cache, it must tell core 2 that this value has just changed. It does so by invalidating the cache content on core 2, and core 2 then needs to re-read the data from memory, which slows processing down. Cache poisoning can only happen on multi-core or multi-CPU machines; if you just have one CPU with one core, this is no problem. So to find out whether this is your issue, just disable one core (most OSes will let you do that) and repeat the test. If it is now almost equally fast, that was your problem.
2) Preventing Memory Bursts
Memory is read fastest if it is read sequentially in bursts, just like when files are read from a hard disk. Addressing a particular point in memory is actually awfully slow (just like the "seek time" on a hard disk), even if your PC has the best memory on the market. However, once that point has been addressed, sequential reads are fast. The first access goes by sending a row index and a column index, with waiting times in between, before the first data can be delivered. Once that data arrives, the CPU starts bursting: while the data is still on the way, it already sends the request for the next burst. As long as it keeps the burst going (by always sending "next line, please" requests), the RAM will continue to pump out data as fast as it can (and this is actually quite fast!). Bursting only works if data is read sequentially and only if the memory addresses grow upwards (AFAIK you cannot burst from high to low addresses). If two threads now run at the same time and both keep reading/writing memory, but from completely different memory addresses, then each time thread 2 needs to read/write data, it must interrupt a possible burst of thread 1, and the other way round. This issue gets worse with even more threads, and it is also an issue on a system that has only one single-core CPU.
BTW, running more threads than you have cores will never make your process any faster (you mentioned 3 threads); it will rather slow it down (thread context switches have side effects that reduce processing throughput) - unless you run extra threads because some threads sleep or block on certain events and thus cannot actively process any data. In that case it may make sense to run more threads than you have cores.
Everything I said in my other reply still holds true in general, since your question was what "can" cause this... however, now that I've seen your actual code, my first bet is that your usage of the random() function slows everything down. Why?
See, random() keeps a global variable in memory that stores the last random value calculated. Each time you call random() (and you are calling it twice within a single function) it reads the value of this global variable, performs a calculation (which is not so fast; random() alone is a slow function) and writes the result back before returning it. This global variable is not per thread, it is shared among all threads. So what I wrote regarding cache poisoning applies here all the time (even though you avoided it for the array by having separate arrays per thread; this was very clever of you!). This value is constantly invalidated in the cache of either core and must be re-fetched from memory. However, if you only have a single thread, nothing like that happens; the variable never leaves the cache after it has initially been read, since it is permanently accessed again and again.
To make things even worse, glibc has a thread-safe version of random() - I just verified that by looking at the source. While this seems to be a good idea, it means that each random() call will cause a mutex to be locked, memory to be accessed, and a mutex to be unlocked. Thus two threads calling random() at exactly the same moment will cause one thread to be blocked for a couple of CPU cycles. This is implementation-specific, though, as AFAIK it is not required that random() be thread-safe; most standard library functions are not required to be thread-safe, since the C standard is not even aware of the concept of threads in the first place. When the threads are not calling it at the same moment, the mutex has no influence on speed (since even a single-threaded app must lock/unlock the mutex), but then cache poisoning applies again.
You could pre-build an array with random numbers for every thread, containing as many random numbers as each thread needs. Create it in the main thread before spawning the threads and add a reference to it to the structure pointer you hand over to every thread. Then get the random numbers from there.
Or just implement your own random number generator if you don't need the "best" random numbers on the planet, one that works with per-thread memory for holding its state - it might even be faster than the system's built-in generator.
If a Linux-only solution works for you, you can use random_r. It allows you to pass the state with every call; just use a unique state object per thread. However, this function is a glibc extension and is most likely not supported by other platforms (it is part of neither the C standard nor the POSIX standard AFAIK - this function does not exist on Mac OS X, for example, and it may not exist on Solaris or FreeBSD either).
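A minimal per-thread setup with random_r might look like this (glibc-specific, as noted above; the wrapper names and the 64-byte state size are just an example):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Per-thread RNG state: keep one of these in each thread's argument struct.
typedef struct {
    struct random_data data;
    char state[64];               // larger state buffers give better generators
} thread_rng;

static void thread_rng_init(thread_rng *rng, unsigned int seed)
{
    memset(rng, 0, sizeof *rng);  // random_data must be zeroed before initstate_r
    initstate_r(seed, rng->state, sizeof rng->state, &rng->data);
}

static int32_t thread_rng_next(thread_rng *rng)
{
    int32_t value;
    random_r(&rng->data, &value);
    return value;
}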
Creating your own random number generator is actually not that hard. If you need real random numbers, you shouldn't use random() in the first place; random() only creates pseudo-random numbers (numbers that look random, but are predictable if you know the generator's internal state). Here's the code for one that produces good uint32 random numbers:
#include <stdint.h>

static uint32_t getRandom(uint32_t * m_z, uint32_t * m_w)
{
    *m_z = 36969 * (*m_z & 65535) + (*m_z >> 16);
    *m_w = 18000 * (*m_w & 65535) + (*m_w >> 16);
    return (*m_z << 16) + *m_w;
}
It's important to "seed" m_z and m_w in a proper way somehow, otherwise the results are not random at all. The seed value itself should already be random, but here you could use the system random number generator.
uint32_t m_z = random();
uint32_t m_w = random();
uint32_t nextRandom;

for (...) {
    nextRandom = getRandom(&m_z, &m_w);
    // ...
}
This way every thread only needs to call random() twice and then uses your own generator. BTW, if you need double randoms (that are between 0 and 1), the function above can be easily wrapped for that:
static double getRandomDouble(uint32_t * m_z, uint32_t * m_w)
{
    // The magic number below is 1/(2^32 + 2).
    // The result is strictly between 0 and 1.
    return (getRandom(m_z, m_w) + 1) * 2.328306435454494e-10;
}
Try to make this change in your code and let me know how the benchmark results are :-)
You are seeing cache line bouncing. I'm really surprised that you don't get wrong results, due to race conditions on the histogram buckets.
One possibility is that the time taken to create the threads exceeds the savings gained by using them. I would think that N is not very large if the elapsed time is only 6 seconds for an O(N^4) operation.
There's also no guarantee that multiple threads will run on different cores or CPUs. I'm not sure what the default thread affinity is with Linux - it may be that both threads run on a single core which would negate the benefits of a CPU-intensive piece of code such as this.
This article details default thread affinity and how to change your code to ensure threads run on specific cores.
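For example, with glibc you can pin each worker thread to its own core right after creating it; pthread_setaffinity_np is a non-portable GNU extension, and the core numbers below are purely illustrative:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin thread `tid` to CPU `core` (sketch: error handling omitted for brevity).
static void pin_to_core(pthread_t tid, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(tid, sizeof(set), &set);
}

// e.g. right after pthread_create(&tid, ...): pin_to_core(tid, n);  // n = worker index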
Even though the threads don't access the same elements of the array at the same time, the whole array may sit in a few memory pages. When one core/processor writes to such a page, the corresponding cache lines have to be invalidated on all other processors.
Avoid having many threads working over the same memory space. Allocate separate data for each thread to work upon, then join them together when the calculation finishes.
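One common way to do that in C is to give each thread its own cache-line-aligned block of data, so two cores never write to the same cache line; the 64-byte line size and the NBINS/NTHREADS placeholders below are assumptions for illustration:

#include <stdalign.h>            // C11 alignas

#define CACHE_LINE 64            // typical x86 cache-line size; check your CPU
#define NBINS      1024          // placeholder: number of histogram bins
#define NTHREADS   2             // placeholder: number of worker threads

// Each thread gets its own block, aligned (and therefore padded) to a full
// cache line, so one core's writes never invalidate another core's line.
typedef struct {
    alignas(CACHE_LINE) unsigned long histogram[NBINS];
} per_thread_data;

static per_thread_data data[NTHREADS];   // data[n] is touched only by thread n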
Off the top of my head:
Context switches
Resource contention
CPU contention (if they aren't getting split to multiple CPUs).
Cache thrashing
David,
Are you sure you run a kernel that supports multiple processors? If only one processor is utilized in your system, spawning additional CPU-intensive threads will slow down your program.
And are you sure that the thread support on your system actually utilizes multiple processors? Does top, for example, show that both cores of your processor are utilized when you run your program?
