Parallel threads in Linux

#include <iostream>
#include <time.h>
#include <pthread.h>
using namespace std;

// Worker: repeatedly runs a busy loop and prints how long it took (in CPU clock ticks).
void* genFunc2(void* val)
{
    int i, j, k;
    for (i = 0; i < (1 << 15); i++)
    {
        clock_t t1 = clock();
        for (j = 0; j < (1 << 20); j++)
        {
            for (k = 0; k < (1 << 10); k++)
            {
            }
        }
        clock_t t2 = clock();
        cout << "t1:" << t1 << " t2:" << t2 << " t2-t1:" << (t2 - t1) / CLOCKS_PER_SEC << endl;
    }
    return NULL;
}

int main()
{
    cout << "begin" << endl;
    pthread_t ntid1, ntid2, ntid3, ntid4;
    pthread_create(&ntid1, NULL, genFunc2, NULL);
    pthread_create(&ntid2, NULL, genFunc2, NULL);
    pthread_create(&ntid3, NULL, genFunc2, NULL);
    pthread_create(&ntid4, NULL, genFunc2, NULL);
    pthread_join(ntid1, NULL);
    pthread_join(ntid2, NULL);
    pthread_join(ntid3, NULL);
    pthread_join(ntid4, NULL);
    return 0;
}
My example is shown above. When I create just one thread, it prints its time in about 2 seconds. However, when I create four threads, each thread takes about 15 seconds to print its result. Why?

This kind of algorithm can easily be parallelized with OpenMP; I suggest looking into it to simplify your code (a sketch follows below).
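For illustration, a hedged sketch of the question's outer loop parallelized with OpenMP (compile with gcc/g++ -fopenmp; the omp_get_wtime wall-clock timing and the volatile accumulator are my additions, not part of the original code):

#include <omp.h>
#include <cstdio>

int main()
{
    // Each iteration of the outer loop may run on a different core.
    #pragma omp parallel for
    for (int i = 0; i < (1 << 15); i++)
    {
        double t1 = omp_get_wtime();          // wall-clock seconds
        volatile long sink = 0;               // keeps the busy loops from being optimized away
        for (long j = 0; j < (1L << 20); j++)
            for (long k = 0; k < (1L << 10); k++)
                sink++;
        double t2 = omp_get_wtime();
        printf("iteration %d took %f s\n", i, t2 - t1);
    }
    return 0;
}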
That being said, you use the clock() function to compute the execution time of your runs. This doesn't show the wall-clock time of your execution, but the number of clock ticks your CPU was busy executing your program. This can look strange because it may, for example, report 4 seconds when only 1 second of wall-clock time has passed. On a 4-core machine this is perfectly logical: if all 4 cores were 100% busy in your threads, you used 4 seconds of computing time (in core·seconds). The surprise comes from dividing by the CLOCKS_PER_SEC constant, which corresponds to a single core; each of your cores accumulates ticks at CLOCKS_PER_SEC, which explains most of the discrepancy between your experiments.
Furthermore, two notes to take into account with your code:
You should deactivate any kind of optimization (e.g. -O0 on gcc); otherwise your inner loops may be removed entirely, depending on the compiler and other circumstances such as parallelization.
If your computer only has two real cores with Hyper-Threading enabled (thus showing 4 cores to your OS), that may explain the remaining difference between your runs and the explanation above.
To measure wall-clock time with high resolution, you should use clock_gettime(CLOCK_MONOTONIC, &timer), as explained in this answer.
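A minimal sketch of that kind of wall-clock measurement, assuming a CPU-bound loop (the loop bound and the volatile accumulator are placeholders, not taken from the question; on older glibc you may need to link with -lrt):

#include <stdio.h>
#include <time.h>

// Difference between two timespecs, in seconds.
static double elapsed_seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main()
{
    struct timespec t1, t2;
    volatile long sink = 0;          // keeps the loop from being optimized away

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (long i = 0; i < (1L << 28); i++)
        sink += i;
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("wall-clock time: %.3f s\n", elapsed_seconds(t1, t2));
    return 0;
}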

Related

Is the following code thread-unsafe? If so, how can I make a particular possible result more likely to come out?

Is the screen output of the following program deterministic? My understanding is that it is not, as it could be either 1 or 2 depending on whether the last thread to read the value of i reads it before or after the other thread has written 1 into it.
On the other hand, I keep seeing the same output, as if each thread waits for the previous one to finish: I get 2 on screen in this case, or 100 if I create similar threads t1 through t100 and join them all.
If the answer is no, i.e. the result is not deterministic, is there a way, with a simple toy program, to increase the odds that one of the other possible results comes out?
#include <iostream>
#include <thread>

int main() {
    int i = 0;
    std::thread t1([&i](){ ++i; });
    std::thread t2([&i](){ ++i; });
    t1.join();
    t2.join();
    std::cout << i << '\n';
}
(I'm compiling and running it like this: g++ -std=c++11 -lpthread prova.cpp -o exe && ./exe.)
You are always seeing the same result because the first thread starts and completes its operation before the second one even runs. This narrows the window for a race condition to occur.
But ultimately there is still a chance that it occurs, because the ++ operation is not atomic (read the value, then increment, then write it back).
If the two threads read i at the same time (e.g. because thread 1 was slowed down by a busy CPU), then they will both read the same value and the final result will be 1.
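A minimal sketch that makes the race much more likely by having each thread perform many non-atomic increments (the iteration count is arbitrary; formally this is a data race and therefore undefined behavior, but in practice it usually prints a total well below 2000000):

#include <iostream>
#include <thread>

int main() {
    long counter = 0;
    auto work = [&counter]() {
        for (int n = 0; n < 1000000; ++n)
            ++counter;               // non-atomic read-modify-write
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    std::cout << counter << '\n';    // usually far less than 2000000
}

Replacing the plain long with std::atomic<long> makes the total come out at exactly 2000000 again.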

Analyzing Context Switch in Multithread [duplicate]

I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is the (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can just loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop it and take the average, the smaller the results I get.
Edit
here is a snippet of the code that I have
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;
...
// each thread function looks similar to this
void* thread_1_func(void* time)
{
    thread_time* t = (thread_time*) time;   // renamed: a variable called thread_time would shadow the type
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for (x = 0; x < loop; ++x)
    {
        // where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}

void* thread_2_func(void* time)
{
    // similar to the above
}
int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamp the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create two threads with the time structs as the arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // wait for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamp the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time
    // and the combined execution time of the two threads...
}
First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong; this clock gives the time spent in that thread, in user mode. However, the context switch does not happen in user mode, so you'd want to use another clock. Also, on multiprocessor systems the clocks can give different values from one processor to another! Thus I suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. But be warned that even if you read either of these twice in rapid succession, the timestamps will usually already be tens of nanoseconds apart.
As for context switches - there are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that the SSE/FP registers will be saved lazily, save the stack pointer, load the new stack pointer, and return from that function - since the other thread had done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower: this is because the user-space page tables must be switched by reloading the CR3 register (flushing the TLB), which causes misses in the TLB, which maps virtual addresses to physical ones.
However, the <1 ns context switch / system call overhead does not really seem plausible - it is very probable that hyper-threading or 2 CPU cores are involved here, so I suggest you set the CPU affinity on the process so that Linux only ever runs it on, say, the first CPU core:
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);                              /* pin to CPU core 0 */
int result = sched_setaffinity(0, sizeof(mask), &mask);
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching the floating point / SSE state (this happens lazily), you should have some floating-point variables and do calculations on them prior to the context switch, then add, say, 0.1 to some volatile floating-point variable after the context switch to see whether it has an effect on the switching time.
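A hedged sketch of the ping-pong measurement the question describes: two threads pinned to one core take turns under a mutex and condition variable, so every hand-over forces a context switch. The iteration count and the core number are arbitrary choices, and the result includes mutex/condvar overhead, not just the raw switch.

// build (assumption): g++ -O2 pingpong.cpp -lpthread
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn = 0;                          // whose turn it is: 0 or 1

static void* worker(void* arg)
{
    int me = (int)(long)arg;
    for (int i = 0; i < ITERATIONS; i++)
    {
        pthread_mutex_lock(&lock);
        while (turn != me)
            pthread_cond_wait(&cond, &lock);
        turn = 1 - me;                        // hand over to the other thread
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main()
{
    // Pin the whole process to core 0 so each hand-over is a real switch.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);

    struct timespec start, end;
    pthread_t t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, worker, (void*)0L);
    pthread_create(&t2, NULL, worker, (void*)1L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per hand-over (includes locking overhead)\n", ns / (2.0 * ITERATIONS));
    return 0;
}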
This is not straightforward, but as usual someone has already done a lot of work on this. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note: while the user was running the test in the above link, they were also hammering the machine with games and compiling, which is why the context switches were taking so long. Some more info here...
how can you measure the time spent in a context switch under java platform

Niceness of a Process and Benchmarking

Does the niceness of a process matter for (micro)benchmarking? My intuition says that starting a benchmark with nice -20 would produce more precise results, since fewer context switches occur for the benchmark.
On the other hand, many tools and library functions allow retrieving not only wall time but also CPU time. Additionally, a benchmark machine should not have other resource-intensive processes running at the same time, so there would not be much competition anyway.
As a naive approach, I wrote a simple program that measures wall-time in the hope to see a difference when starting the process with different niceness values:
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

int main() {
    struct timeval tval_before, tval_after, tval_result;
    gettimeofday(&tval_before, NULL);
    int i;
    for (i = 0; i < 2000000000; i++) {
    }
    gettimeofday(&tval_after, NULL);
    timersub(&tval_after, &tval_before, &tval_result);
    printf("Time elapsed: %ld.%06ld\n", (long int)tval_result.tv_sec, (long int)tval_result.tv_usec);
    return 0;
}
However, when measuring, there is no consistent difference between starting the program with a high or a low nice value. So my questions: does my benchmark not exercise a property influenced by niceness, or is niceness simply not relevant for this benchmark? Can the niceness value be relevant on a dedicated benchmark machine at all? And additionally: is the perf stat context-switches metric suitable for measuring the impact of niceness?
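One way to make the comparison explicit is to set the niceness from inside the benchmark itself, so the measured loop is identical across runs. A hedged sketch (setpriority is a standard POSIX call, but negative nice values require root/CAP_SYS_NICE, and the volatile accumulator is my addition so the loop is not optimized away):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

int main()
{
    // Try to lower the nice value; this fails without sufficient privileges.
    if (setpriority(PRIO_PROCESS, 0, -20) != 0)
        perror("setpriority");
    printf("running at niceness %d\n", getpriority(PRIO_PROCESS, 0));

    struct timeval before, after, result;
    gettimeofday(&before, NULL);
    volatile long sink = 0;                 // prevents the compiler from removing the loop
    for (long i = 0; i < 2000000000L; i++)
        sink += i;
    gettimeofday(&after, NULL);
    timersub(&after, &before, &result);
    printf("Time elapsed: %ld.%06ld\n", (long)result.tv_sec, (long)result.tv_usec);
    return 0;
}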

linux RT scheduling

Our product runs Linux 2.6.32, and we have some user-space processes which run periodically - 'keep-alive-like' processes. We don't place hard requirements on these processes - they just need to run once every several seconds and refresh a watchdog.
We gave these processes a scheduling class of RR or FIFO with maximum priority, and yet we see many false positives - it seems like they don't get the CPU for several seconds.
I find this very odd because I know that Linux, while not being an RT OS, can still yield very good latency (I see people talking about the order of a few msec) - and yet I can't even get the process to run once every 5 seconds.
The logic of the Linux RT scheduler seems very straightforward and simple, so I suspected that these processes were being blocked by something else - I/O contention, interrupts, or kernel threads taking too long - but now I'm not so sure:
I wrote a really basic program to simulate such a process - it wakes up every second and measures the time spent since the last time it finished running. As far as I understand, the measurement doesn't include blocking on any I/O, so the results printed by this process reflect the behavior of the scheduler:
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/param.h>
#include <time.h>

#define MICROSECONDSINASEC 1000000
#define MILLISECONDSINASEC 1000

int main()
{
    struct sched_param schedParam;
    struct timeval now, start;
    int spent_time = 0;
    time_t current_time;

    schedParam.sched_priority = sched_get_priority_max(SCHED_RR);
    int retVal = sched_setscheduler(0, SCHED_RR, &schedParam);
    if (retVal != 0)
    {
        printf("failed setting RT sched");
        return 0;
    }

    gettimeofday(&start, 0);
    start.tv_sec -= 1;
    start.tv_usec += MICROSECONDSINASEC;

    while (1)
    {
        sleep(1);
        gettimeofday(&now, 0);
        now.tv_sec -= 1;
        now.tv_usec += MICROSECONDSINASEC;
        spent_time = MILLISECONDSINASEC * (now.tv_sec - start.tv_sec) + ((now.tv_usec - start.tv_usec) / MILLISECONDSINASEC);
        FILE *fl = fopen("output_log.txt", "a");
        if (spent_time > 1100)
        {
            time(&current_time);
            fprintf(fl, "\n (%s) - had a gap of %d [msec] instead of 1000\n", ctime(&current_time), spent_time);
        }
        fclose(fl);
        gettimeofday(&start, 0);
    }
    return 0;
}
I ran this process overnight on several machines - including ones that don't run our product (just plain Linux) - and I still saw gaps of several seconds, even though I made sure the process DID get the RT priority, and I can't figure out why. Technically this process should preempt any other running process, so how can it wait so long to run?
A few notes:
I ran these processes mostly on virtual machines - so maybe there is intervention from the hypervisor. But in the past I've seen such behavior on physical machines as well.
Making the process RT did improve the results drastically, but did not prevent the problem entirely.
There are no other RT processes running on the machine except for the Linux migration and watchdog kernel threads (which I don't believe can starve my processes).
What can I do? I feel like I'm missing something very basic here.
thanks!
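As a side note on the measurement itself: sleep(1) plus gettimeofday accumulates drift, so part of a reported gap can come from the measuring loop rather than from the scheduler. A hedged sketch of a drift-free periodic wait using an absolute deadline (clock_nanosleep with TIMER_ABSTIME; on glibc of that era you may need to link with -lrt, and the 100 ms threshold is arbitrary):

#include <stdio.h>
#include <time.h>

int main()
{
    struct timespec next, now;
    clock_gettime(CLOCK_MONOTONIC, &next);

    while (1)
    {
        next.tv_sec += 1;                                   // next absolute deadline
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        clock_gettime(CLOCK_MONOTONIC, &now);
        long late_ms = (now.tv_sec - next.tv_sec) * 1000
                     + (now.tv_nsec - next.tv_nsec) / 1000000;
        if (late_ms > 100)
            printf("woke up %ld ms late\n", late_ms);
    }
    return 0;
}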

How to increase CPU frequency of newly spawned process

I've been working on a hobby project for a while (written in C), and it's still far from complete. It's very important that it be fast, so I recently did some benchmarking to verify that my way of solving the problem wouldn't be inefficient.
$ time ./old
real 1m55.92
user 0m54.29
sys 0m33.24
I redesigned parts of the program to remove a significant amount of unnecessary work and to reduce memory cache misses and branch mispredictions. The wonderful Callgrind tool was showing me more and more impressive numbers. Most of the benchmarking was done without forking external processes.
$ time ./old --dry-run
real 0m00.75
user 0m00.28
sys 0m00.24
$ time ./new --dry-run
real 0m00.15
user 0m00.12
sys 0m00.02
Clearly I was at least doing something right. Yet running the program for real told a different story.
$ time ./new
real 2m00.29
user 0m53.74
sys 0m36.22
As you might have noticed, the time is mostly dependent on the external processes. I don't know what caused the regression. There's nothing really weird about it; just a traditional vfork/execve/waitpid done by a single thread, running the same programs in the same order.
Something had to be causing forking to be slow, so I made a small test (similar to the one below) that would only spawn the new processes and have none of the overhead associated with my program. Obviously this had to be the fastest.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
    static const char *const _argv[] = {"/usr/bin/md5sum", "test.c", 0};
    int fd = open("/dev/null", O_WRONLY);
    dup2(fd, STDOUT_FILENO);
    close(fd);
    for (int i = 0; i < 100000; i++)
    {
        int pid = vfork();
        int status;
        if (!pid)
        {
            execve("/usr/bin/md5sum", (char *const *)_argv, environ);
            _exit(1);
        }
        waitpid(pid, &status, 0);
    }
    return 0;
}
$ time ./test
real 1m58.63
user 0m68.05
sys 0m30.96
I guess not.
At this point I decided to switch the CPU frequency governor to performance, and the times got better:
$ for i in 0 1 2 3 4 5 6 7; do sudo sh -c "echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor";done
$ time ./test
real 1m03.44
user 0m29.30
sys 0m10.66
It seems like every new process gets scheduled on a separate core, and it takes a while for that core to switch to a higher frequency. I can't say why the old version ran faster. Maybe it was lucky. Maybe it (due to its inefficiency) caused the CPU to choose a higher frequency earlier.
A nice side effect of changing the governor was that compile times improved too. Apparently compiling involves forking many new processes. It's not a workable solution though, as this program will have to run on other people's desktops (and laptops).
The only way I found to improve the original times was to restrict the program (and child processes) to a single CPU by adding this code at the beginning:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
sched_setaffinity(0, sizeof(mask), &mask);
Which actually was the fastest despite using the default "ondemand" governor:
$ time ./test
real 0m59.74
user 0m29.02
sys 0m10.67
Not only is it a hackish solution, but it also doesn't work well if the launched program uses multiple threads. There's no way for my program to know that.
Does anyone have any idea how to get the spawned processes to run at a high CPU clock frequency? It has to be automated and must not require superuser privileges. Though I've only tested this on Linux so far, I intend to port it to more or less all popular and unpopular desktop OSes (and it will also run on servers). Any idea for any platform is welcome.
CPU frequency is seen (by most OSes) as a system property. Thus, you can't change it without root rights. There is some research on extensions that would allow per-program adjustment; however, since the energy/performance model differs even within the same general architecture, you will hardly find a general solution.
In addition, be aware that in order to guarantee fairness, the Linux scheduler shares the execution time of parent and child processes for the first epoch of the child. This might have an impact on your problem.
