Calculating the exact time a program takes without interference from other background processes - Linux

I have started to learn parallel programming, and to evaluate performance I need to know accurately how much time my program takes.
I want to measure the run time of my C program under Linux, but I keep getting divergent results.
In my opinion this is related to other processes getting CPU time. By the way, I am using these instructions:
double start, end, result;
start = omp_get_wtime();
.
.
.
end = omp_get_wtime();
result = end - start;
Thank you in advance.

For conducting accurate benchmarks, it is imperative that the external influences are suppressed as much as possible. If your system has enough CPU cores, you can isolate some of them using kernel parameters and thus prevent any other process and/or kernel tasks from using those cores:
... isolcpus=3,4,5 nohz_full=3,4,5 rcu_nocbs=3,4,5 ...
Those parameters will almost completely isolate CPUs 3, 4, and 5 by preventing the OS scheduler from running processes on them by default (isolcpus), the kernel RCU system from running tasks on them (rcu_nocbs), and the periodic scheduler timer ticks from firing (nohz_full). Make sure that you do not isolate all CPUs!
You can now explicitly assign a process to those cores using taskset -c 3-5 ... or the mechanism built into the OpenMP runtime, e.g., export GOMP_CPU_AFFINITY="3,4,5" for GCC. Note that, even if you do not use dedicated isolated CPUs, simply turning on thread pinning with export OMP_PROC_BIND=true or by setting GOMP_CPU_AFFINITY (KMP_AFFINITY for Intel) should decrease the run-time divergence.
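Independently of the kernel-level isolation, a simple way to tame the divergence is to pin the process (e.g. taskset -c 3 ./a.out), repeat the timed region several times and keep the minimum. A minimal sketch, assuming a hypothetical work() function standing in for the code being measured:
#include <omp.h>
#include <stdio.h>

/* hypothetical stand-in for the code being timed */
static void work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 10000000; i++)
        x += i * 0.5;
}

int main(void)
{
    double best = 1e30;

    /* repeat the measurement; the minimum is the run that was disturbed
       the least by other processes and by the kernel */
    for (int run = 0; run < 10; run++) {
        double start = omp_get_wtime();
        work();
        double end = omp_get_wtime();
        if (end - start < best)
            best = end - start;
    }

    printf("best of 10 runs: %f s\n", best);
    return 0;
}
Compile with gcc -fopenmp and run it under taskset (or with the affinity variables above) to combine pinning with the repeated measurement.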

Why not just use clock?
clock_t start = clock();
/* do whatever you like here */
clock_t end = clock();
double total_time = (double)(end - start) / CLOCKS_PER_SEC;
or the function
getrusage(...)
...
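The getrusage() answer above is only sketched; here is a minimal example (not from the original answer) of how it is typically used to read the user and system CPU time of the calling process:
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* ... run the code to be measured here ... */

    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        /* ru_utime/ru_stime are struct timeval: seconds + microseconds */
        printf("user   %ld.%06ld s\n",
               (long)usage.ru_utime.tv_sec, (long)usage.ru_utime.tv_usec);
        printf("system %ld.%06ld s\n",
               (long)usage.ru_stime.tv_sec, (long)usage.ru_stime.tv_usec);
    }
    return 0;
}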

Related

Analyzing Context Switch in Multithread [duplicate]

I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is the (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can just loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop and take the average, the smaller the result gets.
Edit
here is a snippet of the code that I have
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;
...
// each thread function looks similar to this
void* thread_1_func(void* time)
{
    thread_time* t = (thread_time*) time;   // renamed so it does not shadow the type name
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for (int x = 0; x < loop; ++x)
    {
        //where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}
void* thread_2_func(void* time)
{
    //similar as above
}

int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamps the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create two threads with the time structs as the arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // waits for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamps the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time
    // and the total execution time of the two threads..
}
First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong: that clock gives the time spent in that thread, in user mode, but the context switch does not happen in user mode, so you'd want to use another clock. Also, on multiprocessor systems the clocks can give different values from one processor to another! I therefore suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. But be warned that even if you read either of those twice in rapid succession, the timestamps will usually already be tens of nanoseconds apart.
As for context switches - there are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that the SSE/FP registers will be lazily saved, save the stack pointer, load the new stack pointer and return from that function - since the other thread had done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower, because the user-space page tables must be switched by reloading the CR3 register; this flushes entries from the TLB, which maps virtual addresses to physical ones.
However, the <1 ns context switch/system call overhead does not really seem plausible - it is very probable that you have either hyper-threading or 2 CPU cores here, so I suggest you set the CPU affinity on that process so that Linux only ever runs it on, say, the first CPU core:
#define _GNU_SOURCE   /* for the CPU_* macros and sched_setaffinity */
#include <sched.h>

cpu_set_t mask;
CPU_ZERO(&mask);      /* start with an empty CPU set */
CPU_SET(0, &mask);    /* allow CPU 0 only */
/* pid 0 = the calling process */
int result = sched_setaffinity(0, sizeof(mask), &mask);
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching floating point / SSE stacks (this happens lazily), you should have some floating point variables and do calculations on them prior to context switch, then add say .1 to some volatile floating point variable after the context switch to see if it has an effect on the switching time.
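To illustrate that last suggestion (a sketch, not from the original answer), the two helpers below could be called around the blocking point in each thread function: dirty_fp_state() fills the FP/SSE registers with live data before the switch, and touch_fp_state() forces them to be restored afterwards.
static volatile double fp_sink = 1.0;

/* Call just before the point where the thread is expected to block:
   fills the thread's FP/SSE registers with live data. */
void dirty_fp_state(void)
{
    double x = fp_sink;
    for (int i = 0; i < 64; i++)
        x = x * 1.0000001 + 0.5;
    fp_sink = x;
}

/* Call right after the thread resumes: the addition cannot execute
   before the lazily saved FP state has been restored. */
void touch_fp_state(void)
{
    fp_sink += 0.1;
}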
This is not straightforward, but as usual someone has already done a lot of work on it. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note that while the user in the link above was running the test, they were also hammering the machine with games and compilation, which is why their context switches took so long. Some more info here...
how can you measure the time spent in a context switch under java platform

How do I block all other processes on a Linux machine for XXX milliseconds?

I'm working with a Linux embedded SMP system that does audio I/O using ALSA and an external USB audio device, using a 3.6.6 kernel. Problem: I'm getting infrequent (once every few weeks) system hiccups that are causing the audio stream to die. Although it's tough to be sure, the hiccups look like they lock up the entire system for a few dozen milliseconds.
I can write ALSA code to recover after one of these hiccups, but since it's ALSA some trial and error will be required. Add that to having to wait weeks for a reoccurrence, and I'll be up a creek with a crowbar. I really need a way to cause the problem on demand.
I'd like to write a C program that runs as root and blocks all other processes on the system for a given number of milliseconds. I imagine it would involve disabling interrupts, doing a delay loop (since the timers will probably fail), and then restoring interrupts. But, I have to do it in such a way that the whole system doesn't go belly up.
Any ideas on how I would write such a program?
You could try raising the priority of your process and then using one of the "realtime" scheduling algorithms (e.g. SCHED_FIFO). This will help make sure that your process gets scheduled more consistently, even if other processes are running.
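A minimal sketch of requesting SCHED_FIFO for the current process (the priority value 50 is an arbitrary example; this needs root or CAP_SYS_NICE):
#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* priorities for SCHED_FIFO range from 1 (lowest) to 99 (highest) */
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 means "the calling process" */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... audio work here; a runnable SCHED_FIFO task is only preempted
       by higher-priority realtime tasks ... */
    return 0;
}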
Well, based on CL's tip, and on information from http://www.tldp.org/HOWTO/text/IO-Port-Programming, I wrote the following code:
#include <stdio.h>
#include <sys/io.h>   /* for iopl() */

int main(int argc, char *argv[]) {
    long i;
    volatile long j;  /* volatile so the delay loop is not optimized away */

    printf("About to lock system!\n");

    // Boost I/O privilege level so cli/sti are permitted
    iopl(3);

    // Clear interrupt flag, masking interrupts
    asm("cli");

    // Wait about a second (with some hijinks to keep
    // the loop from being optimized into oblivion)
    j = 1;
    for (i = 0; i < 250000000; i++) {
        j *= i;
    }

    // Restore interrupt flag, re-enabling interrupts
    asm("sti");

    // Restore I/O privilege level
    iopl(0);

    printf("Phew! Survived!\n");
    return 0;
}
When run as root, it works! Although not everything is suspended (and it's not clear to me what is and what isn't), enough locks up that my ALSA stream fails quite nicely. So, now I can stimulate the problem and ensure my code can handle it.
One note: I'd assumed that between the CLI and STI, system timing routines would fail due to the lack of interrupts. However, when just for the heck of it I tried usleep(), the timing code worked! But, the code as a whole actually didn't, because the call re-enabled interrupts, making the tool useless. Hence the use of a simple delay loop.

How to increase CPU frequency of newly spawned process

I've been working on a hobby project for a while (written in C), and it's still far from complete. It's very important that it will be fast, so I recently decided to do some benchmarking to verify that my way of solving the problem wouldn't be inefficient.
$ time ./old
real 1m55.92
user 0m54.29
sys 0m33.24
I redesigned parts of the program to remove a significant amount of unnecessary work and to reduce memory cache misses and branch mispredictions. The wonderful Callgrind tool was showing me more and more impressive numbers. Most of the benchmarking was done without forking external processes.
$ time ./old --dry-run
real 0m00.75
user 0m00.28
sys 0m00.24
$ time ./new --dry-run
real 0m00.15
user 0m00.12
sys 0m00.02
Clearly I was at least doing something right. Yet running the program for real told a different story.
$ time ./new
real 2m00.29
user 0m53.74
sys 0m36.22
As you might have noticed, the time is mostly dependent on the external processes. I don't know what caused the regression. There's nothing really weird about it; just a traditional vfork/execve/waitpid done by a single thread, running the same programs in the same order.
Something had to be causing forking to be slow, so I made a small test (similar to the one below) that would only spawn the new processes and have none of the overhead associated with my program. Obviously this had to be the fastest.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
    static const char *const _argv[] = {"/usr/bin/md5sum", "test.c", 0};
    int fd = open("/dev/null", O_WRONLY);
    dup2(fd, STDOUT_FILENO);
    close(fd);

    for (int i = 0; i < 100000; i++)
    {
        int pid = vfork();
        int status;
        if (!pid)
        {
            execve("/usr/bin/md5sum", (char *const *)_argv, environ);
            _exit(1);
        }
        waitpid(pid, &status, 0);
    }
    return 0;
}
$ time ./test
real 1m58.63
user 0m68.05
sys 0m30.96
I guess not.
At this point I decided to switch the governor to performance, and the times got better:
$ for i in 0 1 2 3 4 5 6 7; do sudo sh -c "echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor";done
$ time ./test
real 1m03.44
user 0m29.30
sys 0m10.66
It seems like every new process gets scheduled on a separate core, and it takes a while for that core to switch to a higher frequency. I can't say why the old version ran faster. Maybe it was lucky. Maybe it (due to its inefficiency) caused the CPU to choose a higher frequency earlier.
A nice side effect of changing the governor was that compile times improved too. Apparently compiling requires forking many new processes. It's not a workable solution though, as this program will have to run on other people's desktops (and laptops).
The only way I found to improve the original times was to restrict the program (and child processes) to a single CPU by adding this code at the beginning:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
sched_setaffinity(0, sizeof(mask), &mask);
Which actually was the fastest despite using the default "ondemand" governor:
$ time ./test
real 0m59.74
user 0m29.02
sys 0m10.67
Not only is it a hackish solution, but it doesn't work well if the launched program uses multiple threads. There's no way for my program to know that.
Does anyone have any idea how to get the spawned processes to run at a high CPU clock frequency? It has to be automated and must not require root privileges. Though I've only tested this on Linux so far, I intend to port it to more or less all popular and unpopular desktop OSes (and it will also run on servers). Any idea on any platform is welcome.
CPU frequency is seen (by most OSs) as a system property, so you can't change it without root rights. There is some research on extensions that would allow per-application adaptation; however, since the energy/performance model differs even within the same general architecture, you will hardly find a general solution.
In addition, be aware that in order to guarantee fairness, the Linux scheduler shares the execution time of parent and child processes for the first epoch of the child. This might have an impact on your problem.

No speed-up with useless printf's using OpenMP

I just wrote my first OpenMP program that parallelizes a simple for loop. I ran the code on my dual core machine and saw some speed up when going from 1 thread to 2 threads. However, I ran the same code on a school linux server and saw no speed-up. After trying different things, I finally realized that removing some useless printf statements caused the code to have significant speed-up. Below is the main part of the code that I parallelized:
#pragma omp parallel for private(i)
for (i = 2; i <= n; i++)
{
    printf("useless statement");
    prime[i-2] = is_prime(i);
}
I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?
Speculating, but maybe the stdout is guarded by a lock?
In general, printf is an expensive operation because it interacts with other resources (such as files, the console and such).
My empirical experience is that printf is very slow on a Windows console, comparatively much faster on a Linux console, and fastest of all when redirected to a file or /dev/null.
I've found that printf-debugging can seriously impact the performance of my apps, and I use it sparingly.
Try running your application redirected to a file or to /dev/null to see if this has any appreciable impact; this will help narrow down where the problem lays.
Of course, if the printfs are useless, why are they in the loop at all?
To expand a bit on @Will's answer ...
I don't know whether stdout is guarded by a lock, but I'm pretty sure that writing to it is serialised at some point in the software stack. With the printf statements included OP is probably timing the execution of a lot of serial writes to stdout, not the parallelised execution of the loop.
I suggest OP modifies the printf statement to include i, see what happens.
As for the apparent speed-up on the dual-core machine -- was it statistically significant ?
You have here a parallel for loop, but the scheduling is unspecified.
#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
There are several scheduling types defined in the OpenMP 3.0 standard. The schedule can be changed by setting the OMP_SCHEDULE environment variable to type[,chunk], where
type is one of static, dynamic, guided, or auto
chunk is an optional positive integer that specifies the chunk size
Another way of changing the schedule kind is to call the OpenMP function omp_set_schedule.
The is_prime function can be rather fast (I suggest), so each iteration does very little work:
prime[i-2] = is_prime(i);
So the problem can come from a badly chosen scheduling mode, where only a small amount of work is done before the threads wait at the barrier.
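For example, the loop from the question could request a dynamic schedule either through OMP_SCHEDULE or programmatically (a sketch, not from the original answer; count_primes is just a hypothetical wrapper around the question's loop and is_prime is the question's function):
#include <omp.h>

int is_prime(int i);   /* the function from the question */

void count_primes(int n, char *prime)
{
    /* equivalent to OMP_SCHEDULE="dynamic,64" when schedule(runtime) is used */
    omp_set_schedule(omp_sched_dynamic, 64);

    #pragma omp parallel for schedule(runtime)
    for (int i = 2; i <= n; i++)
        prime[i - 2] = is_prime(i);
}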
And printf has two parts inside it (taking glibc as the popular Linux libc implementation):
Parse the format string and put all parameters into a buffer
Write the buffer to the file descriptor (to the FILE buffer, since stdout is buffered by glibc by default)
The first part of printf can be done in parallel, but the second part is a critical section and is locked with _IO_flockfile.
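One way to confirm that this lock is the bottleneck (a sketch, not from the original answers) is to keep the same work per iteration but move the printf out of the loop, e.g. one summary line per thread; count_primes_quiet is again a hypothetical wrapper around the question's loop:
#include <omp.h>
#include <stdio.h>

int is_prime(int i);   /* the function from the question */

void count_primes_quiet(int n, char *prime)
{
    #pragma omp parallel
    {
        int local_count = 0;

        #pragma omp for
        for (int i = 2; i <= n; i++) {
            prime[i - 2] = is_prime(i);
            local_count++;              /* no shared resource touched here */
        }

        /* one printf per thread instead of one per iteration */
        printf("thread %d handled %d iterations\n",
               omp_get_thread_num(), local_count);
    }
}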
What were your timings - was it much slower with the printf's? In some tight loops the printf's might take a large fraction of the total computing time; for example if is_prime() is pretty fast, and therefore the performance is determined more by the number of calls to printf than the number of (parallelized) calls to is_prime().

can i easily write a program to make use of Intel's Quad core or i7 chip if only 1 thread is used?

I wonder: if my program has only 1 thread, can I write it so that the quad core or i7 can actually make use of the different cores? Usually when I write programs on a quad-core computer, the CPU usage only goes to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (The programs I write are usually in Ruby, Python, or PHP, so they may not be that well optimized.)
Update: what if I write it in C or C++ instead, and
for (i = 0; i < 100000000; i++) {
    a = i * 2;
    b = i + 1;
    if (a == ... || b == ...) { ... }
}
and then use the highest level of optimization with the compiler. Can the compiler make the multiplication happen on one core and the addition happen on a different core, and therefore make 2 cores work at the same time? Isn't that a fairly easy optimization to use 2 cores?
No. You need to use threads to execute multiple paths concurrently on multiple CPUs (be they real or virtual)... execution of one thread is inherently bound to one CPU, as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, then there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, but all other cores are not being used. If all cores were used, then it would be theoretically possible for the process to hog 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows show the CPU utilization of all processes running at the time, not only of one process.
Secondly, the code you present could be split into code which can execute on two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other, and they only depend on i. With that type of situation, separating the inside of the for loop like the following could allow multi-threaded operation which could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
    a = i * 2;
}

// Process this in another thread:
for (int i = 0; i < 1000; i++) {
    b = i + 1;
}
However, what becomes tricky is if there needs to be a time when the results from the two separate threads need to be evaluated, such as seems to be implied by the if statement later on:
for (i = 0; i < 1000; i++) {
    // manipulate "a" and "b"
    if (a == ... || b == ...) { ... }
}
This would require that the a and b values which reside in separate threads (which are executing on separate processors) to be looked up, which is a serious headache.
There is no real good guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition probably will take different amount of times to execute), and that means that one thread may need to wait for another for the i values to get in sync before comparing the a and b that corresponds to the dependent value i. Or, do we make a third thread for value comparison and synchronization of the two threads? In either case, the complexity is starting to build up very quickly, so I think we can agree that we're starting to see a serious mess arising -- sharing states between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort, however, as soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
A couple of rules of thumb when it comes to concurrent programming:
When there are tasks which can be broken down into parts which involve processing of data that is completely independent of other data and its results (states), then parallelizing can be very easy.
For example, two functions which calculate a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as there is no state to share or handle between the calculations, even if there were multiple values of x that needed to be calculated, those too could be split up further:
x = [1, 2, 3, 4]
foreach t in x:
runInThread(f(t))
foreach t in x:
runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be a very good thing for concurrent programming.
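In C with POSIX threads, that pseudocode might look roughly like the sketch below (one thread per call is purely for illustration; results are smuggled through the void* return value):
#include <pthread.h>
#include <stdio.h>

/* the two independent functions from the pseudocode above */
static void *f(void *arg) { long v = (long)arg; return (void *)(2 * v); }
static void *g(void *arg) { long v = (long)arg; return (void *)(v + 1); }

int main(void)
{
    long x[] = {1, 2, 3, 4};
    pthread_t tf[4], tg[4];

    /* launch all eight calculations; they share no state at all */
    for (int i = 0; i < 4; i++) {
        pthread_create(&tf[i], NULL, f, (void *)x[i]);
        pthread_create(&tg[i], NULL, g, (void *)x[i]);
    }

    for (int i = 0; i < 4; i++) {
        void *rf, *rg;
        pthread_join(tf[i], &rf);
        pthread_join(tg[i], &rg);
        printf("f(%ld) = %ld   g(%ld) = %ld\n", x[i], (long)rf, x[i], (long)rg);
    }
    return 0;
}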
However, as soon as there is a dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases, these types of problems have to be performed serially, as they must wait for results from other calculations.
Perhaps the question comes down to: why can't compilers figure out the parts that can be automatically parallelized and perform those optimizations themselves? I'm not an expert on compilers, so I can't say, but there is an article on automatic parallelization at Wikipedia which may have some information.
I know Intel chips very well.
Per your code, "if (a == ... || b == ...)" is a barrier; otherwise the processor cores will execute all the code in parallel, regardless of what kind of optimization the compiler has done. It only requires that the compiler is not a very "stupid" one. It means that the hardware has this capability itself, not the software. So threaded programming or OpenMP is not necessary in such cases, though they will help improve parallel computing. Note that this doesn't mean Hyper-Threading, just normal multi-core processor functionality.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classical example which could be executed by multi-core/multi-channel IMC platforms (e.g. the Intel Nehalem family, such as Core i7) in parallel; no extra software optimization would be needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];

int i;
for (i = 0; i < 64; i++) {
    *(buffer + i)       = *(buffer0 + i);
    *(buffer + 64 + i)  = *(buffer1 + i);
    *(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1. Core i7 has a triple-channel IMC; its bus width is 192 bits, 64 bits per channel, and the memory address space is interleaved among the channels on a per-cache-line basis. The cache-line length is 64 bytes, so basically buffer0 is on channel 0, buffer1 on channel 1 and buffer2 on channel 2, while buffer[192] is interleaved evenly among the 3 channels, 64 bytes per channel. The IMC supports loading or storing data from or to multiple channels concurrently. That's a multi-channel MC burst with maximum throughput. In the description below I'll only speak of 64 bytes per channel, i.e. BL x8 (Burst Length 8, 8 x 8 = 64 bytes = one cache line) per channel.
2. buffer0..2 and buffer are contiguous in the memory space (on a specific page, both virtually and physically; they are stack memory). When the code runs, buffer0, 1, 2 and buffer are loaded/fetched into the processor cache, 6 cache lines in total. So once execution of the "for(){}" code above starts, accessing memory is not necessary at all, because all the data are in the cache - the L3 cache, an uncore part which is shared by all cores. We'll not talk about L1/L2 here. In this case every core could pick the data up and compute on it independently; the only requirement is that the OS supports MP and that task stealing is allowed, i.e. runtime scheduling and affinity sharing.
3. There are no dependencies among buffer0, 1, 2 and buffer, so there are no execution stalls or barriers. E.g. executing *(buffer + 64 + i) = *(buffer1 + i) does not need to wait for the execution of *(buffer + i) = *(buffer0 + i) to finish.
Though, the most important and difficult point is "task stealing, runtime scheduling and affinity sharing": for a given task there is only one task execution context, and it would have to be shared by all cores to perform parallel execution. Anyone who understands this point is among the top experts in the world. I'm looking for such an expert to co-work on my open-source project and be responsible for parallel computing and work related to the latest HPC architectures.
Note that in the above example code you could also use some SIMD instructions such as movntdq/a, which bypass the processor cache and write to memory directly. This is also a very good idea when performing software-level optimization, since accessing memory is extremely expensive: accessing the cache (L1) may need just 1 cycle, but accessing memory needed 142 cycles on earlier x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com to know the details.
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded multiple processors/cores will only be used if:
the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.
A single-threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time - according to some rules, to balance the load, etc. So you will see only 25% usage overall while all four cores show activity - but only one at a time.
The only way to use multiple cores without using multithreading is to use multiple programs.
In your example above, one program could handle 0-24999999, the next 25000000-49999999, and so on. Set all four of them off at the same time, and they will use all four cores.
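A sketch of that multi-process idea (hypothetical; the commented-out body stands in for the question's loop, and each worker process handles one quarter of the range):
#include <sys/wait.h>
#include <unistd.h>

#define N       100000000L
#define NCORES  4

int main(void)
{
    for (int c = 0; c < NCORES; c++) {
        if (fork() == 0) {
            /* each child works on its own quarter of the range */
            long begin = c * (N / NCORES);
            long end   = (c + 1) * (N / NCORES);
            for (long i = begin; i < end; i++) {
                /* a = i * 2; b = i + 1; ... as in the question */
            }
            _exit(0);
        }
    }

    /* the parent waits for the four workers */
    for (int c = 0; c < NCORES; c++)
        wait(NULL);
    return 0;
}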
Usually you would be better off writing a (single) multithreaded program.
With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for(..) {
...
}
to say that this for will run in parallel.
This is one easy way to parallelize something, but at some time you will have to understand how parallel programs execute and will be exposed to parallel programming bugs.
If you want to parallelize the choice of the "i"s for which your statement if (a == ... || b == ...) evaluates to "true", then you can do this with PLINQ (in .NET 4.0):
//note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
where (i % 2 == 1 && i >= 10) //your condition
select i;
//while iterating, the query is evaluated in parallel!
//Result will probably never be in order (eg. 13, 11, 17, 15, 19..)
foreach (var selected in query)
{
//not parallel here!
}
If, instead, you want to parallelize operations, you will be able to do:
Parallel.For(0, 100000000, i =>
{
    if (i > 10) //your condition here
        DoWork(i); //Thread-safe operation
});
Since you are talking about 'Task Manager', you appear to be running on Windows. However, if you are running a web server there (for Ruby or PHP with fcgi, or Apache pre-forking, and to a lesser extent other Apache workers) with multiple processes, then they would tend to spread out across the cores.
If only a single program without threading is running, then no, no significant advantage will come from that - you're only running one thing at a time, apart from OS-driven background processes.
