Unexpectedly good performance with openmp parallel for loop - multithreading

I have edited my question after previous comments (especially @Zboson) for better readability.
I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7-4960HQ, 4 cores / 8 threads. (See Intel docs here.)
Here is my test code:
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}
When I compile it using gcc 4.9-20140209, with the command gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q, I see the following performance as I change OMP_NUM_THREADS [the points are an average of 5 runs; the error bars (which are hardly visible) are the standard deviations]:
The plot is clearer when shown as the speed up with respect to OMP_NUM_THREADS=1:
The performance increases more or less monotonically with thread count, even when the number of OpenMP threads greatly exceeds the core count and even the hyper-thread count! Usually the performance should drop off when too many threads are used (at least in my previous experience), due to the threading overhead, especially as the calculation should be CPU (or at least memory) bound and not waiting on I/O.
Even more weirdly, the speed-up is 35 times!
Can anyone explain this?
I also tested this with a much smaller array of 8192*4 elements, and see similar performance scaling.
In case it matters, I am on Mac OS 10.9, and the performance data were obtained by running (under bash):
for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i;
        echo -ne $i $k "";
        ./a.out;
    done;
done > out
EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both speed up and low thread overhead) speak for themselves!
EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also brought the single-threaded code down from 22 s to 2 s!):

The problem is likely due to the clock() function. It does not return the wall time on Linux. You should use the function omp_get_wtime(). It's more accurate than clock and works on GCC, ICC, and MSVC. In fact I use it for timing code even when I'm not using OpenMP.
I tested your code with it here
http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2
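As a minimal sketch of the difference (my own illustration, not the code behind the link above): in a parallel region, clock() on Linux accumulates CPU time across all threads, while omp_get_wtime() reports wall time, so the two figures diverge roughly by the thread count.
#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    double w0 = omp_get_wtime();
    clock_t c0 = clock();
    #pragma omp parallel
    {
        /* burn some CPU in every thread */
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; ++i) x += 1e-9;
    }
    clock_t c1 = clock();
    double w1 = omp_get_wtime();
    /* On Linux, clock() sums CPU time over all threads, so with N threads
       the second figure is roughly N times the first. */
    printf("wall  = %g s\n", w1 - w0);
    printf("clock = %g s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    return 0;
}
Compiled with something like gcc -O2 -fopenmp, this shows the wall time staying roughly constant as threads are added while the clock() figure grows with the thread count.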
Edit: Another thing to consider which may be causing your problem is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled with AVX support (actually AVX2). You can see this on GCC Explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code using AVX, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to the question Math functions takes more cycles after running any intel AVX function and also the answer here Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this is what matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.
#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();
    B[i] = sin(B[i]);
    _mm256_zeroupper();
    B[i] += exp(A[i]);
}
Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa,-q) or to use just SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa,-q).
Edit: GCC 4.8 has an option -mvzeroupper which may be the most convenient solution.
This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.
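For example (untested sketch, simply appending the flag to the question's original compile line), that would look like:
gcc -Ofast -march=native -mvzeroupper -std=c99 -fopenmp -Wa,-q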

Related

perf stat time elapsed for sys and user [duplicate]

$ time foo
real 0m0.003s
user 0m0.000s
sys 0m0.004s
$
What do real, user and sys mean in the output of time? Which one is meaningful when benchmarking my app?
Real, User and Sys process time statistics
One of these things is not like the other. Real refers to actual elapsed time; User and Sys refer to CPU time used only by the process.
Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).
User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.
Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process. See below for a brief description of kernel mode (also known as 'supervisor' mode) and the system call mechanism.
User+Sys will tell you how much actual CPU time your process used. Note that this is across all CPUs, so if the process has multiple threads (and this process is running on a computer with more than one processor) it could potentially exceed the wall clock time reported by Real (which usually occurs). Note that in the output these figures include the User and Sys time of all child processes (and their descendants) as well when they could have been collected, e.g. by wait(2) or waitpid(2), although the underlying system calls return the statistics for the process and its children separately.
Origins of the statistics reported by time (1)
The statistics reported by time are gathered from various system calls. 'User' and 'Sys' come from wait (2) (POSIX) or times (2) (POSIX), depending on the particular system. 'Real' is calculated from a start and end time gathered from the gettimeofday (2) call. Depending on the version of the system, various other statistics such as the number of context switches may also be gathered by time.
On a multi-processor machine, a multi-threaded process or a process forking children could have an elapsed time smaller than the total CPU time - as different threads or processes may run in parallel. Also, the time statistics reported come from different origins, so times recorded for very short running tasks may be subject to rounding errors, as the example given by the original poster shows.
A brief primer on Kernel vs. User mode
On Unix, or any protected-memory operating system, 'Kernel' or 'Supervisor' mode refers to a privileged mode that the CPU can operate in. Certain privileged actions that could affect security or stability can only be done when the CPU is operating in this mode; these actions are not available to application code. An example of such an action might be manipulation of the MMU to gain access to the address space of another process. Normally, user-mode code cannot do this (with good reason), although it can request shared memory from the kernel, which could be read or written by more than one process. In this case, the shared memory is explicitly requested from the kernel through a secure mechanism and both processes have to explicitly attach to it in order to use it.
The privileged mode is usually referred to as 'kernel' mode because the kernel is executed by the CPU running in this mode. In order to switch to kernel mode you have to issue a specific instruction (often called a trap) that switches the CPU to running in kernel mode and runs code from a specific location held in a jump table. For security reasons, you cannot switch to kernel mode and execute arbitrary code - the traps are managed through a table of addresses that cannot be written to unless the CPU is running in supervisor mode. You trap with an explicit trap number and the address is looked up in the jump table; the kernel has a finite number of controlled entry points.
The 'system' calls in the C library (particularly those described in Section 2 of the man pages) have a user-mode component, which is what you actually call from your C program. Behind the scenes, they may issue one or more system calls to the kernel to do specific services such as I/O, but they still also have code running in user-mode. It is also quite possible to directly issue a trap to kernel mode from any user space code if desired, although you may need to write a snippet of assembly language to set up the registers correctly for the call.
More about 'sys'
There are things that your code cannot do from user mode - things like allocating memory or accessing hardware (HDD, network, etc.). These are under the supervision of the kernel, and it alone can do them. Some operations like malloc or fread/fwrite will invoke these kernel functions, and that then counts as 'sys' time. Unfortunately it's not as simple as "every call to malloc will be counted in 'sys' time". The call to malloc will do some processing of its own (still counted in 'user' time) and then somewhere along the way it may call the function in the kernel (counted in 'sys' time). After returning from the kernel call, there will be some more time in 'user', and then malloc will return to your code. As for when the switch happens, and how much of it is spent in kernel mode... you cannot say. It depends on the implementation of the library. Also, other seemingly innocent functions might use malloc and the like in the background, which will again spend some time in 'sys'.
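As an illustration (my own sketch, not from the answer above): a program that does nothing but issue tiny write(2) calls to /dev/null spends almost all of its time in the kernel, so time reports it almost entirely as 'sys':
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/null", O_WRONLY);
    if (fd == -1)
        return 1;
    char byte = 'x';
    for (long i = 0; i < 5000000L; ++i) {
        /* each 1-byte write traps into the kernel, which counts as 'sys' */
        (void)write(fd, &byte, 1);
    }
    close(fd);
    return 0;
}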
To expand on the accepted answer, I just wanted to provide another reason why real ≠ user + sys.
Keep in mind that real represents actual elapsed time, while user and sys values represent CPU execution time. As a result, on a multicore system, the user and/or sys time (as well as their sum) can actually exceed the real time. For example, on a Java app I'm running for class I get this set of values:
real 1m47.363s
user 2m41.318s
sys 0m4.013s
• real: The actual time spent in running the process from start to finish, as if it was measured by a human with a stopwatch
• user: The cumulative time spent by all the CPUs during the computation
• sys: The cumulative time spent by all the CPUs during system-related tasks such as memory allocation.
Notice that sometimes user + sys might be greater than real, as
multiple processors may work in parallel.
Minimal runnable POSIX C examples
To make things more concrete, I want to exemplify a few extreme cases of time with some minimal C test programs.
All programs can be compiled and run with:
gcc -ggdb3 -o main.out -pthread -std=c99 -pedantic-errors -Wall -Wextra main.c
time ./main.out
and have been tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28, Linux kernel 4.18, ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB).
sleep syscall
Non-busy sleep as done by the sleep syscall only counts in real, but not for user or sys.
For example, a program that sleeps for a second:
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    sleep(1);
    return EXIT_SUCCESS;
}
GitHub upstream.
outputs something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
The same holds for programs blocked on IO becoming available.
For example, the following program waits for the user to enter a character and press enter:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("%c\n", getchar());
    return EXIT_SUCCESS;
}
GitHub upstream.
And if you wait for about one second, it outputs just like the sleep example something like:
real 0m1.003s
user 0m0.001s
sys 0m0.003s
For this reason time can help you distinguish between CPU and IO bound programs: What do the terms "CPU bound" and "I/O bound" mean?
Multiple threads
The following example does niters iterations of useless purely CPU-bound work on nthreads threads:
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <inttypes.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

uint64_t niters;

void* my_thread(void *arg) {
    uint64_t *argument, i, result;
    argument = (uint64_t *)arg;
    result = *argument;
    for (i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    *argument = result;
    return NULL;
}

int main(int argc, char **argv) {
    size_t nthreads;
    pthread_t *threads;
    uint64_t rc, i, *thread_args;

    /* CLI args. */
    if (argc > 1) {
        niters = strtoll(argv[1], NULL, 0);
    } else {
        niters = 1000000000;
    }
    if (argc > 2) {
        nthreads = strtoll(argv[2], NULL, 0);
    } else {
        nthreads = 1;
    }
    threads = malloc(nthreads * sizeof(*threads));
    thread_args = malloc(nthreads * sizeof(*thread_args));

    /* Create all threads */
    for (i = 0; i < nthreads; ++i) {
        thread_args[i] = i;
        rc = pthread_create(
            &threads[i],
            NULL,
            my_thread,
            (void*)&thread_args[i]
        );
        assert(rc == 0);
    }

    /* Wait for all threads to complete */
    for (i = 0; i < nthreads; ++i) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        printf("%" PRIu64 " %" PRIu64 "\n", i, thread_args[i]);
    }

    free(threads);
    free(thread_args);
    return EXIT_SUCCESS;
}
GitHub upstream + plot code.
Then we plot wall, user and sys as a function of the number of threads for a fixed 10^10 iterations on my 8 hyperthread CPU:
Plot data.
From the graph, we see that:
• for a CPU-intensive single-core application, wall and user are about the same
• for 2 cores, user is about 2x wall, which means that the user time is counted across all threads: user basically doubled, while wall stayed the same
• this continues up to 8 threads, which matches the number of hyperthreads on my computer
• after 8, wall starts to increase as well, because we don't have any extra CPUs to put more work into in a given amount of time! The ratio plateaus at this point.
Note that this graph is only so clear and simple because the work is purely CPU-bound: if it were memory-bound, we would see a fall in performance much earlier, with fewer cores, because memory access would be the bottleneck, as shown at What do the terms "CPU bound" and "I/O bound" mean?
Quickly checking that wall < user is a simple way to determine that a program is multithreaded, and the closer that ratio is to the number of cores, the more effective the parallelization is, e.g.:
multithreaded linkers: Can gcc use multiple cores when linking?
C++ parallel sort: Are C++17 Parallel Algorithms implemented already?
Sys heavy work with sendfile
The heaviest sys workload I could come up with was to use sendfile, which does a file copy operation in kernel space: Copy a file in a sane, safe and efficient way
So I imagined that this in-kernel memcpy would be a CPU-intensive operation.
First I initialize a large 10GiB random file with:
dd if=/dev/urandom of=sendfile.in.tmp bs=1K count=10M
Then run the code:
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv) {
    char *source_path, *dest_path;
    int source, dest;
    struct stat stat_source;
    if (argc > 1) {
        source_path = argv[1];
    } else {
        source_path = "sendfile.in.tmp";
    }
    if (argc > 2) {
        dest_path = argv[2];
    } else {
        dest_path = "sendfile.out.tmp";
    }
    source = open(source_path, O_RDONLY);
    assert(source != -1);
    dest = open(dest_path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    assert(dest != -1);
    assert(fstat(source, &stat_source) != -1);
    assert(sendfile(dest, source, 0, stat_source.st_size) != -1);
    assert(close(source) != -1);
    assert(close(dest) != -1);
    return EXIT_SUCCESS;
}
GitHub upstream.
which, as expected, gives mostly system time:
real 0m2.175s
user 0m0.001s
sys 0m1.476s
I was also curious to see if time would distinguish between syscalls of different processes, so I tried:
time ./sendfile.out sendfile.in1.tmp sendfile.out1.tmp &
time ./sendfile.out sendfile.in2.tmp sendfile.out2.tmp &
And the result was:
real 0m3.651s
user 0m0.000s
sys 0m1.516s
real 0m4.948s
user 0m0.000s
sys 0m1.562s
The sys time is about the same for both as for a single process, but the wall time is larger, likely because the processes are competing for disk read access.
So it seems that time does in fact account for which process started a given piece of kernel work.
Bash source code
When you do just time <cmd> on Ubuntu, it uses the Bash keyword, as can be seen from:
type time
which outputs:
time is a shell keyword
So we grep the Bash 4.19 source code for the output string:
git grep '"user\b'
which leads us to execute_cmd.c function time_command, which uses:
gettimeofday() and getrusage() if both are available
times() otherwise
all of which are Linux system calls and POSIX functions.
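A sketch of reading those same sources directly from C (my own example, not from the Bash source): gettimeofday(2) for 'real' and getrusage(2) for 'user' and 'sys':
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

int main(void) {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);

    /* some purely CPU-bound work */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 200000000UL; ++i)
        x += i;

    gettimeofday(&t1, NULL);

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);  /* ru_utime -> 'user', ru_stime -> 'sys' */

    double real = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    printf("real %.3f\nuser %.3f\nsys  %.3f\n", real, user, sys);
    return 0;
}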
GNU Coreutils source code
If we call it as:
/usr/bin/time
then it uses the GNU Coreutils implementation.
This one is a bit more complex, but the relevant source seems to be at resuse.c and it does:
a non-POSIX BSD wait3 call if that is available
times and gettimeofday otherwise
Real shows the total turn-around time of a process,
while User shows the execution time of user-space instructions,
and Sys is the time spent executing system calls!
Real time also includes waiting time (waiting for I/O, etc.).
In very simple terms, I like to think about it like this:
real is the actual amount of time it took to run the command (as if you had timed it with a stopwatch)
user and sys are how much 'work' the CPU had to do to execute the command. This 'work' is expressed in units of time.
Generally speaking:
user is how much work the CPU did to run the command's code
sys is how much work the CPU had to do to handle 'system overhead' type tasks (such as allocating memory, file I/O, etc.) in order to support the running command
Since these last two times are counting 'work' done, they don't include time a thread might have spent waiting (such as waiting on another process or for disk I/O to finish).
real, however, is a measure of actual runtime and not 'work', so it does include any time spent waiting (which is why sometimes real > usr+sys).
And finally, sometimes the reverse is true (usr+sys > real) for multi-threaded applications. This also arises because we are comparing 'work-time' to actual time. For example, if 3 processors each run continuously for 10 minutes to execute a command, you will get real = 10m but usr = 30m.
I want to mention another scenario, where the real time is much, much bigger than user + sys. I've created a simple server which responds after a long time:
real 4.784
user 0.01s
sys 0.01s
The issue is that in this scenario the process spends its time waiting for the response, which happens neither in user space nor in the system.
Something similar happens when you run the find command. In that case, the time is spent mostly on requesting and waiting for a response from the SSD.
I must mention that, at least on my AMD Ryzen CPU, user is always larger than real in a multi-threaded program (or a single-threaded program compiled with -O3),
e.g.:
real 0m5.815s
user 0m8.213s
sys 0m0.473s

Why is the same memcpy glibc implementation faster on Linux and slower on Windows?

I noticed that memcpy is faster on Linux than on Windows on the same hardware. I dual-boot the same box, with an Intel i7 4770 CPU and 16 GB RAM, and run the same compiled C++ code. I'm trying to benchmark memcpy with this code:
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>

typedef std::chrono::high_resolution_clock Clock;

int main() {
    const int mb = 300;
    int size = mb * 1024 * 1024 / sizeof(int);
    auto buffer = new int[size];
    srand(1);
    for (int i = 0; i < size; i++) {
        auto r = abs(rand()) % 2048;
        buffer[i] = std::max<int>(r, 1);
    }
    auto buffer2 = new int[size];
    const int repeats = 100;
    for (int j = 2; j < mb; j += 2) {
        auto start = Clock::now();
        // Copy j Mb
        int size = j * 1024 * 1024 / sizeof(int);
        for (int i = 0; i < repeats; i++) {
            int offset = 0;
            while (offset < size) {
                // Run memcpy on random sizes
                int copySize = buffer[offset];
                memcpy(buffer2, buffer, copySize * sizeof(int));
                offset += copySize;
            }
        }
        auto end = Clock::now();
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        // Time taken per 1Mb
        std::cout << j << "," << diff / j / repeats << std::endl;
    }
}
Linux execution is 10% faster on average: it takes about 20 microseconds/MB on Linux and 22 microseconds/MB on Windows. In both cases it is compiled with GCC 10.2 and the -m64 -O3 -mavx flags. The project I'm working on is an OS database, and there I see even bigger effects, with memcpy and memset being around 30% faster on Linux on random-length copies of small buffers.
Any idea why memcpy on Windows differs from Linux? I'd expect memcpy to be written in assembly language and to depend not on the OS but only on the CPU architecture.
memcpy is part of the standard C library, and as such, is provided by the operating system on which you run your code (or an alternative provider if you use a different libc). For small copies of known sizes, GCC will often inline these operations because it can often avoid the overhead of a function call, but for large or unknown sizes, it will often use the system function.
In this case, you're seeing that glibc and Windows have different implementations, and glibc provides a better option. glibc does provide several different variants on different platforms based on what works best for a given CPU, but Windows may not do so, or may have a less optimized implementation.
In the past, glibc has even taken advantage of the fact that memcpy cannot have overlapping arguments and copied backwards on some CPUs, but that unfortunately broke some programs which did not comply with the standard, notably Adobe Flash Player. However, such an implementation was permissible and was indeed faster.
Instead of memcpy being slower, you could be finding that Windows has a different memory handling strategy. For example, it is common not to fault in all of the memory when it is first allocated. You may be finding that Linux, which in some cases will prefault subsequent pages, may be performing better here because of that optimization or a different one. If Windows has chosen not to do that, it could be because it complicates the code, or because it doesn't perform well on real-world use cases that are commonly run on Windows. What performs well in a synthetic benchmark may or may not match what performs well in the real world.
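One way to test that hypothesis (my own sketch, not part of this answer) is to touch every destination page once before the timed loop, so that first-touch page faults are paid up front instead of inside memcpy. Assuming the usual 4 KiB page size:
#include <stddef.h>

/* Write one byte per page so the OS backs the buffer with physical
   memory before the benchmark starts. */
static void prefault(void *buf, size_t bytes) {
    char *p = (char *)buf;
    for (size_t off = 0; off < bytes; off += 4096)
        p[off] = 0;
    if (bytes > 0)
        p[bytes - 1] = 0;  /* make sure the last page is touched too */
}
Calling prefault(buffer2, size * sizeof(int)) right after the second allocation in the question's benchmark should remove most of the page-fault cost from the measured numbers, whichever OS handles faulting less aggressively.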
Ultimately, this is a quality of implementation issue. The standard requires that the functions it specifies behave in a specified way, and doesn't specify performance characteristics. Some projects choose to include optimized memcpy implementations if performance of that function is very important to them. Others choose not to and prefer to advise users to choose a platform that best meets their needs, taking into account that some platforms may perform better than others.

What is the best way to test the performance of a program in Linux?

Suppose I write a program, and then I make some "optimization" to the code.
In my case I want to test how much the new feature std::move in C++11 can improve the performance of a program.
I would like to check whether the "optimization" really makes sense.
Currently I test it by the following steps:
write the program (without std::move), compile, get binary file m1
optimize it (using std::move), compile, get binary file m2
use the command time to compare the time consumed:
time ./m1 ; time ./m2
EDITED:
In order to get a statistical result, the test needed to be run thousands of times.
Are there any better ways to do that, or are there tools that can help with it?
In general, measuring performance using a simple time comparison, e.g. endTime - beginTime, is always a good start for a rough estimate.
Later on you can use a profiler, like Valgrind, to get measures of how different parts of your program are performing.
With profiling you can measure space (memory) or time complexity of a program, usage of particular instructions or frequency/duration of function calls.
There's also AMD CodeAnalyst if you want more advanced profiling functionality using a GUI. It's free/open source.
There are several tools (among others) that can do profiling for you:
GNU gprof
google gperftools
intel VTune amplifier (part of the intel XE compiler package)
kernel perf
AMD CodeXL (successor of AMD CodeAnalyst)
Valgrind
Some of them require a specific way of compilation or a specific compiler. Some of them are specifically good at profiling for a given processor-architecture (AMD/Intel...).
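For example, a typical gprof workflow looks roughly like this (a sketch from general experience; file names are placeholders, and the other tools each have their own invocation):
g++ -pg -O2 -o m1 main.cpp     # build with profiling instrumentation
./m1                           # run normally; this writes gmon.out
gprof ./m1 gmon.out > profile.txt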
Since you seem to have access to C++11 and if you just want to measure some timings you can use std::chrono.
#include <chrono>
#include <iostream>

class high_resolution_timer
{
private:
    typedef std::chrono::high_resolution_clock clock;
    clock::time_point m_time_point;
public:
    high_resolution_timer (void)
        : m_time_point(clock::now()) { }
    void restart (void)
    {
        m_time_point = clock::now();
    }
    template<class Duration>
    Duration stopover (void)
    {
        return std::chrono::duration_cast<Duration>
            (clock::now()-m_time_point);
    }
};

int main (void)
{
    using std::chrono::microseconds;
    high_resolution_timer timer;
    // do stuff here
    microseconds first_result = timer.stopover<microseconds>();
    timer.restart();
    // do other stuff here
    microseconds second_result = timer.stopover<microseconds>();
    std::cout << "First took " << first_result.count() << " x 10^-6;";
    std::cout << " second took " << second_result.count() << " x 10^-6.";
    std::cout << std::endl;
}
But you should be aware that there's almost no sense in optimizing several milliseconds of overall runtime (if your program runtime will be >= 1s). You should instead time highly repetitive events in your code (if there are any, or at least those which are the bottlenecks). If those improve significantly (and this can be in terms of milli or microseconds) your overall performance will likely increase, too.

Why does the Nvidia Visual Profiler show overlapped data transfers in the timeline for purely synchronized code?

The timeline generated by the Nsight Visual Profiler looks very strange. I don't write any transfer-overlapping code, but you can see overlap between MemCpy and compute kernels.
This makes me unable to debug the real overlapping code.
I use CUDA 5.0, a Tesla M2090, CentOS 6.3, and 2x Xeon E5-2609 CPUs.
Does anyone have a similar problem? Does it occur only on certain Linux distributions? How can I fix it?
This is the code.
#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/device_ptr.h>

int main()
{
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_MTGP32);

    const size_t m = 5000, n = 1000;
    const double alpha = 1.0;
    const double beta = 0.0;

    thrust::host_vector<double> h(n * m, 0.1);
    thrust::device_vector<double> a(m * n, 0.1);
    thrust::device_vector<double> b(n * m, 0.1);
    thrust::device_vector<double> c(m * m, 0.1);
    cudaDeviceSynchronize();

    for (int i = 0; i < 10; i++)
    {
        curandGenerateUniformDouble(rng,
            thrust::raw_pointer_cast(&a[0]), a.size());
        cudaDeviceSynchronize();

        thrust::copy(h.begin(), h.end(), b.begin());
        cudaDeviceSynchronize();

        cublasDgemm(hd, CUBLAS_OP_N, CUBLAS_OP_N,
            m, m, n, &alpha,
            thrust::raw_pointer_cast(&a[0]), m,
            thrust::raw_pointer_cast(&b[0]), n,
            &beta,
            thrust::raw_pointer_cast(&c[0]), m);
        cudaDeviceSynchronize();
    }

    curandDestroyGenerator(rng);
    cublasDestroy(hd);
    return 0;
}
This is the captured profile timeline.
Compute Capability 2.* (Fermi) devices are capable of both kernel-level concurrency and kernel/copy concurrency. In order to trace concurrent kernels, the kernel start and end timestamps are collected in a separate clock domain from the memory copy timestamps. The tool is responsible for correlating these different clocks. In your screenshot I believe there is a scaling factor difference (bad correlation), as you can see each memory copy is not off by a constant value but by a scaled offset.
If you use the option --concurrent-kernels off in nvprof I think the problem will disappear. When concurrent kernels are disabled the memory copy and kernel timing use the same source clock for timestamps.
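For example, something along these lines (an untested sketch; check your nvprof version for the exact option spelling):
nvprof --concurrent-kernels off ./a.out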
Compute Capability 3.* (Kepler) and 5.* (Maxwell) devices have a different mechanism for timing compute kernels. For these devices it is possible in the tools to see overlap between the end timestamp of a kernel and the start of a memory copy or kernel. The work does not actually overlap. There was a design decision in the tools between allowing this potential for apparent overlap (usually <500 ns) or introducing it as a constant overhead between dependent work. The tools decided to avoid introducing the overhead, at the cost of potentially showing a very small level of overlap on serialized work.

memcpy() performance- Ubuntu x86_64

I am observing some weird behavior which I am not able to explain. Following are the details:
#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>

void memcpy_test() {
    int size = 32*4;
    char* src = new char[size];
    char* dest = new char[size];
    general_utility::ProcessTimer tmr;
    unsigned int num_cpy = 1024*1024*16;
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for(unsigned int i=0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
    std::cout << "time = " << (double)(end_time__.tv_nsec - start_time__.tv_nsec)/num_cpy << std::endl;
    delete [] src;
    delete [] dest;
}
When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions which could show this type of behavior?
EDIT 1:
Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated
EDIT 2:
Following is the detailed performance analysis (__builtin_memcpy()):
size = 32 * 4, without -march=native - 7.5 ns, with -march=native - 19.3
size = 32 * 8, without -march=native - 26.3 ns, with -march=native - 26.5
EDIT 3 :
This observation does not change even if I allocate int64_t/int32_t.
EDIT 4 :
size = 8192, without -march=native ~ 2750 ns, with -march=native ~ 2750 ns (earlier there was an error in reporting this number; it was wrongly written as 26.5, now it is correct)
I have run these many times and numbers are consistent across each run.
I have replicated your findings with: g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 and ~9.
When I specify -march=native in compiler options, generated binary runs 2.7 times slower. Why is that ?
Because the -march=native version gets compiled into (found using objdump -D; you could also use gcc -S -fverbose-asm):
rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8
And the version without gets compiled into 16 load/store pairs like:
mov 0x20(%rbp),%rdx
mov %rdx,0x20(%rbx)
Which apparently is faster on our computers.
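For reference, one way to reproduce that comparison yourself (file and object names are my own placeholders; use whatever optimization level your build uses):
g++ -O2 -march=native -c memcpy_test.cpp -o native.o
g++ -O2 -c memcpy_test.cpp -o generic.o
objdump -D native.o   # look for the rep movsq
objdump -D generic.o  # look for the series of mov load/store pairs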
If anything, I would expect -march=native to produce optimized code.
In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The first version is shorter, which might be better in some (most?) cases. Or it could be a bug in the optimizer.
Is there other functions which could show this type of behavior ?
Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, or having a name beginning with __builtin. Possibly also (floating-point) math functions.
Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated
This is because then they both compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of loads/stores (it would be interesting to see if this also holds on other platforms). BTW, when the compiler doesn't know the size at compile time (e.g. int size = atoi(argv[1]);), it simply turns into a call to memcpy, with or without the switch.
It's a quite well-known issue (and a really old one).
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052
Look at one of the bottom comments in the bug report:
"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this
problem"
Looks like glibc's memcpy is far better than builtin...
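By analogy with the mesa workaround quoted above, a quick check (my own suggestion, not from the answer) is to call plain memcpy instead of __builtin_memcpy in the test and build with the builtin expansion disabled, which forces a real call into glibc:
g++ -O2 -march=native -fno-builtin-memcpy -o memcpy_test memcpy_test.cpp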