Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)? - linux

I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained. I.e. a faster algorithm will always run faster in both debug and optimization builds.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.

What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.

You are talking about two different things here. Debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the consequence of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols, its running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag that you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of losing compiler optimization. However, once again, this is not 'overhead', its just the absence of compiler optimization.

Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.

Multithreaded code execution time is not always measured as expected by gprof.
You should time your code with an other timer in addition to gprof to see the difference.
My example: Running LULESH CORAL benchmark on a 2NUMA nodes INTEL sandy bridge (8 cores + 8 cores) with size -s 50 and 20 iterations -i, compile with gcc 6.3.0, -O3, I have:
With 1 thread running: ~3,7 without -pg and ~3,8 with it, but according to gprof analysis the code has ran only for 3,5.
WIth 16 threads running: ~0,6 without -pg and ~0,8 with it, but according to gprof analysis the code has ran for ~4,5 ...
The time in bold has been measured gettimeofday, outside the parallel region (start and end of main function).
Therefore, maybe if you would have measure your application time the same way, you would have seen the same speeduo with and without -pg. It is just the gprof measure which is wrong in parallel. In LULESH openmp version either way.

Related

Build on AMD, run on Intel?

If I cargo build --release a Rust binary on an AMD CPU and then run it on an Intel (or vice versa), could that be a problem (compatibility issues and/or considerable performance sacrifice)? I know we can use a target-cpu=<cpu> flag and that should result in a potentially more optimized machine code for the target platform. My questions are:
Practically speaking, if we build for one platform but run on the other, should we expect a significant runtime performance penalty?
If we build on AMD with target-cpu=intel (or vice versa), could the compilation itself be:
slower?
restricted in how well it could optimize for the target platform?
Note: Linux will be the OS for both compilation and running.
In general, if you just do a cargo build --release without further configuration, then you will get a binary that runs on any machine of the relevant architecture. That is, it will run on either an Intel or AMD CPU that's x86-64. It will also be optimized for a general CPU of that architecture, regardless of what type of CPU you build it on. The specific settings are going to depend on whatever rustc and LLVM are configured for, but unless you've done a custom build, that's usually the case.
Usually that is sufficient for most people's needs and building for a target CPU is unnecessary. However, if you specify a particular CPU, then it will be optimized for that CPU and may contain instructions that don't run on other CPUs. For example, the architectural definition for x86-64 doesn't contain things like AVX, which is a later addition, so if you compile for a CPU providing those instructions then rustc may use them, which may cause it to perform worse or not at all elsewhere.
It's impossible to say more about your particular situation without knowing more about your code and performance needs. My recommendation is just to ues cargo build --release and not to optimize for a specific CPU unless you have measured the code and determined that there is a particular section which is slow and which would benefit from that. Most people benefit greatly from the additional portability of the code and don't need CPU-specific optimizations.
Everything I've said here is also true of other sets of architectures. If you compile for aarch64-unknown-linux-gnu or riscv64gc-unknown-linux-gnu, it will build for a generic CPU of that type which works on all systems of that type unless you specify different options. The exception tends to be on systems like macOS, where it is specifically known that all CPUs that run on that OS have some specific set of features, and thus a compilation for e.g. x86-64 CPUs on macOS might optimize for Intel CPUs with given features since macOS only runs on hardware that uses those CPUs.

Linux kernel re-compilation too slow

I am compiling the Linux kernel in a VM(virtual box) with 2 out of 4GB and 4 out of 8 CPUs allocated. My initial compilation took around 8-9 hours, and I was using make -j4 optimisation too. Now I added a simple system call to the kernel and just ran the make -j4 and and it has been compiling for the past 3 hours. I thought that after the initial compilation, make would only compile the small changes but it seems to be compiling everything (mostly the drivers). Is there any way I can speed up this compilation process?
For example, is there anyway by which I can disable some of the drivers that I don't really need, for example if I just want to implement a simple system call, I don't really need all the networking drivers, and maybe that would speed it up? i.e. I just want the bare minimum functionality for my kernel to test my system calls.
Compiling the kernel will always take a very long time, unfortunately there's no way around that besides having a really good processor with a lot of multi threading, however in huge projects like this, ccache will help with compilation times tremendously, it's not perfect, but far better than just compiling objects.
You won't see the difference at the initial compilation, but it will speed up recompilation by using the cache it has generated instead of compiling most of what already has been compiled before.

How to collect some readable stack traces with perf?

I want to profile C++ program on Linux using random sampling that is described in this answer:
However, if you're in a hurry and you can manually interrupt your
program under the debugger while it's being subjectively slow, there's
a simple way to find performance problems.
The problem is that I can't use gdb debugger because I want to profile on production under heavy load and debugger is too intrusive and considerably slows down the program. However I can use perf record and perf report for finding bottlenecks without affecting program performance. Is there a way to collect a number of readable (gdb like) stack traces with perf instead of gdb?
perf does offer callstack recording with three different techniques
By default is uses the frame pointer (fp). This is generally supported and performs well, but it doesn't work with certain optimizations. Compile your applications with -fno-omit-frame-pointer etc. to make sure it works well.
dwarf uses a dump of the sack for each sample for post-processing. That has a significant performance penalty
Modern systems can use hardware-supported last branch record, lbr.
The stack is accessible in perf analysis tools such as perf report or perf script.
For more details check out man perf-record.

Static Analysis Tools for GLib and GDBus

Does anyone know of any tools or techniques for detecting memory leaks when using GLib and GDBus? I am relatively new to using both libraries and believe I am using the API's correctly, but it would be great if there was a tool that I could use to confirm that I am cleaning up my resources correctly. I have ran my code through various lint-type programs, but these likely do not detect anything abstracted away into a library.
I am looking for either a tool aimed specifically at GLib or GDBus or a tool that I could instrument so target these libraries? Maybe there are even some compile time flags that I can set for GLib or GDBus?
I just recently did some voodoo with glib/gdbus/libsoup and from my experience valgrind and valgrind/massif do a very good job (though not really static analysis but runtime analysis).
valgrind (use malloc even for g_slice_alloc/g_slice_new, makes valgrind less confused, gc-friendly nullifies all glib internal pointers)
G_DEBUG=gc-friendly G_SLICE=always-malloc valgrind ./yourapp
There will be still false positives in the output – use a supression file to hide them.
massif (use resident modules to prevent a lot of noise)
G_DEBUG=resident-modules valgrind --tool=massif --depth=10 --max-snapshots=1000 --alloc-fn=g_malloc --alloc-fn=g_realloc --alloc-fn=g_try_malloc --alloc-fn=g_malloc0 --alloc-fn=g_mem_chunk_alloc --threshold=0.01 ./yourapp --your --app --options
Use some visualization tool to make massifs output readable (couple of MB logs) massif-visualizer does a good job
Keep in mind that glib has a couple of MB of static allocated stuff (all the GObject type classes)
If you need to debug the libraries themself, there is no way around compiling them with debug flags (-g)

How to profile program on Linux platform without rebuilding?

I've used two profiling tools (VTune on windows and dbx (within sunstudio) on Solaris) which can profile program without rebuild them, and during profiling, the program just run at the same speed as normal. Both of these 2 features saved me a lot of time.
Now I want to know if there is some free tools available on Linux platform can do the same thing. I think I need profiling tools based on sampling. VTune is good but expensive ... I've heard of gprof and valgrind. But seems gprof need instrument the program (so we have to rebuild the program) and valgrind will slow down the program execution quite a lot. (from valgrind's introduction, Cachegrind runs programs about 20--100x slower than normal, and Callgrind which I need to profiling is based on Cachegrind)
For profiling, I just need to figure out the execution time of function calls so I can find out where the performance degradation happens. Actually I don't need many low level profiling information as Cachegrind provided...
oprofile is pretty good, but it can be difficult to set up. It also doesn't require you to rebuild your program.
Agreeing with Paul, I think Zoom is probably the best Linux profiler you can pay for.
However, for real results, I rely on this simple method, that I've been using since before profilers were invented.
Performance Counters for Linux is a new tool usable on kernels 2.6.31 and later; it's less intrusive (to both the program and the system as a whole) than valgrind or OProfile.
A nicer option than oprofile is Zoom. It's similar to Shark on Mac OS X, if you have ever used that. It's commercial ($199) but you can get a free trial from www.rotateright.com.

Resources