Application running slow after adding GDB symbols - Side effect? - linux

We have an application of about 20 MB in release mode. It is meant to run on MIPS under Linux 2.6.12. The debug build of the same application is about 42 MB, built with optimization switched off and the -g flag added. The additional 22 MB comes only from the GDB debug symbols embedded in the binary (no logging or print statements were added).
Will the debug build run slower than the release image, and if so, why?
Also, as far as I know, running strip on debug_image should give me release_image, but I observe the following:
debug_image = 42MB
strip debug_image = 24MB
release_image = 20MB
Why is there a difference between the stripped debug_image and release_image?
Are there any other side effects of embedding GDB symbols into the application?

Will the debug build run slower than the release image, and if so, why?
Yes, it will, because optimizations are switched off, which is the case in your debug build.
Why is there a difference between the stripped debug_image and release_image?
Because optimizations are enabled in release, the compiler generates less code (dead-code elimination, inlining, and so on), so the whole image shrinks. Stripping only removes the symbol information; it does not undo compiling at -O0, which is why the stripped debug image is still larger than the release image.
Are there any other side effects of embedding GDB symbols into the application?
GDB will take longer to load the symbols, and more memory will be required to hold them.

Related

Reduce application binary size in debug mode

When I compile a GTK4 "Hello World" application in Rust, I get a 192 MB binary in debug mode. I use an old SSD and worry about its wear, as I compile and debug very frequently. I tried the -C prefer-dynamic flag, but the binary only shrank to 188 MB.
Is there a way to make the debug binary much smaller?
PS: I work on Windows 10 and use MSYS2.
PPS: I don't have a problem with the release build's size. With -C link-arg=-s and lto = true it is about 200 kB.

Crash in ID3DXConstantTable SetFloat/SetVector

We have an application with a render engine developed in Direct3D/C++. Recently we have come across a crash (access violation) involving ID3DXConstantTable SetFloat/SetVector; it shows up inside D3dx9_42.dll when we attach a debugger to release binaries with PDBs. The crash vanishes when we reduce the number of D3DPOOL render-target textures in use, yet our estimate of the GPU memory load is nowhere near even half of what is available, as we are using 3 GB NVIDIA cards.
Suspecting heap corruption from memory overwrites, we reviewed the code and then ran Application Verifier together with a debugger to root out overwrites that might crash at a later stage of execution. We found and fixed a few issues, but the crash at the very first frame render in ID3DXConstantTable SetFloat/SetVector remains. More info: this is a 32-bit application running with the LARGEADDRESSAWARE flag. Any pointers?
A moment later I found the issue myself. I ran the application with the registry switch MEM_TOP_DOWN (AllocationPreference=0x100000) and it instantly crashed at the first SetFloat() location. That led me to discover that the constant table has to be retrieved using D3DXGetShaderConstantTableEx() with the D3DXCONSTTABLE_LARGEADDRESSAWARE flag :) Thanks.
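For anyone hitting the same thing, here is a minimal sketch of that fix, assuming the compiled shader bytecode is already in an ID3DXBuffer; pShaderCode, pDevice and hParam are placeholder names, not from the original code:
// Fetch the constant table explicitly with the large-address-aware flag so
// its internal handles stay valid above the 2 GB boundary.
LPD3DXCONSTANTTABLE pConstants = NULL;
HRESULT hr = D3DXGetShaderConstantTableEx(
    static_cast<const DWORD*>(pShaderCode->GetBufferPointer()), // compiled bytecode
    D3DXCONSTTABLE_LARGEADDRESSAWARE,                           // needed for /LARGEADDRESSAWARE apps
    &pConstants);
if (SUCCEEDED(hr))
{
    // SetFloat/SetVector are now safe to call.
    pConstants->SetFloat(pDevice, hParam, 0.0f);
}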

Is there a way to disable CPU cache (L1/L2) on a Linux system?

I am profiling some code on a Linux system (running on an Intel Core i7 4500U) to obtain only the execution costs. The application is the demo mpeg2dec from libmpeg2. I am trying to obtain a probability distribution for the mpeg2 execution times; however, we want to see the raw execution cost with the cache switched off.
Is there a way I can disable the CPU cache on my system via a Linux command or a gcc flag? Or set the CPU (L1/L2) cache size to 0 KB? Or even add some code change to disable the cache? Of course, without modifying or rebuilding the kernel.
See this 2012 thread, where someone posted the source of a tiny kernel module that disables the cache through asm:
http://www.linuxquestions.org/questions/linux-kernel-70/disabling-cpu-caches-936077/
If disabling the cache is really necessary, then so be it.
Otherwise, if you want to know how much time a process takes in terms of user or system "cycles", I would recommend the getrusage() function.
#include <sys/resource.h>
struct rusage usage;
getrusage(RUSAGE_SELF, &usage);  // fills in user and system CPU time for this process
You can call it before and after your loop/test and subtract the values to get a good idea of how much time your process took, even if many other processes run in parallel on the same machine. The main problem you'd get is if your process starts swapping; in that case your timings will be off.
double user_usage = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1000000.0;   // user CPU time in seconds
double system_usage = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1000000.0; // system CPU time in seconds
This is really precise in my experience. To increase precision, you could run your test as root and give it a negative priority (-1 or -2 is enough). Then it won't be swapped out until you call a function that may require it.
Of course, you still get the effect of the cache... assuming you do not handle a very large amount of data with code that goes on and on (as opposed to having a loop).
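Putting it together, a minimal self-contained sketch of that before/after measurement; work() is just a stand-in for your own loop or test:
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

// Stand-in for the code being measured.
static void work()
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; ++i)
        x += i * 0.5;
}

// Convert a struct timeval to seconds.
static double seconds(const timeval &tv)
{
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main()
{
    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    work();
    getrusage(RUSAGE_SELF, &after);

    printf("user:   %.6f s\n", seconds(after.ru_utime) - seconds(before.ru_utime));
    printf("system: %.6f s\n", seconds(after.ru_stime) - seconds(before.ru_stime));
    return 0;
}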

Slow linking in release with optimization disabled

I have a project (VC2005) which takes an unreasonable time (over 40 minutes) to link in Release, while it links in less than 5 seconds in Debug.
Both builds have incremental linking disabled and all files are located on the same drive.
Disabling Linker optimization in Release does not help.
Task Manager never shows more than 150,000 K of memory used by the linker, which is nothing for a computer with 3 GB of RAM.
I build much bigger projects and have never noticed such a difference in build time.
Any ideas why this happens?
As remarked, the most probable reason is /LTCG (whole program optimization).
Other factors might be individual files compiled with /Gy (you should see some warnings in the output), or /OPT:REF, /OPT:ICF (check project properties/linker/optimization), or - very unlikely - you're unknowingly running some phase of PGO instrumentation.

Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained; i.e., a faster algorithm will always run faster in both the debug and the optimized build.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols; it's running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols; that's a flag you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of compiler optimization. However, once again, this is not 'overhead'; it's just the absence of compiler optimization.
Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.
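As a rough illustration only (not taken from the question's code): with -pg, gcc emits a call to the profiling hook mcount (or __fentry__ on newer x86 targets) at the entry of every instrumented function, which is the extra call you would spot while single-stepping:
extern "C" void mcount(void);   // profiling hook supplied by the -pg runtime

long fib(int n)
{
    // mcount();  // <- roughly what gcc inserts here when compiling with -pg
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
In a threaded build, every one of those hook calls updates shared profiling state, which is where the suspected locking overhead would come from.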
Multithreaded code execution time is not always measured as expected by gprof.
You should time your code with an other timer in addition to gprof to see the difference.
My example: running the LULESH CORAL benchmark on a two-NUMA-node Intel Sandy Bridge machine (8 + 8 cores) with size -s 50 and 20 iterations (-i), compiled with gcc 6.3.0 at -O3, I get:
With 1 thread running: ~3.7 s without -pg and ~3.8 s with it, but according to the gprof analysis the code ran for only 3.5 s.
With 16 threads running: ~0.6 s without -pg and ~0.8 s with it, but according to the gprof analysis the code ran for ~4.5 s ...
The times above were measured with gettimeofday(), outside the parallel region (at the start and end of the main function).
Therefore, if you had measured your application time the same way, you might have seen the same speedup with and without -pg. It is just the gprof measurement that is wrong in parallel, at least for the OpenMP version of LULESH.
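A minimal sketch of that kind of wall-clock measurement; run_workload() is a placeholder for the threaded code being timed:
#include <stdio.h>
#include <sys/time.h>
#include <thread>
#include <vector>

// Placeholder for the threaded workload being measured.
static void run_workload()
{
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            volatile double x = 0.0;
            for (long j = 0; j < 50000000L; ++j)
                x += j * 0.5;
        });
    for (auto &t : workers)
        t.join();
}

int main()
{
    timeval start, end;
    gettimeofday(&start, NULL);   // wall-clock time, unaffected by gprof's per-thread accounting
    run_workload();
    gettimeofday(&end, NULL);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1000000.0;
    printf("elapsed: %.3f s\n", elapsed);
    return 0;
}
Compile with -pthread and compare the elapsed time with and without -pg; the difference isolates the real instrumentation overhead from gprof's misleading per-thread accounting.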

Resources