Eigen ConjugateGradient solver not running multithreaded - multithreading

I have a sparse matrix A of size (91716x91716) with 3096684 nonzero elements, and a dense vector rhs. I am solving the system with a ConjugateGradient this way:
#include <Eigen/Sparse>
using namespace Eigen;

initParallel();
ConjugateGradient<SparseMatrix<double>, Lower|Upper> solver;
solver.compute(A);
const VectorXd response = solver.solve(rhs);
I'm compiling with:
g++ -O3 -I./eigen -fopenmp -msse2 -DEIGEN_TEST_SSE=ON -o example example.cpp
The executions, both with multi-threading and without, take approximately the same time (around 1500 ms).
I am using Eigen version 3.2.8.
Is there any reason why the multi-threading is not performing better? I actually don't see the multithreading effect in my system monitor. Is there any other way to accelerate this process?
Edit:
A call to Eigen::nbThreads() reports 12 threads.

Documentation for version 3.2.8:
Currently, the following algorithms can make use of multi-threading: general matrix - matrix products, PartialPivLU
http://eigen.tuxfamily.org/dox/TopicMultiThreading.html
As the development documentation lists more algorithms with multi-threading support, you need to switch to Eigen 3.3-beta1 or the development branch to get a parallel ConjugateGradient.
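For reference, a minimal sketch of what that might look like on Eigen 3.3 or later, assuming the headers are on the include path and the program is built with -fopenmp. RowMajor storage is used because, per the 3.3 multithreading notes, the row-major sparse matrix-vector products are what get parallelized; the identity matrix is just a placeholder for the real A:

#include <Eigen/Sparse>
#include <iostream>

int main() {
    Eigen::initParallel();      // call once before any parallel Eigen work
    Eigen::setNbThreads(4);     // optional; OMP_NUM_THREADS is honored otherwise

    // Placeholder system; substitute the real 91716x91716 matrix and rhs.
    const int n = 1000;
    Eigen::SparseMatrix<double, Eigen::RowMajor> A(n, n);
    A.setIdentity();
    Eigen::VectorXd rhs = Eigen::VectorXd::Ones(n);

    Eigen::ConjugateGradient<Eigen::SparseMatrix<double, Eigen::RowMajor>,
                             Eigen::Lower | Eigen::Upper> solver;
    solver.compute(A);
    Eigen::VectorXd x = solver.solve(rhs);

    std::cout << "threads: " << Eigen::nbThreads()
              << ", iterations: " << solver.iterations() << "\n";
    return 0;
}

Compile with something like g++ -O3 -fopenmp -I./eigen example.cpp -o example; without -fopenmp, Eigen falls back to a single thread regardless of these calls.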

Related

MSVC /arch:[instruction set] - SSE3, AVX, AVX2

Here is an example of a class which shows supported instruction sets. https://msdn.microsoft.com/en-us/library/hskdteyh.aspx
I want to write three different implementations of a single function, each of them using a different instruction set. But because of a flag such as /ARCH:AVX2, the app would never run on anything but 4th-generation-or-later Intel processors, so the whole point of checking becomes moot.
So, the question is: what exactly does this flag do? Does it enable support, or does it enable compiler optimizations that use the given instruction sets?
In other words, can I completely remove this flag and keep using functions from immintrin.h, emmintrin.h, etc?
Using the /ARCH:AVX2 option lets the compiler make the best use of the CPU's YMM registers and AVX2 instructions. But if the CPU does not support those instructions, the program will crash. And if you use AVX2 instructions while compiling with /ARCH:SSE2, you will see a drop in performance (roughly 2x).
So the best approach is to compile every implementation of your function with the corresponding compiler option (/ARCH:AVX2, /ARCH:SSE2, and so on). The easiest way to do this is to put your implementations (scalar, SSE, AVX) in separate files and compile each file with the appropriate options.
It is also a good idea to create a separate file where you check the CPU capabilities and call the corresponding implementation of your function.
There is an example of a library that does this CPU check and calls one of the implemented functions.
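For illustration only, a dispatcher along those lines might look roughly like this on MSVC. The transform_* functions are made-up names standing in for your per-instruction-set implementations (in practice each would live in its own .cpp compiled with its own /arch flag), and a production check should also confirm OS support for the YMM state via _xgetbv:

#include <intrin.h>   // __cpuid / __cpuidex (MSVC)
#include <cstdio>

// Stand-ins for the real implementations; in practice these live in
// separate source files compiled with different /arch options.
void transform_scalar(float* data, int n) { for (int i = 0; i < n; ++i) data[i] += 1.0f; }
void transform_avx2(float* data, int n)   { for (int i = 0; i < n; ++i) data[i] += 1.0f; }

static bool cpu_has_avx2() {
    int regs[4] = {0};
    __cpuid(regs, 0);
    if (regs[0] < 7)
        return false;                  // CPUID leaf 7 not available
    __cpuidex(regs, 7, 0);             // leaf 7, sub-leaf 0
    return (regs[1] & (1 << 5)) != 0;  // EBX bit 5 = AVX2
}

void transform(float* data, int n) {
    // Decide at run time which implementation is safe to call.
    if (cpu_has_avx2())
        transform_avx2(data, n);
    else
        transform_scalar(data, n);
}

int main() {
    float v[8] = {0};
    transform(v, 8);
    std::printf("AVX2 path available: %s\n", cpu_has_avx2() ? "yes" : "no");
    return 0;
}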

Is multithreaded FFTW deterministic

I am getting slightly different results between runs in my program. It uses multi-threaded FFTW, planned with the FFTW_ESTIMATE flag. Is multi-threaded FFTW deterministic:
For fixed number of threads?
Between different numbers of threads used at different runs?
The FFTW FAQ says that the FFTW_ESTIMATE flag results in the same algorithm being used between runs, but it does not explicitly say whether it is deterministic in the multi-threaded case.
The fftw documentation:
http://www.fftw.org/fftw3_doc/Thread-safety.html#Thread-safety
stipulates that only fftw_execute is reentrant. So it's hard to say without more info about your usage. Also:
"If you are configured FFTW with the --enable-debug or --enable-debug-malloc flags (see Installation on Unix), then fftw_execute is not thread-safe. These flags are not documented because they are intended only for developing and debugging FFTW, but if you must use --enable-debug then you should also specifically pass --disable-debug-malloc for fftw_execute to be thread-safe."

How to speed up compilation time in linux

While compiling under Linux I use the flag -j16, as I have 16 cores. I am just wondering whether it makes any sense to use something like -j32. This is really a question about scheduling of processor time, and whether it is possible to put more pressure on one particular process than on another this way (say I have two parallel compilations, each with -j16; what if one of them used -j32?).
I think it does not make much sense, but I am not sure, as I do not know how the kernel handles such things.
Kind regards,
I use a non-recursive build system based on GNU make and I was wondering how well it scales.
I ran benchmarks on a 6-core Intel CPU with hyper-threading. I measured compile times using -j1 to -j20. For each -j option, make ran three times and the shortest time was recorded. Using -j9 gives the shortest compile time, 11% better than -j6.
In other words, hyper-threading does help a little, and an optimal formula for Intel processors with hyper-threading is number_of_cores * 1.5:
Chart data is here.
The rule of thumb is to use the number of processors + 1. Hyper-threading counts, so a quad-core CPU with HT should use -j9.
Setting the value too high is counter-productive. If you do want to speed up compile times, consider ccache to cache compiled objects that do not change between compilations, and distcc to distribute the compilation across several machines.
We have a machine in our shop with the following characteristics:
256-core SPARC, Solaris
~64 GB RAM
Some of that memory is used for a RAM drive mounted at /tmp
Back when it was originally set up, before other users discovered its existence, I ran some timing tests to see how far I could push it. The build in question is non-recursive, so all jobs are kicked off from a single make process. I also cloned my repo into /tmp to take advantage of the RAM drive.
I saw improvements up to -j56. Beyond that my results flatlined, much like Maxim's graph, until somewhere above (roughly) -j75, where performance began to degrade. Running multiple parallel builds, I could push it beyond the apparent cap of -j56.
The primary make process is single-threaded; after running some tests I realized the ceiling I was hitting had to do with how many child processes the primary thread could service. That ceiling was lowered further by anything in the makefiles that either required extra time to parse (e.g., using = instead of :=, which incurs unnecessary delayed evaluation; complex user-defined macros; etc.) or used things like $(shell).
These are the things I've been able to do to speed up builds that have a noticeable impact:
Use := wherever possible
If you assign to a variable once with :=, then later with +=, it'll continue to use immediate evaluation. However, ?= and +=, when a variable hasn't been assigned previously, will always delay evaluation.
Delayed evaluation doesn't seem like a big deal until you have a large enough build. If a variable (like CFLAGS) doesn't change after all the makefiles have been parsed, then you probably don't want to use delayed evaluation on it (and if you do, you probably already know enough about what I'm talking about anyway to ignore my advice).
If you create macros you execute with the $(call) facility, try to do as much of the evaluation ahead of time as possible
I once got it in my head to create macros of the form:
IFLINUX = $(strip $(if $(filter Linux,$(shell uname)),$(1),$(2)))
IFCLANG = $(strip $(if $(filter-out undefined,$(origin CLANG_BUILD)),$(1),$(2)))
...
# an example of how I might have made the worst use of it
CXXFLAGS = ${whatever flags} $(call IFCLANG,-fsanitize=undefined)
This build produces over 10,000 object files, about 8,000 of which are from C++ code. Had I used CXXFLAGS := (...), it would only need to immediately replace ${CXXFLAGS} in all of the compile steps with the already evaluated text. Instead it must re-evaluate the text of that variable once for each compile step.
An alternative implementation that can at least help mitigate some of the re-evaluation if you have no choice:
ifneq 'undefined' '$(origin CLANG_BUILD)'
IFCLANG = $(strip $(1))
else
IFCLANG = $(strip $(2))
endif
... though that only helps avoid the repeated $(origin) and $(if) calls; you'd still have to follow the advice about using := wherever possible.
Where possible, avoid using custom macros inside recipes
The reasoning should be pretty obvious after the above: anything that requires a variable or macro to be repeatedly evaluated for every compile/link step will degrade your build speed. Every macro/variable evaluation occurs in the same thread as the one that kicks off new jobs, so any time spent evaluating is time during which make delays starting another parallel job.
I put some recipes in custom macros whenever it promotes code re-use and/or improves readability, but I try to keep it to a minimum.

Why do you have to use both a compiler flag and a run-time flag to get multicore-support in Haskell?

The Haskell wiki shows that you need to both set a compilation flag and a run-time flag to get multi-core support. Why isn't using the library enough to get the correct behavior at compile time? Why can't the run-time executable detect it was compiled with -threaded and use all cores on the system unless otherwise specified? I think turning these on by default would be better. Then there could be flags to turn off or modify these features.
http://www.haskell.org/haskellwiki/GHC/Concurrency#Multicore_GHC says:
Compile your program using the -threaded switch.
Run the program with +RTS -N2 to use 2 threads, for example. You should use a -N value equal to the number of CPU cores on your machine (not including Hyper-threading cores).
It seems somewhat onerous to have flags one must set both at compile time and again at run time. Are these flags vestigial remains of the effort to add concurrency to GHC?
While you're developing the program the extra +RTS ... shouldn't be a big deal (though I admit it struck me as odd when I first picked up Haskell). For the final (shipped) binary you can link it with static RTS options (GHC manual) by providing a C file containing char *ghc_rts_opts = "-N";.
EDIT: Updating this question for GHC 7.x, there is now a way to specify RTS options at compile time:
ghc -threaded -rtsopts -with-rtsopts=-N
This (1) uses the threaded runtime system, (2) enables RTS options, and (3) sets the RTS option to use as many threads as there are cores available (use -Nx, where x is a number, to control the number of OS threads manually).
Why can't the run-time executable detect it was compiled with -threaded and use all cores on the system unless otherwise specified?
That's an interesting feature request!
You could ask for it on the GHC feature tracker: http://hackage.haskell.org/trac/ghc/wiki/ReportABug
From GHC User guide (version 6.12.1):
Omitting x, i.e. +RTS -N -RTS, lets the runtime choose the value of x itself based on how many processors are in your machine.
I suppose there's no specific reason for this not to be the default, apart from the authors' vision of what the defaults should be. (Note that this also enables parallel GC, which may not always be what you want by default.)

Why does my code run slower with multiple threads than with a single thread when it is compiled for profiling (-pg)?

I'm writing a ray tracer.
Recently, I added threading to the program to exploit the additional cores on my i5 Quad Core.
In a weird turn of events the debug version of the application is now running slower, but the optimized build is running faster than before I added threading.
I'm passing the "-g -pg" flags to gcc for the debug build and the "-O3" flag for the optimized build.
Host system: Ubuntu Linux 10.4 AMD64.
I know that debug symbols add significant overhead to the program, but the relative performance has always been maintained. I.e., a faster algorithm will always run faster in both debug and optimized builds.
Any idea why I'm seeing this behavior?
Debug version is compiled with "-g3 -pg". Optimized version with "-O3".
Optimized no threading: 0m4.864s
Optimized threading: 0m2.075s
Debug no threading: 0m30.351s
Debug threading: 0m39.860s
Debug threading after "strip": 0m39.767s
Debug no threading (no-pg): 0m10.428s
Debug threading (no-pg): 0m4.045s
This convinces me that "-g3" is not to blame for the odd performance delta, but that it's rather the "-pg" switch. It's likely that the "-pg" option adds some sort of locking mechanism to measure thread performance.
Since "-pg" is broken on threaded applications anyway, I'll just remove it.
What do you get without the -pg flag? That's not debugging symbols (which don't affect the code generation), that's for profiling (which does).
It's quite plausible that profiling in a multithreaded process requires additional locking which slows the multithreaded version down, even to the point of making it slower than the non-multithreaded version.
You are talking about two different things here: debug symbols and compiler optimization. If you use the strongest optimization settings the compiler has to offer, you do so at the cost of losing symbols that are useful in debugging.
Your application is not running slower due to debugging symbols; it's running slower because of less optimization done by the compiler.
Debugging symbols are not 'overhead' beyond the fact that they occupy more disk space. Code compiled at maximum optimization (-O3) should not be adding debug symbols. That's a flag that you would set when you have no need for said symbols.
If you need debugging symbols, you gain them at the expense of losing compiler optimization. However, once again, this is not 'overhead'; it's just the absence of compiler optimization.
Is the profile code inserting instrumentation calls in enough functions to hurt you?
If you single-step at the assembly language level, you'll find out pretty quick.
gprof does not always measure the execution time of multithreaded code the way you would expect.
You should time your code with another timer in addition to gprof to see the difference.
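For instance, a minimal wall-clock measurement around the region of interest (using std::chrono here; the figures below were taken with gettimeofday) is independent of gprof's per-thread accounting:

#include <chrono>
#include <cstdio>

int main() {
    const auto start = std::chrono::steady_clock::now();

    // ... run the parallel region / whole benchmark here ...

    const auto end = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(end - start).count();
    std::printf("wall-clock time: %.3f s\n", seconds);
    return 0;
}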
My example: running the LULESH CORAL benchmark on a two-NUMA-node Intel Sandy Bridge machine (8 cores + 8 cores) with size -s 50 and 20 iterations (-i 20), compiled with gcc 6.3.0 at -O3, I get:
With 1 thread running: ~3.7 without -pg and ~3.8 with it, but according to the gprof analysis the code ran for only 3.5.
With 16 threads running: ~0.6 without -pg and ~0.8 with it, but according to the gprof analysis the code ran for ~4.5 ...
These times were measured with gettimeofday, outside the parallel region (at the start and end of the main function).
Therefore, if you had measured your application's time the same way, you might have seen the same speedup with and without -pg. It is just the gprof measurement that is wrong in the parallel case, at least for the OpenMP version of LULESH.
