I need a parallel linear algebra factorization library for OSX, installable as painlessly as possible (i.e., at most I can ask my colleagues to use Homebrew), because of the number of DOFs in my problems.
I've tried Armadillo: it supports sparse algebra, which is what I need, and I can link against the Accelerate framework, but AFAIK it only solves linear systems and doesn't expose the factorization.
Next I tried MKL, but nothing I do seems to trigger threading, even with TBB:
#include <mkl.h>
#include <tbb/task_scheduler_init.h>

tbb::task_scheduler_init scheduler(4);
mkl_set_dynamic(true);         // note: this allows MKL to use fewer threads than requested
mkl_set_num_threads(4);
mkl_set_num_threads_local(4);
Eigen could be cool, but it seems that, like MKL, it won't run in parallel.
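For reference, this is the kind of minimal test I would expect to run multi-threaded (a sketch, assuming Eigen is compiled with OpenMP enabled via -fopenmp; Eigen threads its dense GEMM kernel):

#include <Eigen/Dense>
#include <cstdio>

int main() {
    Eigen::setNbThreads(4);  // request 4 threads for Eigen's threaded kernels
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(2000, 2000);
    Eigen::MatrixXd c = a * b;     // multi-threaded only when OpenMP is enabled
    std::printf("%f\n", c(0, 0));  // keep the result live
    return 0;
}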
Do you have any suggestions?
The clang shipped with OSX does not support OpenMP, which is required for multi-threaded Eigen and MKL.
According to the Intel® Math Kernel Library Link Line Advisor, MKL does not support TBB threading with clang.
It does, however, seem to support the Intel OpenMP library with the extra link option -liomp5. You could try whether that works. If not, you may have to use another compiler such as gcc, which you can install through Homebrew.
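For reference, a minimal sketch to check whether MKL threading is active (the link line follows the Link Line Advisor's suggestion for clang with Intel OpenMP threading; the exact paths and flags are assumptions that depend on your MKL install):

// Build (assumed): clang++ test.cpp -I${MKLROOT}/include -L${MKLROOT}/lib \
//     -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
#include <mkl.h>
#include <cstdio>

int main() {
    mkl_set_dynamic(0);      // don't let MKL silently reduce the thread count
    mkl_set_num_threads(4);
    std::printf("mkl_get_max_threads() = %d\n", mkl_get_max_threads());
    return 0;
}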
Do we have a compiler for RISC-V vector instructions now? I have searched online and it seems we still don't have one.
It seems some work has already been done on various RISC-V cores. For example, the PULP project from ETH Zurich and Università di Bologna designs SIMD-like extensions and maintains a correspondingly modified GCC.
There is some preliminary work on LLVM: https://github.com/rkruppe/rvv-llvm. There are also multiple custom extensions that do similar things but do not follow the (not yet frozen) standard. Most notably, the RI5CY core from the PULP project has been used not only in academia but also in commercial ASICs such as the gapuino (GAP8) and the VEGABoard (RV32M1), which can be used with a GCC port.
Also, see some pointers regarding upstream and SiFive GCC support for the V extension here.
I have to install some software (the gromacs simulation package) without root access on a cluster, where jobs are submitted through slurm. I only have direct access to the front-end machine, and the home directory is shared among all the servers and the front-end. I had to manually build and install locally:
gcc 4.8
automake, autoconf, cmake
openmpi
lapack libs
gromacs
Right now, I have installed all of this only on the front-end, which is an older Intel Xeon machine; the production servers have newer AMD processors instead. Here is my question: to achieve optimal performance, which parts of the stack above should be recompiled on the production servers? I guess it makes sense to rebuild the final software (gromacs) and maybe the lapack libs, because of the different instruction sets and processor architecture, but I'm not sure whether it makes any sense to rebuild the compiler or other parts of the system. Hence the question: does using a compiler (and the associated libraries) built on a different machine result in higher execution times for the generated binaries?
In general, I'd expect a compiler to produce the same binaries when given the same input, so the answer would be no; but what about the libraries (such as libstdc++) that were compiled together with the compiler on the other machine?
Thank you.
To optimize gromacs (a parallel molecular dynamics code), you can forget about recompiling the compiler or the compilation tools: that's useless.
Instead, you should look at the optimizations available for your hardware. For Intel CPUs, using the Intel C compiler makes a difference, and you may well see some gains on AMD as well.
Another alternative is to use the Portland Group compiler.
Regarding MPI, you need to make sure it is configured for your interconnect (for example, if you have InfiniBand, avoid using the standard TCP transport).
Regarding the lapack libraries, you should install an optimized implementation (ACML for AMD, MKL for Intel); GotoBLAS or ATLAS also give very good performance and are included in many Linux distros.
You have not mentioned FFTs: they are indeed important for the electrostatics (Ewald summations) in the simulations, and FFTW is a good choice here. You need to install the version built for the target processor, or compile it on that processor, because FFTW performs a sort of "auto-tuning": SIMD support is chosen when the library is built, and its planner benchmarks candidate algorithms on the machine where it runs.
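For illustration, a minimal sketch of the planner-level tuning (standard FFTW 3 API; run it on the production AMD nodes, not the front-end, so the measurements reflect the right CPU):

#include <fftw3.h>

int main() {
    const int n = 1024;
    fftw_complex* in  = fftw_alloc_complex(n);
    fftw_complex* out = fftw_alloc_complex(n);
    // FFTW_MEASURE times several candidate algorithms on the current machine
    // and picks the fastest, so the plan is tuned to the CPU it runs on.
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}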
Going any lower than this (build tools, the compiler itself) makes no difference to the produced executables.
Building GCC already involves a multi-stage bootstrap, one of whose purposes is to QA the compiler by ensuring that the last two stages produce identical output. So there is no reason to believe that one more stage would have any effect at all.
I want to parallelize my image processing code using OpenMP. Is OpenMP supported by recent versions of OpenCV, such as 2.4.4 or 2.4.5? I know about TBB, but it looks too complicated.
You might consider looking into cv::parallel_for_(). It provides a layer of abstraction over several parallelism backends. If you have compiled OpenCV with OpenMP support, cv::parallel_for_() will use OpenMP when it can. Many OpenCV functions use cv::parallel_for_ internally, but you might have to dig into the source to see whether parallel execution is actually happening.
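For illustration, a minimal sketch of the cv::parallel_for_ pattern as it exists in OpenCV 2.4 (the per-row inversion is just placeholder work):

#include <opencv2/core/core.hpp>

// Derive from cv::ParallelLoopBody; each call processes a sub-range of rows.
class InvertRows : public cv::ParallelLoopBody {
public:
    explicit InvertRows(cv::Mat& img) : img_(img) {}
    void operator()(const cv::Range& range) const {
        for (int r = range.start; r < range.end; ++r) {
            uchar* row = img_.ptr<uchar>(r);
            for (int c = 0; c < img_.cols; ++c)
                row[c] = 255 - row[c];  // placeholder per-pixel work
        }
    }
private:
    cv::Mat& img_;
};

// usage: cv::parallel_for_(cv::Range(0, img.rows), InvertRows(img));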
I am developing an object-recognition system. I found that the critical part of my algorithm is the call to
extractor.compute();
(after detecting keypoints with detector.detect()).
Is there any way to compute the feature vectors with more cores? I can use up to 8 cores.
OpenCV already implements a multithreading framework for this. Check that you compiled OpenCV with a threading option (e.g. TBB or OpenMP) set to 'ON'. It is worth reading the OpenCV documentation; gpu::SURF_GPU may also interest you.
You can run cmake again to see the compilation options you used.
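For illustration, a rough sketch of the gpu::SURF_GPU path (assumes OpenCV 2.4 built with CUDA support; names from the 2.4 gpu module, and "scene.png" is a placeholder input):

#include <opencv2/gpu/gpu.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    // SURF_GPU expects an 8-bit grayscale image
    cv::Mat img = cv::imread("scene.png", CV_LOAD_IMAGE_GRAYSCALE);
    cv::gpu::GpuMat img_gpu(img);  // upload to the GPU
    cv::gpu::GpuMat keypoints_gpu, descriptors_gpu;
    cv::gpu::SURF_GPU surf;
    surf(img_gpu, cv::gpu::GpuMat(), keypoints_gpu, descriptors_gpu);  // detect + compute
    std::vector<cv::KeyPoint> keypoints;
    surf.downloadKeypoints(keypoints_gpu, keypoints);  // copy back to the host
    return 0;
}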
Is there any CUDA library that performs comparison/search operations?
CUDA is a platform and API for writing code that runs on NVIDIA GPUs. Consequently, any operations you need must generally be custom programmed; there is not a wide range of open-source libraries available.
Programmers use 'C for CUDA' (C with NVIDIA extensions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU.
http://en.wikipedia.org/wiki/CUDA
You could look at Thrust, which includes a binary_search operation and is very easy to use.
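For example, a minimal sketch using Thrust's binary_search over a sorted device_vector (compile with nvcc):

#include <thrust/device_vector.h>
#include <thrust/binary_search.h>
#include <cstdio>

int main() {
    thrust::device_vector<int> data(4);
    data[0] = 1; data[1] = 3; data[2] = 5; data[3] = 7;  // input must be sorted
    // binary_search runs on the device and returns whether the value exists
    bool found = thrust::binary_search(data.begin(), data.end(), 5);
    std::printf("found 5: %s\n", found ? "yes" : "no");
    return 0;
}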