Intel MKL 'ddiagemv' multithreaded

I wanted to compare my program (solver for SpMVM, matrix stored in DIA format, written in Fortran) with MKL.
I found the MKL routine ddiagemv and it works perfectly in sequential mode (my code is much faster than MKL!). But then I developed another version of my program, this time with OpenMP. So I looked for a way to make ddiagemv multi-threaded, and I couldn't find anything.
Even though I used export MKL_NUM_THREADS=4, export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=4", export OMP_NUM_THREADS=1, export MKL_DYNAMIC="FALSE" and export OMP_DYNAMIC="FALSE", the timings I get are exactly the same as when I run it sequentially.
Are some MKL routines able to be multi-threaded while others are not? If ddiagemv can be multi-threaded, how can I do that?
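
For readers unfamiliar with DIA storage, the kind of kernel being compared here can be sketched as follows. This is a minimal Python illustration, not the asker's Fortran code and not MKL's implementation. The function name dia_matvec and the row-aligned layout (diagonals[d][i] holds A[i][i + offsets[d]]) are assumptions made for the example.

```python
def dia_matvec(offsets, diagonals, x):
    """Compute y = A @ x for a square matrix A stored by diagonals.

    offsets[d] is the distance of diagonal d from the main diagonal
    (0 = main, positive = above, negative = below).
    diagonals[d][i] holds A[i][i + offsets[d]] (assumed layout); entries
    whose column index falls outside the matrix are padding, never read.
    """
    n = len(x)
    y = [0.0] * n
    for k, diag in zip(offsets, diagonals):
        # Only rows whose column i + k lies inside [0, n) contribute.
        first_row = max(0, -k)
        last_row = min(n, n - k)
        for i in range(first_row, last_row):
            y[i] += diag[i] * x[i + k]
    return y


# Tridiagonal example: 2 on the main diagonal, -1 on both neighbours.
offsets = [-1, 0, 1]
diagonals = [
    [0.0, -1.0, -1.0, -1.0],  # sub-diagonal (row 0 is padding)
    [2.0, 2.0, 2.0, 2.0],     # main diagonal
    [-1.0, -1.0, -1.0, 0.0],  # super-diagonal (last row is padding)
]
y = dia_matvec(offsets, diagonals, [1.0, 2.0, 3.0, 4.0])
print(y)  # [0.0, 0.0, 0.0, 5.0]
```

Each y[i] depends only on row i, which is why a hand-written OpenMP version can split the rows across threads without synchronisation.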

Related

Example code to check parallelism of HSL solver MA97 in IPOPT

I'm working on solving non-linear optimization problems. Currently I'm evaluating different algorithms to find out which one fits my problem best. I'm using MATLAB 2020b on Ubuntu 20.04 LTS.
I now have IPOPT with the HSL solvers up and running. My problem consists of a few hundred variables (~500 at the moment). Switching to MA97 didn't show any performance improvement. Probably my problem is too small? Nevertheless, I'd like to check whether the parallelism of MA97 (compared to e.g. MA27) is working properly, and hence whether I compiled everything correctly.
Is there any sample problem with which I can verify that MA97 runs multi-threaded while MA27 does not?

Several approaches suggested:
Try to debug from Matlab into the native code and see what IPOPT is calling into. This approach is tricky because Matlab itself uses OpenMP.
Use the proc filesystem: if there is more than one subdirectory under /proc/self/task, the process is multi-threaded. This approach has the same issue as above (the Matlab backend will likely be using multi-threading).
Use environment variables to limit the number of OpenMP threads (OMP_THREAD_LIMIT) and check for performance changes. You will need to measure this difference specifically around the call to IPOPT, as again, Matlab will be using OpenMP for its own functionality.
Matlab has a built-in profiler:
% start profiling
profile on
% your code ...
% launch profile viewer
profile viewer
Also, the IPOPT logs may be helpful. If the solver is multithreaded, there should be a difference between elapsed real time and CPU time. This scales with parallelism, i.e.
CPU time ∝ threads count * elapsed real-time
This is a rough approximation which is only valid up to the point you become resource-constrained on the number of threads.
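
The CPU-time versus elapsed-time check can be sketched in a few lines. This is a minimal Python sketch under stated assumptions: the names measure and parallelism_estimate are hypothetical, and the timers only see the current Python process. For IPOPT called from Matlab you would instead take the two timings from the IPOPT log and feed them to the ratio.

```python
import time


def parallelism_estimate(cpu_seconds, wall_seconds):
    """Rough number of busy threads: CPU time divided by elapsed time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


def measure(func, *args):
    """Run func once and return (result, cpu_seconds, wall_seconds).

    time.process_time() sums CPU time over all threads of this process,
    so a multithreaded native call makes cpu_seconds exceed wall_seconds.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


# Hypothetical timings: 8 s of CPU time over 2 s of elapsed time
# suggests roughly four threads were busy.
print(parallelism_estimate(8.0, 2.0))  # 4.0
```

A ratio close to 1 suggests the solver ran on a single thread; a ratio well above 1 suggests several threads were active.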

I hope you have already solved your problem, but I want to reply to help others. If you pass the option linear_solver ma97, IPOPT should use the HSL MA97 solver. I don't know how this can be done from MATLAB, but if you add an "ipopt.opt" file to the working directory, IPOPT will read this file and apply the specified options.
File content: (no equality sign)
linear_solver ma97

Multithreaded MKL and Eigen

I need a parallel linear algebra factorization library on OSX, installed as painlessly as possible (i.e., at most I can ask my colleagues to use HomeBrew), because of the number of DOFs I have in my problems.
I've tried Armadillo, it supports sparse algebra which is what I need, I can link with the Accelerate framework, but it just solves linear problems, it doesn't support factorization AFAIK.
Next, MKL, but nothing I can do seems to trigger threading, even with TBB:
tbb::task_scheduler_init scheduler(4);
mkl_set_dynamic(true);
mkl_set_num_threads(4);
mkl_set_num_threads_local(4);
Eigen could be cool, but it seems that, like MKL, it won't run in parallel.
Do you have any suggestions?

OSX clang does not support OpenMP, which is required by multi-threaded Eigen and MKL.
According to the Intel® Math Kernel Library Link Line Advisor, MKL does not support TBB threading with clang.
But it seems to support the Intel OpenMP library with the extra link option -liomp5. You could try whether that works. If not, you may have to use another compiler such as gcc, which you can find in HomeBrew.

Fastest Matrix Inverse Excel VBA

I have many equations with many unknowns (my data is in Excel) and currently I am using matrix method to solve them.
I use inbuilt MMULT (matrix multiply) and MINVERSE (matrix inverse) in following form :- Result = MMULT (MINVERSE(matrix1),matrix2)
Here lies the problem: my matrices are of the order of 2000 x 2000 or more, and Excel takes a lot of time to compute the inverse (matrix multiplication is very fast).
What is the best way for me to speed up the process? I don't mind exporting data to any external 3rd party program (compatible with Excel) and then importing the inverted matrix back to Excel.
I don't know much C / C++, but I feel that if we compile the matrix inverse function into a DLL and then call it from Excel VBA, maybe the speed will improve. Please help.
Please see following link :
Why is MATLAB so fast in matrix multiplication?
It is found that MATLAB has highest speed for matrix calculations. Can we use any such library in Excel / VBA ?
I have found certain libraries such as LAPACK, Armadillo etc. which are in C / C++ / C# or .NET. How can I use compiled versions of these libraries in my Excel VBA?

I am the author of the blog linked by John Coleman. There are quite a few options available, depending on your familiarity with compiling and linking to different languages. I'd say the main options are (in order of ease of implementation):
Use the Python Numpy and Scipy libraries with ExcelPython or xlwings (which now includes ExcelPython). The most recent post on the subject at my blog (including free downloads) is:
https://newtonexcelbach.wordpress.com/2016/01/04/xlscipy-python-scipy-for-excel-update-with-new-functions/
Advantages:
Free
Easy to implement (especially using Anaconda Python, which now includes xlwings).
No compilation required.
The Scipy code is fast, and includes packages for sparse matrices, which where applicable makes a huge difference to solving time, and the maximum size of matrix that can be handled.
Free and open-source spreadsheet implementation available on my blog.
Disadvantages:
The overhead in transferring large data sets from VBA to Python can be significant, in fact for small to medium sized matrices it is often greater than the solution time.
Use the Python version of the Alglib library. Search my blog for Alglib for recent examples using Python, and older examples using C++ and C#.
Advantages and disadvantages:
As for Numpy and Scipy, except there is a commercial version available as well, which may offer performance advantages (I haven't tried it).
Use Numpy/Scipy with Pyxll.
https://newtonexcelbach.wordpress.com/2013/09/10/python-matrix-functions-using-pyxll/ for sample spreadsheet.
Advantages:
The data transfer overhead should be much reduced.
More mature, and better documented, than the current xlwings.
Disadvantages:
Commercial package (but free for evaluation and non-commercial use)
Not open source.
Use open source C++ or Fortran packages with Visual Studio or other compiler, to produce xll based functions.
Advantages:
Potentially the best all-round performance.
Disadvantages:
More coding required.
Implementation more difficult, especially if you want to distribute to others.
32 bit/ 64 bit issues likely to be much harder to resolve.
As mentioned in the comments, matrix inversion is much slower than other matrix solution methods, but even allowing for that, the built-in Excel matrix functions are very slow compared with alternative compiled routines, and installing one of the alternatives listed above is well worth the effort.
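
To illustrate the point about inversion, a system A x = b can be solved directly without ever forming the inverse. The following is a minimal pure-Python sketch using Gaussian elimination with partial pivoting. The function name solve_linear_system is hypothetical, and this is not what Excel or any of the libraries above do internally. For 2000 x 2000 matrices you would call a compiled routine (for example through Scipy) rather than this code.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows (each a list of n floats), b a list of n floats.
    Neither input is modified.
    """
    n = len(a)
    # Work on an augmented copy [a | b].
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


# 2x + y = 3 and x + 3y = 5 have the solution x = 0.8, y = 1.4.
x = solve_linear_system([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

In spreadsheet terms this replaces MMULT(MINVERSE(matrix1), matrix2) with a single solve of matrix1 against matrix2.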

Measure function execution time without modifying code

I have found a function in a library that could be improved by compiler optimization (the main idea is to find a good case for digging deeper into compilers). I want to automate measuring the execution time of this function with a script. Because it is a low-level library function that takes arguments, it is difficult to extract. So I want to find a way to measure exactly this function (precise CPU time) without modifying the library, application, or environment. Do you have any ideas how to achieve that?
I could write a wrapper, but in the near future I'll need to performance-test many more applications, and I think writing a wrapper for every one would be very ugly.
P.S.: My code will run on the ARM (armv7el) architecture, which has some kind of "Performance Monitor Control" registers. I have learned about "perf" in the Linux kernel, but I don't know whether it is what I need.

It is not clear if you have access to the source code of the function you want to profile or improve, i.e. if you are able to recompile the considered library.
If you are using a recent GCC (that is 4.6 at least) on a recent Linux system, you could use profilers like gprof (assuming you are able to recompile the library) or better oprofile (which you could use without recompiling), and you could customize GCC for your needs.
Be aware that like any measurements, profiling may alter the observed phenomenon.
If you are considering customizing the GCC compiler for optimization purposes, consider making a GCC plugin, or better yet, a MELT extension, for that purpose (MELT is a high-level domain specific language to extend GCC). You could also customize GCC (with MELT) for your own specific profiling purposes.

Easiest way to use GPU for parallel for loop

I currently have a parallel for loop similar to this:
int testValues[16]={5,2,2,10,4,4,2,100,5,2,4,3,29,4,1,52};
parallel_for (1, 100, 1, [&](int i){
    int var4;
    int values[16]={-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
    /* ...nested for loops */
    for (var4=0; var4<16; var4++) {
        if (values[var4] != testValues[var4]) break;
    }
    /* ...end nested loops */
});
I have optimised as much as I can to the point that the only thing more I can do is add more resources.
I am interested in utilising the GPU to help process the task in parallel. I have read that embarrassingly parallel tasks like this can make use of a modern GPU quite effectively.
Using any language, what is the easiest way to use the GPU for a simple parallel for loop like this?
I know nothing about GPU architectures or native GPU code.

as Li-aung Yip said in comments, the simplest way to use a GPU is with something like Matlab that supports array operations and automatically (more or less) moves those to the GPU. but for that to work you need to rewrite your code as pure matrix-based operations.
otherwise, most GPU use still requires coding in CUDA or OpenCL (you would need to use OpenCL with an AMD card). even if you use a wrapper for your favourite language, the actual code that runs on the GPU is still usually written in OpenCL (which looks vaguely like C). and so this requires a fair amount of learning/effort. you can start by downloading OpenCL from AMD and reading through the docs...
both those options require learning new ideas, i suspect. what you really want, i think, is a high level, but still traditional-looking, language targeted at the gpu. unfortunately, they don't seem to exist much, yet. the only example i can think of is theano - you might try that. even there, you still need to learn python/numpy, and i am not sure how solid the theano implementation is, but it may be the least painful way forwards (in that it allows a "traditional" approach - using matrices is in many ways easier, but some people seem to find that very hard to grasp, conceptually).
ps it's not clear to me that a gpu will help your problem, btw.

You might want to check out ArrayFire.
http://www.accelereyes.com/products/arrayfire
If you use OpenCL, you need to download separate implementations for the different device vendors: Intel, AMD, and Nvidia.

You might want to look into OpenACC, which enables parallelism via directives. You can port your code (C/C++/Fortran) to heterogeneous systems while maintaining source code that still runs well on a homogeneous system. Take a look at this introduction video. OpenACC is not GPU programming as such, but a way of expressing parallelism in your code, which may help you achieve performance improvements without much knowledge of low-level languages such as CUDA or OpenCL. OpenACC is available in commercial compilers from PGI, Cray, and CAPS (PGI offers new users a free 30 day trial).
