Matlab 2011a: Use all Cores Available on 64-bit Linux?

Hi, I've looked online but I can't seem to find an answer: do I need to do anything to make MATLAB use all cores? From what I understand, multi-threading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest hang at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB does seem to use both cores at 100%. Not all of the time, but often enough. On the 64-bit, 4-core PC this never happens.
Do I have to do anything on 64-bit to get MATLAB to use all the cores whenever possible? I had to do some custom linking after install, as MATLAB wasn't finding the libraries (e.g. libc.so.6) because it wasn't looking in the correct places.

Out of the box, since the latest release, the Parallel Computing Toolbox lets you use up to 12 cores. Without this toolbox, I guess you're out of luck. Any additional cores can be accessed through the MATLAB Distributed Computing Server, where you actually pay per worker.
To make MATLAB use your multiple cores you have to run
matlabpool open
And it of course works better if you actually have parallelized code (for example using the spmd construct or parfor loops).
More info at the Matlab homepage

MATLAB runs its main computation on a single thread.
That said, multiple threads are created for certain functions that use the multithreaded features of the BLAS libraries MATLAB uses underneath.
Thus, you will only gain a 'multi-threaded' advantage if you are calling functions that use these multithreaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now, whether all of your cores get used depends on your OS: I believe the OS has to load-balance your threads across the cores. One CANNOT set affinities for threads from within MATLAB. One can, however, set worker MATLAB processes to have affinities to cores from within the Parallel Computing Toolbox.
However, you could always try manually setting the affinity of the MATLAB process to all your processors (on Linux this is typically done with taskset); details are available at the following link for Linux.
Windows users can simply right-click on the process in Task Manager and set the affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.

Related

CPU/Threads usage on M1 Pro (Apple Silicon) using openMP

hope someone knows the answer to this...
I have a code that compiles perfectly well with openMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all the 8 or 10 cores I have.
I am setting the threads variable correctly, as export OMP_NUM_THREADS=10, so the code correctly identifies that it's supposed to be running with 10 threads (see the screenshot from my Activity Monitor below):
[Screenshot: Activity Monitor]
The screenshot shows that the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but long for a comment...
If both LLVM and GCC behave the same, then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads has been created.) I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface to bind threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores. That could mean that statically scheduled loops are inefficient.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...

Will an 8-CPU cloud machine run code 8x faster than a 1-CPU machine without changes to the code?

I am a beginner and I have no clue yet about cloud computing, multithreading, or multiprocessing.
I have a desktop PC with an i7 (4 cores) and I was wondering if a multi-CPU cloud machine OR an 8+ core machine would run ANY CODE faster than my PC without any changes to the code.
Does the machine handle task distribution across the several CPUs (or the 8+ cores) by itself, or is it required to adapt the code (multithreading or multiprocessing)?
For the sake of argument, let's say I run a simple loop like the one below:
results = {}
for i in range(10**8):
    results[i] = i**2
This takes about 67 sec on my PC (I was running something else at the same time, so I'm not sure this is accurate, but my timing is irrelevant anyway).
Would the exact same code be faster on a multi-CPU machine or an 8+ core machine compared to a single-CPU 4-core machine?
If it is, in fact, required to make changes, I would appreciate any beginner links to learn about multiprocess or multithread.
Thank you for your help.
I'm no expert, but I think it really depends on the platform you're using to write and run your code. Some languages support multi-threading/multi-processing natively, and such code will run faster, but others might not.
One thing is for certain: you can't say that in 100% of cases a machine with more cores/CPUs will run a given piece of code faster than a machine with fewer cores/CPUs.
Hope I helped clear things up.
Edit:
This Medium post regarding multiprocessing/multithreading in Python looks good - Multithreading vs Multiprocessing in Python 🐍
Python multiprocessing for dummies
Code will run faster only when it is written in a parallel fashion. The snippet in your question is not written to be parallel, so it won't run any faster.
When a parallel program is written, the programmer keeps the target level of parallelization in mind. A sequential program has parallelization level = 1. A program with N CPU-intensive threads runs most effectively on N processors (cores). A program with a high parallelization level may execute more slowly on a 2-4 core machine than the sequential variant.

How to use Rmpi in R on linux Cluster to increase cores available with DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available using detectCores(). On my PC I have 4 cores with 2 threads each, so it detects 8 cores and then sends the hydrological model out to a core with different parameter values, and the results are returned to the master. It does this hundreds or thousands of times, iterating the parameters to try to find an optimum set. Therefore the more cores available, the faster it will work.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads), and if I used one of these it would work two to three times faster than my PC. Great. However, ideally I would spread the code across other servers so I could have access to more cores, with all the info sent back to the master.
Therefore, my question is: how could I include Rmpi in my code to effectively increase the number of cores available? As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(doParallel)
library(Rmpi)
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually...
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
        control=list(NP=10*n, itermax=maxIt, parallelType=2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option. If you don't use mpirun, the workers will all be spawned on the local machine.
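For completeness, a minimal sketch of the doMPI variant mentioned above might look something like the following; the worker-count behaviour and the DEoptim call reuse the assumptions from the example above (Genrose, n, maxIt), so treat it as a starting point rather than a tested script:
library(doMPI)
library(DEoptim)
# Spawn MPI workers; when the script is launched with mpirun (e.g. -np 1,
# as noted above), startMPIcluster() takes the worker count from the MPI environment.
cl <- startMPIcluster()
registerDoMPI(cl)
# DEoptim with parallelType=2 then farms evaluations out via foreach,
# exactly as in the doParallel example above.
res <- DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
               control=list(NP=10*n, itermax=maxIt, parallelType=2))
closeCluster(cl)
mpi.quit()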

Writing CUDA program for more than one GPU

I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically, utilizing the resources of all available GPUs for the program?
Ideally it would be a utility that periodically reports the available resources, so that my program can launch a corresponding number of threads to drive the GPUs.
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux; what changes do I need to make in my program?
I have more than one GPU and want to execute my kernels on them. Is there an API or software that can schedule/manage GPU resources dynamically?
For arbitrary kernels that you write, there is no API that I am aware of (certainly no CUDA API) that "automatically" makes use of multiple GPUs. Today's multi-GPU aware programs often use a strategy like this:
detect how many GPUs are available
partition the data set into chunks based on the number of GPUs available
successively transfer the chunks to each GPU, and launch the computation kernel on each GPU, switching GPUs using cudaSetDevice().
A program that follows the above approach, approximately, is the cuda simpleMultiGPU sample code. Once you have worked out the methodology for 2 GPUs, it's not much additional effort to go to 4 or 8 GPUs. This of course assumes your work is already separable and the data/algorithm partitioning work is "done".
I think this is an area of active research in many places, so if you do a google search you may turn up papers like this one or this one. Whether these are of interest to you will probably depend on your exact needs.
There are some new developments with CUDA libraries available with CUDA 6 that can perform certain specific operations (e.g. BLAS, FFT) "automatically" using multiple GPUs. To investigate this further, review the relevant CUBLAS XT documentation and CUFFT XT multi-GPU documentation and sample code. As far as I know, at the current time, these operations are limited to 2 GPUs for automatic work distribution. And these allow for automatic distribution of specific workloads (BLAS, FFT) not arbitrary kernels.
Secondly, I am using Windows + Visual Studio for my development. I have read that CUDA is supported on Linux; what changes do I need to make in my program?
With the exception of the OGL/DX interop APIs, CUDA is mostly orthogonal to the choice of Windows or Linux as a platform. The typical IDEs are different (Windows: Nsight Visual Studio Edition; Linux: Nsight Eclipse Edition), but your code changes will mostly consist of ordinary porting differences between Windows and Linux. If you want to get started with Linux, follow the getting started document.

How to make R use all processors?

I have a quad-core laptop running Windows XP, but looking at Task Manager R only ever seems to use one processor at a time. How can I make R use all four processors and speed up my R programs?
I have a basic system I use where I parallelize my programs on the "for" loops. This method is simple once you understand what needs to be done. It only works for local computing, but that seems to be what you're after.
You'll need these libraries installed:
library("parallel")
library("foreach")
library("doParallel")
First you need to create your computing cluster. I usually do other stuff while running parallel programs, so I like to leave one core free. The "detectCores" function will return the number of cores in your computer.
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl, cores = detectCores() - 1)
Next, call your for loop with the "foreach" command, along with the %dopar% operator. I always use a "try" wrapper to make sure that any iterations where the operations fail are discarded, and don't disrupt the otherwise good data. You will need to specify the ".combine" parameter, and pass any necessary packages into the loop. Note that "i" is defined with an equals sign, not an "in" operator!
data = foreach(i = 1:length(filenames), .packages = c("ncdf","chron","stats"),
               .combine = rbind) %dopar% {
  try({
    # your operations; line 1...
    # your operations; line 2...
    # your output
  })
}
Once you're done, clean up with:
stopCluster(cl)
The CRAN Task View on High-Performance Computing with R lists several options. XP is a restriction, but you can still get something like snow to work using sockets within minutes.
As of version 2.14.0, R comes with native support for multi-core computations. Just load the parallel package
library("parallel")
and check out the associated vignette
vignette("parallel")
I hear tell that Revolution R supports better multi-threading than the typical CRAN version of R, and Revolution also supports 64-bit R on Windows. I have been considering buying a copy, but I found their pricing opaque. There's no price list on their web site. Very odd.
I believe the multicore package works on XP. It gives some basic multi-process capability, especially through offering a drop-in replacement for lapply() and a simple way to evaluate an expression in a new thread (mcparallel()).
On Windows I believe the best way to do this would probably be with foreach and snow as David Smith said.
However, Unix/Linux-based systems can compute using multiple processes with the 'multicore' package. It provides a high-level function, 'mclapply', that applies a function over a list in parallel across multiple cores. An advantage of the 'multicore' package is that each worker process gets a private copy of the Global Environment that it may modify. Initially, this copy is just a pointer to the Global Environment, which makes sharing variables extremely quick as long as the Global Environment is treated as read-only.
Rmpi requires that the data be explicitly transferred between R processes instead of relying on the fork/copy-on-write approach that 'multicore' uses.
-- Dan
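For illustration, a small sketch of the 'multicore'-style functions mentioned above, which nowadays ship with base R's parallel package (fork-based, so Unix/Linux/Mac only):
library(parallel)
# Drop-in parallel replacement for lapply, using forked worker processes
res <- mclapply(1:100, function(i) i^2, mc.cores = 4)
# Evaluate an expression asynchronously in a forked child, then collect the result
job <- mcparallel(sum(rnorm(1e6)))
mccollect(job)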
If you do a lot of matrix operations and you are using Windows, you can install Revolution R Open (revolutionanalytics.com/revolution-r-open) for free; it comes with the Intel MKL libraries, which allow you to do multithreaded matrix operations. On Windows, if you take the libiomp5md.dll, Rblas.dll and Rlapack.dll files from that install and overwrite the ones in whatever R version you like to use, you'll have multithreaded matrix operations (typically you get a 10-20x speedup for matrix operations). Or you can use the ATLAS Rblas.dll from prs.ism.ac.jp/~nakama/SurviveGotoBLAS2/binary/windows/x64, which also works on 64-bit R and is almost as fast as the MKL one. I found this the single easiest thing to do to drastically increase R's performance on Windows systems. Not sure why they don't come as standard on R Windows installs, in fact.
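A quick, rough way to check whether the BLAS swap took effect is to time a large matrix multiply before and after the change (the matrix size here is arbitrary):
# Time a large matrix product; with a multithreaded BLAS this should
# drop substantially and several cores should light up in Task Manager
n <- 4000
a <- matrix(rnorm(n * n), n, n)
system.time(a %*% a)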
On Windows, multithreading unfortunately is not well supported in R (unless you use OpenMP via Rcpp), and the available socket-based parallelization on Windows systems, e.g. via package parallel, is very inefficient. On POSIX systems things are better, as you can use forking there (package multicore is, I believe, the most efficient one). You could also try package Rdsm for multithreading within a shared-memory model. I've got a version on my GitHub where the unix-only flag has been removed, so it should also work on Windows (earlier, Windows wasn't supported because the dependency bigmemory supposedly didn't work on Windows, but now it seems it does):
library(devtools)
devtools::install_github('tomwenseleers/Rdsm')
library(Rdsm)
