Will an 8-CPU cloud machine run 8x faster than a 1-CPU cloud machine without changes in the code? - multithreading

I am a beginner and, as yet, I have no clue about cloud computing, multithreading, or multiprocessing.
I have a desktop PC with an i7 (4 cores) and I was wondering whether a multi-CPU cloud machine OR an 8+ core machine would run ANY CODE faster than my PC without any changes to the code.
Does the machine distribute the tasks across the several CPUs (or the 8+ cores) by itself, or is it required to adapt the code (multithreading or multiprocessing)?
For the sake of argument, let's say I run a simple loop like the one below:
results = {}
for i in range(10**8):
    results[i] = i**2
This takes about 67 seconds on my PC (I was running something else at the same time, so I'm not sure this is accurate, but my exact timing is irrelevant anyway).
Would the exact same code run faster on a multi-CPU machine or an 8+ core machine compared to a single-CPU 4-core machine?
If it is, in fact, required to make changes, I would appreciate any beginner links for learning about multiprocessing or multithreading.
Thank you for your help.

I'm no expert, but I think it really depends on the platform you're using to write and run your code. Some languages support multithreading/multiprocessing natively, and code in them may run faster as a result, but others might not.
One thing is certain: you can't say that in 100% of cases a machine with more cores/CPUs will run a given piece of code faster than a machine with fewer cores/CPUs.
Hope I helped clear things up.
Edit:
This Medium post regarding multiprocessing/multithreading in Python looks good - Multithreading vs Multiprocessing in Python 🐍
Python multiprocessing for dummies

Code will run faster only when it is written in a parallel fashion. The snippet in the text of your question is not written in parallel, so it won't run any faster on a machine with more cores.
When a parallel program is written, the programmer keeps a target level of parallelization in mind. A sequential program has parallelization level 1. A program with N CPU-intensive threads runs most effectively on N processors (cores). A program with a high parallelization level may even execute slower on a 2-4 core machine than its sequential variant. A parallel rewrite of your loop is sketched below.
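As an illustration only, here is one way the loop from the question could be parallelized with Python's standard-library multiprocessing module. The function name, worker count, and chunking scheme are arbitrary choices for this sketch, and for work this cheap the cost of merging the result dictionaries may eat much of the gain:

from multiprocessing import Pool

def square_range(bounds):
    # compute squares for one half-open slice [start, stop)
    start, stop = bounds
    return {i: i**2 for i in range(start, stop)}

if __name__ == "__main__":
    n = 10**8                  # same size as the loop in the question
    workers = 8                # illustrative; match your core count
    step = n // workers
    chunks = [(k * step, n if k == workers - 1 else (k + 1) * step)
              for k in range(workers)]
    results = {}
    with Pool(workers) as pool:             # one OS process per worker
        for partial in pool.map(square_range, chunks):
            results.update(partial)

Note that the full dictionary of 10**8 entries is very large in memory, exactly as in the sequential version; the sketch only changes who computes the squares, not how many are stored.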

Related

CPU/Threads usage on M1 Pro (Apple Silicon) using openMP

Hope someone knows the answer to this...
I have code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all 8 (or 10) of the cores I have.
I am setting the thread count with export OMP_NUM_THREADS=10, and the code correctly identifies that it's supposed to be running with 10 threads (see the Activity Monitor screenshot below):
[Activity Monitor screenshot: the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.]
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but too long for a comment...
If both LLVM and GCC behave the same, then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads has been created.) I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface to bind threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores, which could mean that statically scheduled loops are inefficient.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...

How to use Rmpi in R on a Linux cluster to increase the cores available to DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available using detectCores(). On my PC I have 4 cores with 2 threads each, so it detects 8 cores, then sends the hydrological model out to a core with different parameter values, and the results are returned to the centre. It does this hundreds or thousands of times, iterating the parameters to try to find an optimum set. Therefore, the more cores available, the faster it works.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads), and if I used one of these it would work two to three times faster than my PC. Great. However, ideally I would spread the code around other servers so I could have access to more cores, with all the info sent back to the master.
Therefore, my question is: how could I include Rmpi in my code to effectively increase the cores available? As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(doParallel)
library(Rmpi)
cl <- makeCluster(mpi.universe.size() - 1, type = 'MPI')
registerDoParallel(cl)
# and eventually...
DEoptim(fn = Genrose, lower = rep(-25, n), upper = rep(25, n),
        control = list(NP = 10 * n, itermax = maxIt, parallelType = 2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option, as shown below. If you don't use mpirun, the workers will all be spawned on the local machine.
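For example, the launch might look like the following (the script name calibrate_model.R is hypothetical; substitute your own):

mpirun -np 1 Rscript calibrate_model.R

The single mpirun rank runs your master script, and the MPI cluster object then spawns the workers across the nodes that MPI knows about.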

How do I ensure accurate results of stress tests?

I have written a script for a project that stress tests the CPU, the VM and the I/O while running vmstat, iostat and sar. The scripts each run for 30 seconds. My tutor has asked me, however, to ensure that the results are accurate. How can I ever be sure? Surely I just take the machine's word for it after running a few tests? The tests have since been run for 60 seconds each, and so have the commands, to try to ensure a fair test, but how can I be sure they are accurate enough to address my tutor's concerns? Any ideas?
The systems are server versions of Ubuntu 12.04, Debian 7 and Suse 11
There is no way for us to know what your tutor's concerns are, so you should ask him!
"Accuracy" usually means that your test results should not be offset by a factor you're not taking into account, like some CPU features being disabled or unused, differences in software configuration, etc.
What is it that you are evaluating, anyway? Evaluating CPU performance is not the same as evaluating a particular hardware system, which is different again if you consider the software as well. Basically, you need to eliminate all differences that are not part of your evaluation, and make sure the rest of the configuration is representative (e.g. installing a modern OS which supports all the features the CPU provides).
And remember that in the end you will always take the machine's word for it; there's just no other way. All you can say is that you have considered all the factors you're aware of, and hope that the factors remaining unknown don't have a big influence.
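One common sanity check (a general technique, not something the answer above prescribes) is to repeat each test many times and look at the spread of the results; a small standard deviation relative to the mean suggests the numbers are at least repeatable. A minimal Python sketch, assuming your stress test is wrapped in a hypothetical script named stress_cpu.sh:

import statistics
import subprocess
import time

runs = []
for _ in range(10):                      # repeat the test 10 times
    start = time.perf_counter()
    subprocess.run(["./stress_cpu.sh"], check=True)
    runs.append(time.perf_counter() - start)

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)
print(f"mean {mean:.2f}s, stdev {stdev:.2f}s ({100 * stdev / mean:.1f}% of mean)")

Repeatability is not the same thing as accuracy, but a large spread is a clear sign that something uncontrolled is influencing the results.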

Profiling very long running tasks

How can you profile a very long-running script that spawns lots of other processes?
We have a job that takes a long time to run - 11 or more hours, sometimes more than 17 - so it runs on an Amazon EC2 instance.
(It is doing Cufflinks DNA alignment and stuff.)
The job executes lots of processes, scripts, utilities and such.
How can we profile it and determine which component parts of the job take the longest time?
A simple CPU utilisation figure per process per second would probably suffice. How can we obtain it?
There are many solutions to your question:
munin is a great monitoring tool that can scan almost everything in your system and make nice graphs of it :). It is very easy to install and use.
atop could be a simple solution; it can scan CPU, memory and disk regularly, and you can store all that information in files (the -w option), after which you'll have to analyse those files to detect the bottleneck.
sar can scan more or less everything on your system, but it is a little harder to interpret (you'll have to make the graphs yourself, with RRDtool for example).
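If you'd rather roll your own per-process sampler, here is a minimal Python sketch using the third-party psutil package (pip install psutil); the one-second interval and top-5 cutoff are arbitrary choices for the sketch:

import time
import psutil

def sample(top_n=5, interval=1.0):
    procs = list(psutil.process_iter())
    for p in procs:
        try:
            p.cpu_percent(None)           # prime the per-process counters
        except psutil.Error:
            pass
    time.sleep(interval)                  # measure over one interval
    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.pid, p.name()))
        except psutil.Error:              # the process may have exited
            pass
    for cpu, pid, name in sorted(usage, reverse=True)[:top_n]:
        print(f"{cpu:6.1f}%  {pid:>7}  {name}")

while True:                               # Ctrl-C to stop
    sample()

Logging this output with timestamps over the whole 11+ hour run would let you see which of the spawned processes dominates at each stage of the job.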

Matlab 2011a Use all Cores Available on 64 bit Linux?

Hi, I've looked online but I can't seem to find the answer: do I need to do anything to make MATLAB use all cores? From what I understand, multithreading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest sit at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB seems to utilise both cores at 100% - not all of the time, but in a sufficient number of cases. On the 64-bit, 4-core PC this never happens.
Do I have to do anything on 64-bit to get MATLAB to use all the cores whenever possible? I had to do some custom linking after install, as MATLAB wasn't finding the libraries (e.g. libc.so.6) because it wasn't looking in the correct places.
As standard, since the latest release, you can use up to 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores can be accessed through the MATLAB Distributed Computing Server, where you actually pay per number of workers.
To make MATLAB use your multiple cores, you have to run
matlabpool open
And of course it works better if you actually have multithreaded code (like using the spmd function or parfor loops).
More info at the Matlab homepage
MATLAB has only a single thread for computation.
That said, multiple threads will be created for certain functions which use the multithreaded features of the BLAS libraries that MATLAB uses underneath.
Thus, you will only gain a 'multithreaded' advantage if you are calling functions which use these multithreaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now, whether all your cores are used depends on your OS. The OS has to load-balance your threads across the cores. One CANNOT set affinities for threads from within MATLAB. One can, however, set worker MATLAB processes to have affinities to cores from within the Parallel Computing Toolbox.
However, you could always try setting the affinity of the MATLAB process to all your processors manually, following the details available at the following link for Linux.
Windows users can simply right click on the process in the task manager and set affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.
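If you want to try the manual route on Linux, here is a hedged sketch using the third-party Python package psutil (the process-name match on 'MATLAB' is an assumption; adjust it to however the process appears on your system):

import psutil

# pin every process whose name mentions MATLAB to all logical CPUs
all_cpus = list(range(psutil.cpu_count()))
for proc in psutil.process_iter(['pid', 'name']):
    name = proc.info['name'] or ''
    if 'MATLAB' in name:
        proc.cpu_affinity(all_cpus)       # Linux-only psutil call
        print(f"pinned PID {proc.info['pid']} ({name}) to all cores")

This is equivalent to what taskset does from the shell; as noted above, it constrains which cores the process may run on, but it cannot force MATLAB to create more computation threads.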
