How do I isolate 3 cores of a quadcore from Linux and use them for Halcon, exclusively? - linux

How do I isolate 3 cores of a quadcore from Linux and use them for Halcon, exclusively?
Here is what I've tried so far:
I configured Linux to use only core 0 of the quad-core CPU via the boot option isolcpus=1,2,3.
I started my multi-threaded C++ program and let one thread configure Halcon with a few HSystem::SetSystem() calls; this is the Halcon main thread. By default, the "thread_pool" option is set to "true" (but I also tried "false"). Importantly, the run function of this Halcon main thread first calls sched_setaffinity(getpid(), sizeof(set), &set) with a cpu_set_t set to which I added cores 1, 2 and 3 via CPU_SET(index, &set).
With that in place, reading a QR matrix code in "Maximum" mode should start several threads on cores 1, 2 and 3. But it doesn't work: the reader runs only on core 1 at almost 90% CPU load, while cores 2 and 3 stay at 0% (observed with top -H). It looks to me as if Halcon is missing some magic option to use all three cores.
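For reference, the affinity setup described above boils down to roughly the following minimal sketch (assuming sched_setaffinity is the intended call; the core indices match the isolcpus line, and the error handling is illustrative):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <unistd.h>     // getpid
#include <cstdio>       // perror

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);   // cores 1-3 were isolated via isolcpus=1,2,3
    CPU_SET(2, &set);
    CPU_SET(3, &set);
    if (sched_setaffinity(getpid(), sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    // ... configure Halcon (HSystem::SetSystem) and start the worker threads here ...
    return 0;
}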

Are you 100% sure this should run in parallel?
Could you try it with a different code type (ECC200)? According to https://www.mvtec.com/products/halcon/documentation/release-notes-1911-0/ (Speedup section), the ECC200 reader is parallelized internally by HALCON. If that reader runs in parallel on your system and the QR code reader doesn't, I would assume the QR code reader simply isn't parallelized by HALCON.
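Independent of the code type, it may also be worth double-checking that HALCON's automatic operator parallelization is switched on before comparing the two readers. A minimal sketch using the HSystem::SetSystem call already shown in the question; the parameter names "parallelize_operators" and "thread_num" are assumptions to verify against the documentation of your HALCON version:

using namespace HalconCpp;
// Assumed set_system parameters - check them against your HALCON release.
HSystem::SetSystem("parallelize_operators", "true");  // enable automatic operator parallelization
HSystem::SetSystem("thread_num", 4);                  // threads HALCON may use for parallelized operators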

Related

CPU/thread usage on M1 Pro (Apple Silicon) using OpenMP

I hope someone knows the answer to this...
I have code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all of the 8 or 10 cores I have.
I am setting the thread count with export OMP_NUM_THREADS=10, and the code correctly reports that it is supposed to run with 10 threads (see the Activity Monitor screenshot below):
Activity Monitor Print Screen
The screenshot shows that the code is compiled for Apple Silicon and uses 10 threads, but not much of the available CPU.
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but too long for a comment...
If both LLVM and GCC behave the same, then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads has been created.) I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface for binding threads to logical CPUs.
On top of that, there's the issue of having some fast and some slow cores, which could mean that statically scheduled loops are inefficient.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...
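To illustrate the scheduling point: on a chip with fast and slow cores, a dynamically scheduled loop lets the performance cores take more iterations than the efficiency cores. A minimal, self-contained sketch (the loop body is only a stand-in for real work):

#include <omp.h>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 1000000;
    double sum = 0.0;
    // schedule(dynamic) hands out chunks on demand instead of splitting the
    // iteration space evenly, so fast cores are not left waiting on slow ones.
    #pragma omp parallel for schedule(dynamic, 1024) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += std::sin(i) * std::cos(i);   // placeholder workload
    }
    std::printf("max threads: %d, sum: %f\n", omp_get_max_threads(), sum);
    return 0;
}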

Will an 8-CPU cloud machine run 8x faster than a 1-CPU one without changes to the code?

I am a beginner and have no clue yet about cloud computing, multithreading, or multiprocessing.
I have a desktop PC with an i7 (4 cores) and I was wondering whether a multi-CPU cloud machine or an 8+ core machine would run ANY code faster than my PC without any changes to the code.
Does the machine distribute the tasks over the several CPUs (or the 8+ cores) by itself, or is it required to adapt the code (multithreading or multiprocessing)?
For the sake of argument, let's say I run a simple loop like the one below:
results = {}
for i in range(10**8):
    results[i] = i**2
This takes about 67 sec on my PC (I was running something else at the same time so I'm not sure this is accurate but my timing is irrelevant anyway).
Would the exact same code be faster on a multi-CPU machine or an 8+ core machine compared to a single-CPU 4-core machine?
If it is, in fact, required to make changes, I would appreciate any beginner links to learn about multiprocess or multithread.
Thank you for your help.
I'm no expert, but I think it really depends on the platform you're using to write and run your code. Some languages support multithreading/multiprocessing natively, and there the code may run faster, but others might not.
One thing is certain: you can't say that in 100% of cases a machine with more cores/CPUs will run a given piece of code faster than a machine with fewer cores/CPUs.
Hope I helped clear things up.
Edit:
This Medium post on multiprocessing/multithreading in Python looks good: Multithreading vs Multiprocessing in Python 🐍
Python multiprocessing for dummies
Code will run faster only when it is written in a parallel fashion. The snippet in your question is not written in parallel, so it won't run any faster.
When a parallel program is written, the programmer keeps the target level of parallelization in mind. A sequential program has a parallelization level of 1. A program with N CPU-intensive threads runs most effectively on N processors (cores). A program with a high parallelization level may even execute more slowly on a 2-4 core machine than the sequential variant.
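For a concrete picture of what "written in a parallel fashion" means, here is the squares loop from the question rewritten with N worker threads. The sketch is in C++ (std::thread) because the point is language-independent; in Python, the analogous tool is the multiprocessing module covered by the links in the previous answer:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 100000000;   // 10**8, as in the question (~800 MB of results)
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::uint64_t> results(n);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&results, n, nthreads, t] {
            // Each worker squares a contiguous slice of the index range,
            // so there is no contention on shared state.
            const std::size_t begin = n * t / nthreads;
            const std::size_t end = n * (t + 1) / nthreads;
            for (std::size_t i = begin; i < end; ++i)
                results[i] = static_cast<std::uint64_t>(i) * i;
        });
    }
    for (auto& w : workers) w.join();
    std::printf("computed %zu squares on %u threads\n", results.size(), nthreads);
    return 0;
}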

GNU make - how to simulate multiple simultaneous jobs

I know that to let make run multiple jobs in parallel, I use the command make --jobs=X, where X is usually equal to the number of cores (or twice that, or whatever).
I am debugging a makefile - it actually consists of many makefiles - so that it works with the --jobs=X option. Here's an example of why it currently doesn't:
T1:
    mkdir D1
    output_makefile.bat > ./D1/makefile
T2:
    cd D1
    make
Executing this with --jobs=X leads to a race condition, because T1 is not specified as a dependency of T2 and eventually T2 will get built ahead of T1; most of the bugs I need to fix are of this variety.
If X in --jobs=X is greater than the number of ?logical or physical? cores, the number of jobs executed simultaneously will be capped at the number of ?logical or physical? cores.
My machine has 4 physical/8 logical cores but the build machine that will be running our builds will have as many as 64 cores.
So I'm concerned that just because my makefile (a) builds the final output correctly (b) runs without errors on my machine with --jobs=4 does not mean it'll run correctly and without errors with --jobs=64 on a 64-core machine.
Is there a tool that will simulate make executing in an environment that has more cores than the physical machine?
What about creating a virtual machine with 64 cores and run it on my 4-core machine; is that even allowed by VMPlayer?
UPDATE 1
I realized that my understanding of make was incorrect: the number of job slots make creates equals the --jobs=N argument, not the number of cores or threads my PC has.
However, that by itself doesn't tell me whether make really executes those jobs in parallel, relying on OS task-switching when there are fewer cores than jobs.
I need to confirm that ALL the jobs are being executed in parallel, as opposed to merely being 'queued up' and waiting for the actively executing jobs to finish.
So I created a makefile with 16 targets - more than the number of threads or cores I have - and each recipe merely echoes the name of the target a configurable number of times.
make.mk
all: 1 2 3 4 ... 14 15 16

<target X>:
    @loop_output.bat $@

loop_output.bat
@FOR /L %%G IN (1,1,2048) DO @echo (%1-%%G)
The output will be something like
(16-1) <-- Job 16
(6-1400)
(12-334)
(1-1616) <-- Job 1
(4-1661)
(15-113)
(11-632)
(2-1557)
(10-485)
(7-1234)
(5-1530)
The format is Job#X-Echo#Y. The fact that I see (1-1616) after (16-1) means that make is indeed executing target 16 at the same time as target 1.
The alternative would be that make finishes the first batch of jobs (one per core/thread) and only then takes another chunk of the same size, but that's not what's happening.
See my "UPDATE 1":
No special software or make tricks are required. Regardless of the number of cores you have, make will execute the jobs truly in parallel by spawning multiple processes and letting the OS multitask them just like any other processes.
Windows PITFALL #1: The version of GNU Make available on SourceForge is 3.81, which does NOT have the ability to execute with --jobs at all. You'll have to download version 4.2 and build it.
Windows PITFALL #2: The make 4.2 source will fail to build because of a header that VS2008 (and older) doesn't have. The fix is easy: replace the invocation of the "symbol not found" with its macro equivalent; it should be obvious what I'm talking about when you try to build it. (I forgot what the missing symbol was.)

How to use Rmpi in R on a Linux cluster to increase the cores available to DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available with detectCores(). On my PC I have 4 cores with 2 threads each, so it detects 8 cores and then sends the hydrological model out to each core with different parameter values, and the results are returned to the master. It does this hundreds or thousands of times, iterating the parameters to try to find an optimum set. Therefore, the more cores available, the faster it works.
I am at a university and have access to a Linux compute cluster. It has servers with up to 12 cores (i.e. not threads), and if I used one of these it would work two to three times faster than my PC. Great. However, ideally I would spread the code across other servers so I could have access to more cores, with all the info sent back to the master.
Therefore, my question is: how could I include Rmpi in my code to effectively increase the number of cores available? As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(doParallel)
library(Rmpi)
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually...
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
        control=list(NP=10*n, itermax=maxIt, parallelType=2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option. If you don't use mpirun, the workers will all be spawned on the local machine.
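For example, if the script above were saved as calibrate_deoptim.R (the file name is only a placeholder), it could be launched as:
mpirun -np 1 Rscript calibrate_deoptim.R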

MATLAB 2011a: use all cores available on 64-bit Linux?

Hi, I've looked online but I can't seem to find an answer: do I need to do anything to make MATLAB use all cores? From what I understand, multi-threading has been supported since 2007. On my machine MATLAB only uses one core at 100% and the rest hang at ~2%. I'm using 64-bit Linux (Mint 12). On my other computer, which has only 2 cores and is 32-bit, MATLAB seems to utilize both cores at 100%; not all of the time, but in a sufficient number of cases. On the 64-bit, 4-core PC this never happens.
Do I have to do anything on 64-bit to get MATLAB to use all the cores whenever possible? I had to do some custom linking after install, as MATLAB wasn't finding the libraries (e.g. libc.so.6) because it wasn't looking in the correct places.
As standard, since the latest release, you can use up to 12 cores using the Parallel Computing Toolbox. Without this toolbox, I guess you're out of luck. Any additional cores can be accessed via the MATLAB Distributed Computing Server, where you actually pay per number of workers.
To make MATLAB use your multiple cores, you have to run
matlabpool open
And it of course works better if you actually have multithreaded code (for example using the spmd function or parfor loops).
More info at the MATLAB homepage.
MATLAB has only a single thread for computation.
That said, multiple threads are created for certain functions that use the multithreaded features of the BLAS libraries it relies on underneath.
Thus, you will only gain a 'multi-threaded' advantage if you are calling functions that use these multithreaded BLAS libraries.
This link has information on the list of functions which are multithreaded.
Now, whether all your cores get used depends on your OS, which has to load-balance your threads across the cores. You CANNOT set affinities for threads from within MATLAB. You can, however, give worker MATLAB processes affinities to cores from within the Parallel Computing Toolbox.
However, you could always try setting the affinity of the MATLAB process to all your processors manually, following the details available at the following link for Linux.
Windows users can simply right-click on the process in Task Manager and set the affinity.
My understanding is that this is only a request to the OS and is not a hard binding rule that the OS must adhere to.
