GNU make - how to simulate multiple simultaneous jobs - multithreading

I know that to let make run jobs in parallel, I use the command make --jobs=X, where X is usually equal to the number of cores (or twice that, or whatever).
I am debugging a makefile - actually, a build that consists of many makefiles - to make it work with the --jobs=X option. Here's an example of why it currently doesn't:
T1:
    mkdir D1
    output_makefile.bat > ./D1/makefile

T2:
    cd D1
    make
Executing this with --jobs=X will lead to a race condition because T1 is not specified as a dependency of T2 and eventually T2 will get built ahead of T1; most of the bugs I need to fix are of this variety.
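For this class of bug the fix is simply to declare the ordering, so it holds no matter how many job slots are in use; a minimal sketch:
T2: T1
    cd D1 && $(MAKE)
(Each recipe line also runs in its own shell, which is why the cd and the make have to be chained with && on a single line.)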
My assumption was that if X in --jobs=X is greater than the number of cores (logical or physical, I'm not sure which), the number of jobs executed simultaneously would be capped at that number of cores.
My machine has 4 physical/8 logical cores but the build machine that will be running our builds will have as many as 64 cores.
So I'm concerned that the fact that my makefile (a) builds the final output correctly and (b) runs without errors on my machine with --jobs=4 does not mean it'll run correctly and without errors with --jobs=64 on a 64-core machine.
Is there a tool that will simulate make executing in an environment that has more cores than the physical machine?
What about creating a virtual machine with 64 cores and running it on my 4-core machine; does VMPlayer even allow that?
UPDATE 1
I realized that my understanding of make was incorrect: the number of job slots make creates is equal to the --jobs=N argument and not the number of cores or threads my PC has.
However, this by itself doesn't prove that make actually executes those jobs in parallel (relying on OS task-switching) when I have fewer cores than jobs.
I need to confirm that ALL the jobs are being executed in parallel vs merely 'queued up' and waiting for the actively executing jobs to finish.
So I created a makefile with 16 targets - more than the number of threads or cores I have - where each recipe merely echoes the name of its target a configurable number of times.
make.mk
all: 1 2 3 4 ... 14 15 16
<target X>:
    @loop_output.bat $@

loop_output.bat
@FOR /L %%G IN (1,1,2048) DO @echo (%1-%%G)
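(For reference, a rough POSIX-shell equivalent of loop_output.bat, in case you want to reproduce the experiment on Linux; the script name is illustrative and it takes the target name as its first argument:)
#!/bin/sh
# print "(target-counter)" 2048 times
for G in $(seq 1 2048); do echo "($1-$G)"; done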
The output will be something like
(16-1) <-- Job 16
(6-1400)
(12-334)
(1-1616) <-- Job 1
(4-1661)
(15-113)
(11-632)
(2-1557)
(10-485)
(7-1234)
(5-1530)
The format is Job#X-Echo#Y. The fact that I see (1-1616) after (16-1) means that make is indeed executing target 16 at the same time as target 1.
The alternative would be for make to finish the first batch of jobs (as many as there are cores/threads) and only then take on another batch of the same size, but that's not what's happening.

See my "UPDATE 1":
No special software or make tricks are required. Regardless of the number of cores you have, make will execute the jobs truly in parallel by spawning multiple processes and letting the OS multitask them just like any other processes.
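A quick way to convince yourself of this is a makefile whose recipes just sleep (a sketch, assuming GNU make and a POSIX shell; the file name check.mk is illustrative):
check.mk
# 16 dummy targets; each one sleeps for a second
all: $(addsuffix .t,$(shell seq 1 16))
%.t:
    @sleep 1; echo $@
Running time make -f check.mk --jobs=16 should take roughly 1 second rather than 16, even on a 4-core machine, because all 16 sleeps run concurrently.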
Windows PITFALL #1: The version of GNU Make available on SourceForge is 3.81, which does NOT support parallel execution via --jobs at all. You'll have to download the 4.2 source and build it yourself.
Windows PITFALL #2: The make 4.2 source will fail to build because of a header that VS2008 (and older) doesn't have. The fix is easy: replace the invocation of the "symbol not found" with its macro equivalent; it should be obvious what I'm talking about when you try to build it. (I forget what the missing symbol was.)

Related

Monitor the CPU usage of an OpenFOAM simulation running on a slurm job

I'm running an OpenFOAM simulation on a cluster. I have used the Scotch decomposition method and my decomposeParDict looks like this:
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}
numberOfSubdomains 6;
method scotch;
checkMesh and decomposePar finish with no issues. I have assigned 6 nodes to the job through Slurm with
srun -N6 -l sonicFoam
and the solver runs smoothly without any errors.
The issue is that the solution speed is not improved in comparison to the non-parallel simulation I ran before. I want to monitor the CPU usage to see whether all 6 of the nodes I have assigned are similarly loaded. The squeue --user=foobar command returns the job number and the list of nodes assigned (NODELIST(REASON)), which looks like this:
foo,bar[061-065]
According to the sinfo command, these nodes are in both the debug and main* PARTITIONs (and I have absolutely no idea what that means!).
This post says that you can use the sacct or sstat commands to monitor CPU time and memory usage of a slurm job. But when I run
sacct --format="CPUTime,MaxRSS"
it gives me:
CPUTime MaxRSS
---------- ----------
00:00:00
00:00:00
00:07:36
00:00:56
00:00:26
15:26:24
which I cannot interpret. And when I specify the job number with
sacct --job=<jobNumber> --format="UserCPU"
The return is empty. So my questions are:
Is my simulation loading all nodes or is it running on one or two and the rest are free?
Am I running the right commands? If yes, what do those numbers mean, and how do they represent the CPU usage per node?
If not, what are the right --format="..." fields for sacct and/or sstat (or maybe other Slurm commands) to get the CPU usage/load?
P.S. 1: I compiled OpenFOAM following the official instructions. I did not do anything with OpenMPI or its mpicc compiler, for that matter.
P.S. 2: For those of you who might end up here: maybe I'm running the wrong command. Apparently one can first allocate some resources with:
srun -N 1 --ntasks-per-node=7 --pty bash
where 7 is the number of cores you want and bash is just the shell to start, and then run the solver with:
mpirun -np 7 sonicFoam -parallel -fileHandler uncollated
I'm not sure yet, though.
You can use
sacct --format='jobid,AveCPU,MinCPU,MinCPUTask,MinCPUNode'
to check whether all CPUs have been active. Compare AveCPU (average CPU time of all tasks in job) with MinCPU (minimum CPU time of all tasks in job). If they are equal, all 6 tasks (you requested 6 nodes, with, implicitly, 1 task per node) worked equally. If they are not equal, or even MinCPU is zero, then some tasks have been doing nothing.
But in your case, I believe you will observe that all tasks have been working hard, but they were all doing the same thing.
Besides the remark concerning the -parallel flag by @timdykes, you also must be aware that launching an MPI job with srun requires that OpenMPI was compiled with Slurm support. During your installation of OpenFOAM, it installed its own version of OpenMPI; if the file /usr/include/slurm/slurm.h or /usr/include/slurm.h exists, then Slurm support was probably compiled in. But the safest route is probably to use mpirun.
But to do that, you will have to first request an allocation from Slurm with either sbatch or salloc.
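For example, a minimal submission script might look like this (a sketch; the script name and job name are illustrative):
#!/bin/bash
#SBATCH --job-name=sonicFoam
#SBATCH --ntasks=6
# one MPI rank per Slurm task
mpirun -np 6 sonicFoam -parallel
submitted with sbatch run_sonicFoam.sh.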
Have you tried running with the '-parallel' argument? All of the OpenFOAM examples online use this argument when running a parallel job; one example is the official guide for running in parallel.
srun -N $NTASKS -l sonicFoam -parallel
As an aside - I saw you built OpenFOAM yourself; have you checked whether the cluster admins provide a module for it? You can usually run module avail to see a list of the available modules, and then module load moduleName if there is an existing OpenFOAM module. This is useful, as you can probably trust it's been built with all the right options, and it will automatically set up your $PATH etc.

Very slow RedHawk component builds

We have some components that build 15+ object files before linking them. We find that if we modify a .h file used by many or all of them, builds are VERY slow; some of our components take over an hour to build. It appears that RedHawk issues a make -j, or a make -j with a large number, so that we have 15+ compiles running simultaneously. This overwhelms even 4 GB of RAM and results in excessive swapping and VERY slow execution (the entire CPU is nearly locked up; other windows are also dead until it completes). If we use a simple make from a shell in the component, it completes in 5 minutes. Is there a way to change RedHawk to issue a simple make, or a make with an adjustable maximum number of processes?
If you're referring to how the IDE invokes the build, you can check the build console. I'm pretty sure it calls either the top-level build.sh or the build.sh within your implementation's folder. In either case, you can modify that file to perform the build however you'd like.
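If the build does boil down to a make invocation, note that GNU make can bound parallelism in two ways: -jN caps the number of simultaneous jobs, and -l LOAD prevents make from starting new jobs while the system load average is at or above LOAD. A sketch of what you might put in the modified build.sh (the values are illustrative):
# at most 4 compiles at once; don't start new jobs above load 3.5
make -j4 -l 3.5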

How to use Rmpi in R on linux Cluster to increase cores available with DEoptim?

I am using code developed in R to calibrate a hydrological model with 8 parameters using DEoptim (a function that aims to minimise an objective function). The DEoptim code uses the 'parallel' package to detect the number of cores available via detectCores(). On my PC I have 4 cores with 2 threads each, so it detects 8 cores, then sends the hydrological model out to each core with different parameter values, and the results are returned to the master. It does this hundreds or thousands of times, iterating on the parameters to try to find an optimum set. Therefore, the more cores available, the faster it works.
I am at a university and have access to a Linux compute cluster. They have servers with up to 12 cores (i.e. not threads), and if I used one it would work two to three times faster than my PC. Great. Ideally, however, I would spread the code across other servers to have access to more cores, with all the info sent back to the master.
Therefore, my question is: how can I include Rmpi in my code to effectively increase the number of cores available? As you can probably tell, I am quite new to using clusters.
Many thanks, Antony
If you want to execute DEoptim on multiple nodes of a Linux cluster, I believe you'll need to use foreach by specifying parallelType=2 in the control argument. You can use either the doMPI parallel backend or the doParallel backend with an MPI cluster object. For example:
library(doParallel)
library(Rmpi)
cl <- makeCluster(mpi.universe.size()-1, type='MPI')
registerDoParallel(cl)
# and eventually...
DEoptim(fn=Genrose, lower=rep(-25, n), upper=rep(25, n),
        control=list(NP=10*n, itermax=maxIt, parallelType=2))
You'll need to have the snow package installed in addition to the others. Also, make sure that you execute your script with mpirun using the -np 1 option. If you don't use mpirun, the workers will all be spawned on the local machine.
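For example, assuming the script above is saved as deoptim_mpi.R (the file name is illustrative):
mpirun -np 1 Rscript deoptim_mpi.R
The single process launched by mpirun spawns the MPI workers itself, which is why -np 1 is correct here.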

SGE/UGE/etc..standardized way to submit OpenMP jobs to multiple cores?

I'm looking for a way to submit an OpenMP job to a Grid Engine scheduler, while specifying the number of cores it should run on. Something equivalent to LSF's -n option, or PBS's -l nodes=[count] option.
When I search on this, I see a bunch of answers specifying syntax like "-pe threaded [number of cores]". In those answers, there is no mention of having to create a parallel environment called "threaded". But when I try this syntax, it fails, saying that the requested parallel environment "threaded" does not exist. And when I type "qconf -spl", the only result I get is "make". So: should this "threaded" parallel environment exist by default, or is it something that has to be manually created on the cluster?
If it has to be manually created, is there any other syntax for submitting jobs to multiple cores that does not rely on configurable naming on a cluster? This is for a third-party program submitting to a cluster, so I don't want to rely on the client not only having created this PE, but also having named it the same, etc. I was hoping the -l option might offer something, but I haven't found any permutation of it that achieves this.
If you get only "make" as possible parallel environment then this means that there are no parallel environments set on your cluster.
There are two solutions to your problem, depending on these 2 situations:
A) you have root/admin access to the cluster
B) you don't
In case B, well, ask your administrator to create a parallel environment. In case A, you have to create the parallel environment yourself. To create a new parallel environment, you must type (this requires root/admin privilege):
qconf -ap <pe_name>
The default editor will then start with a default pe_conf file that you must edit. If you only need to set up an OpenMP parallel environment, you can use these options:
pe_name smp
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
and for a MPI parallel environment:
pe_name mpi
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args /opt/sge/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary TRUE
As you can see, in the latter case you point SGE to the right initialization and shutdown scripts for your MPI configuration. In the first case, you simply point to /bin/true.
The allocation_rule differs between the two examples: $fill_up means that SGE will fill any CPU it can find with parts of the MPI job, while for the smp configuration you simply allocate the correct number of slots on the same machine, i.e. $pe_slots.
If you use MPI, your nodes should be connected by a high-performance interconnect such as InfiniBand; otherwise your jobs will spend much more time communicating than calculating.
EDIT:
Oh, by the way: the correct syntax to submit a job with a parallel environment is:
qsub -pe <pe_name> <nb_slots>
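For instance, to request 8 cores with the smp environment defined above (a sketch; the script and program names are illustrative, and $NSLOTS is the variable SGE sets to the number of slots actually granted):
qsub -pe smp 8 run_job.sh
where run_job.sh contains:
#!/bin/bash
# match the OpenMP thread count to the slots SGE granted
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program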
FINAL EDIT:
The final answer to the question comes in the comments below. In practice, SGE cannot handle multi-threaded jobs unless a parallel environment (PE) is set on the cluster. If you do not have admin privileges on the cluster, you must either guess the correct PE to use, listing them with qconf -spl and inspecting each with qconf -sp <pe_name>, or add an option to your software that allows users to specify the PE to use.
Otherwise, i.e. if no PEs are available on the cluster, you cannot use a parallel version of your software.
See the comments for further information.

Is tesseract 3.00 multi-threaded?

I have read some other posts suggesting that multi-threading support would be added in 3.00, but I'm not sure whether it actually made it into the 3.00 release.
Other than multi-threading, is running multiple processes of tesseract a feasible option to achieve concurrency?
Thanks.
One thing I've done is invoke GNU Parallel to run as many instances of Tesseract as possible on a multi-core system, for multi-page documents converted to single-page images.
It's a short program, easily compiled on most Linux distros (I'm using OpenSuSE 11.4).
Here's the command line that I use:
/usr/local/bin/parallel -j 4 \
/usr/local/bin/tesseract -psm 1 -l eng {} {.} \
::: /tmp/tmp/*.jpg
The -j 4 tells parallel to use all four CPU cores that I have on a server.
If you run this and do a 'top' in another terminal, you'll see up to four processes at a time until it has rummaged through all of the JPGs in the specified directory.
Your load should never exceed the number of CPU cores in your system (if you run Linux).
Here's the link to GNU Parallel:
http://www.gnu.org/software/parallel/
No. You can browse the code at http://code.google.com/p/tesseract-ocr/source/browse/ - none of the current code in trunk seems to make use of multi-threading (at least looking through the base classes, API, and neural networking classes).
I used parallel as well, on CentOS, this way:
ls | parallel --gnu "tesseract {} {.}"
I used the --gnu option as suggested from the stdout log which was:
parallel: Warning: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.
The {} and {.} are placeholders for parallel: in this case you're telling tesseract to use the listed file as its first argument, and the same file name without the extension as its second argument. Everything is well explained in the parallel man pages.
Now, if you have, say, three .tif files, and you run tesseract three times, once per file, summing up the execution times, and then you run the command above with time in front of parallel, you can easily check the speedup.
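Something like this, for instance (a sketch; the file names are illustrative):
# serial: time each file one after the other
time for f in a.tif b.tif c.tif; do tesseract "$f" "${f%.tif}"; done
# parallel: the same work spread across cores
time parallel --gnu "tesseract {} {.}" ::: a.tif b.tif c.tif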
