I'm trying to understand what the stress command actually does in Linux, in particular the -c option. My background is physics, so I'm struggling with some of the concepts.
Does stress -c 3 launch 3 processes that consume 100% of 3 bound CPU cores (for example cores 0, 1, and 3)? The output of htop is confusing, since I don't see 3 CPU cores at 100% all the time. Note: by "bound" I mean that these processes cannot run on any of the other CPU cores.
For example, after running stress -c 3, sometimes I see this (which makes sense to me):
But most of the time I see something like this (which doesn't make sense to me, because there aren't 3 CPU cores at 100%):
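To see whether the workers are actually pinned anywhere, this is the kind of check I would run (just a sketch; it assumes pgrep and taskset are available and that the workers show up under the process name stress):
$ stress -c 3 &
$ for pid in $(pgrep -x stress); do taskset -cp $pid; done
If the reported affinity list covers every CPU, the workers are free to migrate between cores rather than being bound to three specific ones.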
My x64 machine is running CentOS 8 and has 20 CPUs.
Out of the 20 CPUs, only 2 (0 and 1) are available for normal use; the remaining 18 CPUs (2-19) are isolated.
When I try taskset -c 2-19 make -j 15, for example, only 1 CPU out of the 18 seems to be used.
It looks like the CPU affinity specified by taskset is not applied to the child processes.
Is it possible to let make and all of its child processes fully utilize the isolated CPUs?
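A check along these lines should show what affinity the child processes actually end up with (a rough sketch; it assumes pgrep and taskset are installed and that the build is still running):
$ taskset -c 2-19 make -j 15 &
$ for pid in $(pgrep -P $(pgrep -nx make)); do taskset -cp $pid; done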
I run an MPI Fortran program on a cluster equipped with the LSF job scheduler.
My program also calls MKL functions.
I know there is a subroutine that sets the number of MKL threads (for example, setting it to 2):
call mkl_set_num_threads(2)
At first, I thought this set the total number of threads for the program, but from my tests it seems to set the thread count for each MPI process.
So if I submit a job like
bsub -n 2 mpiexec.hydra ./a.out
and then ssh into the node and run top, I find that it actually uses 4 cores: each MPI process uses 2 threads.
But this is not allowed on my cluster, because the job uses more CPU resources than requested and will be killed while it runs.
Sometimes the number of CPU cores is not evenly divisible by the number of MPI processes. For example, if a node has 24 cores and I have 7 MPI processes to run, I would like to submit like this:
bsub -n 24 mpiexec.hydra -n 7 ./a.out
Since MKL has a dynamic threading facility, it will automatically distribute the cores among the 7 MPI tasks and use all the CPUs efficiently.
But now suppose the cluster is fairly full and I can only request 12 cores:
bsub -n 12 mpiexec.hydra -n 7 ./a.out
How do I then set MKL to use exactly 12 threads in total across the 7 MPI tasks, so the job is not killed by the system and yet stays as efficient as possible?
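One option I am considering (only a sketch; the pieces I'm relying on are the standard MKL_NUM_THREADS environment variable and the -genv flag of mpiexec.hydra) is to cap the per-rank thread count so that ranks times threads never exceeds the reservation:
bsub -n 12 mpiexec.hydra -n 7 -genv MKL_NUM_THREADS 1 ./a.out
With 7 ranks and 1 thread each this stays within the 12 requested cores, but it also leaves 5 of them idle, which is exactly the efficiency loss I'd like to avoid.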
We have just started using Slurm to manage our GPUs (currently just 2). We use Ubuntu 14.04 and the slurm-llnl package. I have configured gres.conf and srun works.
The problem is that if I run two jobs with --gres=gpu:1, the two GPUs are successfully allocated and the jobs start running; I then expect to be able to run more jobs (in addition to the 2 GPU jobs) without --gres=gpu:1 (i.e. jobs that only use CPU and RAM), but this is not possible.
The error message says that the required resources could not be allocated (even though there are 24 CPU cores).
This is my gres.conf:
Name=gpu Type=titanx File=/dev/nvidia0
Name=gpu Type=titanx File=/dev/nvidia1
NodeName=ubuntu Name=gpu Type=titanx File=/dev/nvidia[0-1]
I appreciate any help. Thank you.
Make sure that SelectType in your configuration is set to select/cons_res with SelectTypeParameters=CR_CPU or CR_Core, and that the Shared option of the partition is not set to EXCLUSIVE. Otherwise Slurm allocates full nodes to jobs.
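For illustration, a minimal slurm.conf fragment along those lines could look like this (a sketch only; the node name is taken from your gres.conf and the partition name is made up, so adapt them to your site):
SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=ubuntu CPUs=24 Gres=gpu:titanx:2 State=UNKNOWN
PartitionName=main Nodes=ubuntu Default=YES Shared=YES State=UP
With consumable resources enabled, the two GPU jobs only consume the cores they ask for, and CPU-only jobs can be scheduled on the remaining cores.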
How do I assign 2 MPI processes per core?
For example, if I do mpirun -np 4 ./application then it should use 2 physical cores to run 4 MPI processes (2 processes per core). I am using Open MPI 1.6. I did mpirun -np 4 -nc 2 ./application but wasn't able to run it.
It complains that mpirun was unable to launch the specified application because it could not find an executable.
orterun (the Open MPI SPMD/MPMD launcher; mpirun/mpiexec are just symlinks to it) has some support for process binding but it is not flexible enough to allow you to bind two processes per core. You can try with -bycore -bind-to-core but it will err when all cores already have one process assigned to them.
But there is a workaround - you can use a rankfile where you explicitly specify which slot to bind each rank to. Here is an example: in order to run 4 processes on a dual-core CPU with 2 processes per core, you would do the following:
mpiexec -np 4 -H localhost -rf rankfile ./application
where rankfile is a text file with the following content:
rank 0=localhost slot=0:0
rank 1=localhost slot=0:0
rank 2=localhost slot=0:1
rank 3=localhost slot=0:1
This will place ranks 0 and 1 on core 0 of processor 0 and ranks 2 and 3 on core 1 of processor 0. Ugly but works:
$ mpiexec -np 4 -H localhost -rf rankfile -tag-output cat /proc/self/status | grep Cpus_allowed_list
[1,0]<stdout>:Cpus_allowed_list: 0
[1,1]<stdout>:Cpus_allowed_list: 0
[1,2]<stdout>:Cpus_allowed_list: 1
[1,3]<stdout>:Cpus_allowed_list: 1
Edit: From your other question it becomes clear that you are actually running on a hyperthreaded CPU. Then you would have to figure out the physical numbering of your logical processors (it's a bit confusing, but physical numbering corresponds to the value of processor: as reported in /proc/cpuinfo). The easiest way to obtain it is to install the hwloc library. It provides the hwloc-ls tool that you can use like this:
$ hwloc-ls --of console
...
NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
PU L#0 (P#0) <-- Physical ID 0
PU L#1 (P#12) <-- Physical ID 12
...
Physical IDs are listed after P# in the brackets. In your 8-core case the second hyperthread of the first core (core 0) would most likely have ID 8 and hence your rankfile would look something like:
rank 0=localhost slot=p0
rank 1=localhost slot=p8
rank 2=localhost slot=p1
rank 3=localhost slot=p9
(note the p prefix - don't omit it)
If you don't have hwloc or you cannot install it, then you would have to parse /proc/cpuinfo on your own. Hyperthreads would have the same values of physical id and core id but different processor and apicid. The physical ID is equal to the value of processor.
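Alternatively, on reasonably recent kernels you can read the sibling information straight from sysfs instead of parsing /proc/cpuinfo yourself; for example (illustrative output for an 8-core machine with hyperthreading, matching the numbering assumed above):
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,8
Here logical CPUs 0 and 8 are the two hardware threads of the same physical core.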
I'm not sure whether you have multiple machines, or exactly how you want the processes distributed, but I'd consider reading up:
mpirun man page
The manual indicates that it has ways of binding processes to different things, including nodes, sockets, and cpu cores.
It's important to note that you may achieve this simply by running twice as many processes as you have CPU cores, since they will tend to distribute evenly across the cores to share the load.
I'd try something like the following, though the manual is somewhat ambiguous and I'm not 100% sure it will behave as intended, provided you have a dual-core machine:
mpirun -np 4 -npersocket 4 ./application
If you use PBS, or something like it, I would suggest this kind of submission:
qsub -l select=128:ncpus=40:mpiprocs=16 -v NPROC=2048 ./pbs_script.csh
In this submission I select 128 compute nodes that have 40 cores each and use 16 of them. In my case, I have 20 physical cores per node.
This submission blocks all 40 cores of each node so that nobody else can use those resources; it keeps other users from landing on the same node and competing with your job.
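A bare-bones pbs_script.csh for such a submission could look like the following (just a sketch; a real script would typically add module loads and other environment setup):
#!/bin/csh
# NPROC (2048 = 128 nodes x 16 MPI ranks) is passed in via qsub -v
cd $PBS_O_WORKDIR
mpirun -np $NPROC ./application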
Using Open MPI 4.0, the two commands:
mpirun --oversubscribe -c 8 ./a.out
and
mpirun -map-by hwthread:OVERSUBSCRIBE -c 8 ./a.out
worked for me (I have a Ryzen 5 processor with 4 cores and 8 logical cores).
I tested with a do loop that performs operations on real numbers. All logical threads are used, but there seems to be no speedup: the computation takes about twice as long as with the -c 4 option (without oversubscription).
You can run
mpirun --use-hwthread-cpus ./application
In this case, Open MPI will consider a processor to be a thread provided by hyperthreading, in contrast to the default behavior, where it considers a processor to be a CPU core.
Open MPI denotes the threads provided by the Hyperthreading as "hardware threads" when you use this option, and allocates one Open MPI processor per "hardware thread".
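For instance, on a machine with 4 cores and 8 hardware threads, something like the following should start 8 ranks without needing --oversubscribe, because each hardware thread now counts as a slot (./application is a placeholder for your own binary):
mpirun --use-hwthread-cpus -np 8 ./application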