Max number of threads that can run on a unix machine - multithreading

I have below configuration of a unix machine:
Command :
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
Result :
CPUs : 8
Thread(s) per core : 8
Core(s) per socket : 1
Socket(s) : 1
My understanding is:
Max. number of threads that can run on this machine = Sockets X Cores per Socket X Thread(s) per Core
Or
Max. number of threads that can run on this machine = CPUs
Is this understanding correct?
or
Is there a different formula to determine the maximum number of threads that can run on a machine?
EDIT
I meant max. number of threads that can run in parallel.
e.g. by starting n number of threadpools etc.
For increasing performance of my application I want to run it on max. number of threads, can it be determined by above parameters?
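A minimal sketch of the arithmetic in the question, assuming lscpu-style "Key : value" lines as input. The function name logical_cpus and the SAMPLE text are made up for illustration. The product is the number of hardware threads that can execute at the same instant; it is not a limit on how many software threads the OS lets you create.

```python
import os

def logical_cpus(lscpu_text):
    """Sockets x cores per socket x threads per core, from lscpu-style text."""
    fields = {}
    for line in lscpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return (int(fields["Socket(s)"])
            * int(fields["Core(s) per socket"])
            * int(fields["Thread(s) per core"]))

# Values copied from the question above.
SAMPLE = """\
Thread(s) per core : 8
Core(s) per socket : 1
Socket(s) : 1
"""

if __name__ == "__main__":
    print(logical_cpus(SAMPLE))  # product for the machine in the question
    print(os.cpu_count())        # logical CPUs Python sees on the current host
```

On a real machine the product should agree with the CPU(s) line of lscpu, so either form of the formula in the question can be used.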

Related

Linux command that tracks statistics of CPU usage while running application on HPC/HTC

In my PBS script, I am running matlab and would like to know how many cores were actually used during the run. In particular, I would like to know the maximum number of cores used at any one time.
If I only allocate x number of cores but at any time matlab uses more than x number of cores then my job will be stopped and cancelled by the HPC/HTC system.
Ideally the command and output would be as simple as
cpustats matlab -nojvm -r "someExperiment(params);exit()"
Max CPU usage: 12.5 cores
Average CPU usage: 6 cores
Min CPU usage: 0.5 cores
I can't monitor the progress manually because it is a batch script, so I am planning to run it once with plenty of cores and then adjust the remaining jobs so I don't have to wait so long.
I have searched and searched for a command like this but the following don't seem to be what I am looking for
top finds the current cpu usage which I don't have access to
ps finds cpu allotted to a process and not actual usage
watch might be useful to query random cpu times and output them but would like a continuous stream if possible
time is really close to what I want but doesn't keep track of peak CPU usage
The most similar question I could find was this one about peak memory usage
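No command named cpustats is known to me, so the following is only a sketch of how such a tool could work on Linux, assuming /proc is readable for the job's own process. It samples the accumulated CPU time of one PID and converts each interval into "cores used" (CPU seconds divided by wall seconds). All function names here are hypothetical, and CPU time of child processes is not included.

```python
import os
import time

def cores_used(samples):
    """Turn (wall_seconds, cpu_seconds) samples into per-interval core counts."""
    usage = []
    for (w0, c0), (w1, c1) in zip(samples, samples[1:]):
        if w1 > w0:
            usage.append((c1 - c0) / (w1 - w0))
    return usage

def summarize(usage):
    """Return (max, average, min) of the core counts, or None if there are none."""
    if not usage:
        return None
    return max(usage), sum(usage) / len(usage), min(usage)

def read_cpu_seconds(pid):
    """Linux only: user + system CPU time of a process, summed over its threads."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')'. utime and stime are fields 14 and 15 of the file,
    # which are indexes 11 and 12 of the remainder.
    rest = data.rsplit(")", 1)[1].split()
    ticks = int(rest[11]) + int(rest[12])
    return ticks / os.sysconf("SC_CLK_TCK")

def monitor(pid, interval=1.0):
    """Sample until /proc/<pid> disappears, then return the summary."""
    samples = []
    try:
        while True:
            samples.append((time.monotonic(), read_cpu_seconds(pid)))
            time.sleep(interval)
    except (FileNotFoundError, ProcessLookupError):
        pass
    return summarize(cores_used(samples))
```

Each value is an average over one sampling interval, so a burst shorter than the interval will be smoothed out; a shorter interval gives a sharper peak estimate at the cost of more overhead.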

Curious about how to specify the number of core for MPI in order to get the fastest scientific computation

I have been running several scientific program packages in conjunction with MPI using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that this setting was clearly faster.
Why does setting N = 12 and M = 2 give better performance than the first case?
There is no absolute rule on how to run an MPI+OpenMP application.
The only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today most systems are NUMA).
Then just experiment.
Some apps run best in flat MPI (i.e. one thread per task), while others work best with one MPI task per socket and all available cores for OpenMP.
Last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
For example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time sharing and performance will be horrible.
Or, if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket).
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
Indeed, there is no absolute rule on how to run an MPI+OpenMP application.
I agree with all Gilles said.
So I want to talk about the CPU in your case.
The specification you give shows that the system has hyper-threading enabled.
But this does not always help. Your computer in fact has 12 physical cores.
So I advise you to try some combinations where M * N is between 12 and 24,
like 12*1, 6*2, 6*3.
Which one is best depends on your application.
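To make the "just experiment" advice concrete, here is a small sketch that lists candidate (N, M) pairs to try. It assumes every MPI task should stay inside one socket, so it only keeps thread counts M that divide the cores per socket. The function name hybrid_layouts is hypothetical, and the mpirun line simply mirrors the command from the question.

```python
def hybrid_layouts(total_threads, cores_per_socket):
    """(N, M) pairs with N * M == total_threads where no task straddles a socket."""
    layouts = []
    for m in range(1, cores_per_socket + 1):
        # M must divide both the total and the socket size, otherwise some
        # MPI task would end up with threads on two sockets.
        if total_threads % m == 0 and cores_per_socket % m == 0:
            layouts.append((total_threads // m, m))
    return layouts

if __name__ == "__main__":
    # The machine in the question: 2 sockets x 6 cores = 12 physical cores.
    for n, m in hybrid_layouts(12, 6):
        print(f"mpirun -np {n} -x OMP_NUM_THREADS={m} program.exe")
```

Calling hybrid_layouts(24, 6) instead gives the layouts that also use the hyper-threads, which include the N = 12, M = 2 setting from the question. Timing each candidate on a short run is still the only way to pick the winner.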

Reserve CPU time for one program?

Prerequisites
I have a (physical) server running multiple (virtual) servers. There are 11 servers in total, number 0 through 9 are invoked by
servinit XXXXn
Where XXXXn is the port number and n is the server number. The other server is invoked by
apiinit
And runs on port 8080. In conclusion, there are 11 processes, 10 with the binary name servinit and one with apiinit.
Goal
The servinit processes must always be responsive; in other words, the apiinit process must never consume all CPU time. I want to limit the total CPU time for apiinit to a percentage, let's say 90, so that the servinit processes always have 10 percent CPU headroom to operate flawlessly.
What is the most efficient way of handling this?
Software
The physical server runs
Ubuntu Desktop
Release 12.04 (precise) 64-bit
Kernel: 3.14.32-xxxx-std-ipv6-64
Since you run a 3.14+ Linux kernel you can easily constrain the CPU share of a running application through the SCHED_DEADLINE policy. This policy allows you to set the CPU share of an application by setting a budget and a period (the meaning is that the application is not allowed to consume more than its budget in each period). For example, if the budget is 3 msec and the period is 10 msec, the application can consume at most 30% of the CPU. In particular, the system will guarantee 3 msec every 10 msec.
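The budget/period arithmetic can be sketched as follows, assuming the share is simply budget divided by period as described above. The helper name deadline_runtime_ns is made up for illustration; it only computes the numbers and does not set any scheduling policy itself.

```python
def deadline_runtime_ns(cpu_percent, period_ms):
    """Budget in nanoseconds that gives cpu_percent of the CPU in each period."""
    if not 0 < cpu_percent <= 100:
        raise ValueError("cpu_percent must be in (0, 100]")
    period_ns = int(period_ms * 1_000_000)
    return period_ns * cpu_percent // 100

if __name__ == "__main__":
    print(deadline_runtime_ns(30, 10))  # the 3 msec / 10 msec example
    print(deadline_runtime_ns(90, 10))  # the 90 percent cap from the question
```

For the 90 percent goal in the question, a 10 msec period would therefore need a 9 msec budget.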
Solved using the optional package cpulimit.
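Example, limiting apiinit to 50 percent of the total CPU time on a dual-core CPU (the -l value is a percentage of a single core, so 100 corresponds to half of two cores):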
sudo cpulimit -e apiinit -l 100 &

assign two MPI processes per core

How do I assign 2 MPI processes per core?
For example, if I do mpirun -np 4 ./application then it should use 2 physical cores to run 4 MPI processes (2 processes per core). I am using Open MPI 1.6. I did mpirun -np 4 -nc 2 ./application but wasn't able to run it.
It complains mpirun was unable to launch the specified application as it could not find an executable:
orterun (the Open MPI SPMD/MPMD launcher; mpirun/mpiexec are just symlinks to it) has some support for process binding but it is not flexible enough to allow you to bind two processes per core. You can try with -bycore -bind-to-core but it will err when all cores already have one process assigned to them.
But there is a workaround - you can use a rankfile where you explicitly specify which slot to bind each rank to. Here is an example: in order to run 4 processes on a dual-core CPU with 2 processes per core, you would do the following:
mpiexec -np 4 -H localhost -rf rankfile ./application
where rankfile is a text file with the following content:
rank 0=localhost slot=0:0
rank 1=localhost slot=0:0
rank 2=localhost slot=0:1
rank 3=localhost slot=0:1
This will place ranks 0 and 1 on core 0 of processor 0 and ranks 2 and 3 on core 1 of processor 0. Ugly but works:
$ mpiexec -np 4 -H localhost -rf rankfile -tag-output cat /proc/self/status | grep Cpus_allowed_list
[1,0]<stdout>:Cpus_allowed_list: 0
[1,1]<stdout>:Cpus_allowed_list: 0
[1,2]<stdout>:Cpus_allowed_list: 1
[1,3]<stdout>:Cpus_allowed_list: 1
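For more than a handful of ranks, the rankfile can be generated rather than typed. This is a minimal sketch that reproduces the socket:core slot syntax shown above; the function name make_rankfile and its parameters are hypothetical, and it assumes all ranks go on a single socket of one host.

```python
def make_rankfile(n_ranks, per_core, host="localhost", socket=0):
    """Rankfile text that puts per_core consecutive ranks on each core."""
    lines = []
    for rank in range(n_ranks):
        core = rank // per_core  # ranks 0..per_core-1 share core 0, and so on
        lines.append(f"rank {rank}={host} slot={socket}:{core}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # 4 ranks, 2 per core: the same content as the rankfile above.
    print(make_rankfile(4, 2), end="")
```

Writing the returned string to a file named rankfile lets the mpiexec command above be used unchanged.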
Edit: From your other question it becomes clear that you are actually running on a hyperthreaded CPU. Then you would have to figure out the physical numbering of your logical processors (it's a bit confusing but physical numbering corresponds to the value of processor: as reported in /proc/cpuinfo). The easiest way to obtain it is to install the hwloc library. It provides the hwloc-ls tool that you can use like this:
$ hwloc-ls --of console
...
NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
PU L#0 (P#0) <-- Physical ID 0
PU L#1 (P#12) <-- Physical ID 12
...
Physical IDs are listed after P# in the brackets. In your 8-core case the second hyperthread of the first core (core 0) would most likely have ID 8 and hence your rankfile would look something like:
rank 0=localhost slot=p0
rank 1=localhost slot=p8
rank 2=localhost slot=p1
rank 3=localhost slot=p9
(note the p prefix - don't omit it)
If you don't have hwloc or you cannot install it, then you would have to parse /proc/cpuinfo on your own. Hyperthreads would have the same values of physical id and core id but different processor and apicid. The physical ID is equal to the value of processor.
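Parsing /proc/cpuinfo by hand could look roughly like the sketch below. It assumes the usual x86 layout: one blank-line-separated block per logical processor, each with processor, physical id and core id fields. Both function names and the SAMPLE text are made up for illustration.

```python
from collections import defaultdict

def hyperthread_groups(cpuinfo_text):
    """Map (physical id, core id) -> sorted list of processor numbers."""
    groups = defaultdict(list)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            # Assumption: both id fields exist; some VMs and non-x86 systems omit them.
            core = (int(fields["physical id"]), int(fields["core id"]))
            groups[core].append(int(fields["processor"]))
    return {core: sorted(cpus) for core, cpus in groups.items()}

def rankfile_for_hyperthreads(groups, n_ranks, host="localhost"):
    """One rank per hardware thread, filling all threads of a core first."""
    slots = [cpu for core in sorted(groups) for cpu in groups[core]]
    return "".join(f"rank {r}={host} slot=p{slots[r]}\n" for r in range(n_ranks))

# Invented excerpt in /proc/cpuinfo style: two cores with two hyperthreads each.
SAMPLE = """\
processor\t: 0
physical id\t: 0
core id\t\t: 0

processor\t: 1
physical id\t: 0
core id\t\t: 1

processor\t: 8
physical id\t: 0
core id\t\t: 0

processor\t: 9
physical id\t: 0
core id\t\t: 1
"""

if __name__ == "__main__":
    print(rankfile_for_hyperthreads(hyperthread_groups(SAMPLE), 4), end="")
```

Processors that end up in the same group are hyperthreads of one physical core, and their processor numbers are the values to put after the p prefix in the rankfile.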
I'm not sure if you have multiple machines or not, and the exact details of how you want the processes distributed, but I'd consider reading up:
mpirun man page
The manual indicates that it has ways of binding processes to different things, including nodes, sockets, and cpu cores.
Note that you will probably get close to this simply by running twice as many processes as you have CPU cores, since the scheduler tends to distribute them evenly over the cores to share the load, although without explicit binding nothing guarantees the placement.
I'd try something like the following, though the manual is somewhat ambiguous and I'm not 100% sure it will behave as intended, as long as you have a dual core:
mpirun -np 4 -npersocket 4 ./application
If you use PBS, or something like that, i would suggest this kind of submission:
qsub -l select=128:ncpus=40:mpiprocs=16 -v NPROC=2048 ./pbs_script.csh
In this submission I select 128 compute nodes that have 40 cores each, and use 16 of those cores per node. In my case, I have 20 physical cores per node.
With this submission I reserve all 40 cores of each node, so nobody else can use these resources. This prevents other people from using the same node and competing with your job.
Using Open MPI 4.0, the two commands:
mpirun --oversubscribe -c 8 ./a.out
and
mpirun -map-by hwthread:OVERSUBSCRIBE -c 8 ./a.out
worked for me (I have a Ryzen 5 processor with 4 cores and 8 logical cores).
I tested with a do loop that includes operations on real numbers. All logical threads are used, though it seems that there is no speedup benefit since computation takes double the amount of time compared to using -c 4 option (with no oversubscribing).
You can run
mpirun --use-hwthread-cpus ./application
In this case, Open MPI treats each thread provided by Hyper-Threading as a processor. This contrasts with the default behavior, where a processor is a CPU core.
Open MPI refers to the threads provided by Hyper-Threading as "hardware threads" when you use this option, and allocates one Open MPI processor per "hardware thread".

How can I ensure that a process runs in a specific physical CPU core and thread?

This question asks about ensuring two processes run on the same CPU. Using sched_setaffinity I can limit a process to a number of logical CPUs, but how can I ensure that these are mapped to specific physical CPUs and threads?
I expect that the mapping would be:
0 - CPU 0 thread 0
1 - CPU 0 thread 1
2 - CPU 1 thread 0
3 - CPU 1 thread 1
etc...
where the number on the left is the relevant CPU used in sched_setaffinity.
However, when I tried to test this, it appeared that this is not necessarily the case.
To test this, I used the CPUID instruction, which returns the initial APIC ID of the current core in EBX:
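#include <iostream>

void print_cpu()
{
    unsigned int eax = 1, ebx, ecx = 0, edx;
    // CPUID leaf 1: bits 31..24 of EBX hold the initial APIC ID.
    // CPUID overwrites EAX, EBX, ECX and EDX, so all four are declared as outputs.
    __asm__ volatile(
        "cpuid"
        : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    std::cout << "I am running on cpu " << std::hex << (ebx >> 24) << std::dec << std::endl;
}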
Then I looped over the bits in the cpu mask and set them one at a time so that the OS would migrate the process to each logical CPU in turn, and then I printed out the current CPU.
This is what I got:
cpu mask is 0
I am running on cpu 0
cpu mask is 1
I am running on cpu 4
cpu mask is 2
I am running on cpu 2
cpu mask is 3
I am running on cpu 6
cpu mask is 4
I am running on cpu 1
cpu mask is 5
I am running on cpu 5
cpu mask is 6
I am running on cpu 3
cpu mask is 7
I am running on cpu 7
Assuming that the CPU assigns initial APIC IDs according to the scheme I listed above, it would seem that the cpu mask doesn't actually correspond to the physical core and thread.
How can I find the correct mapping of bits in the mask for sched_setaffinity to physical cores?
hwloc is a portable C library for discovering hardware/NUMA topology, and also binding processes/threads to particular cores. It has functions to discover physical/logical cores, and then bind a process/thread to it.
It also looks like it can return a cpu_set_t for use with sched_setaffinity(), if you want to keep using that directly.
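As an alternative to hwloc (not a use of hwloc itself), the kernel's own view of the topology can be read from sysfs on Linux. The sketch below assumes the thread_siblings_list files exist under /sys/devices/system/cpu, which is the case on typical Linux systems. The helper names are hypothetical; os.sched_setaffinity is the Python wrapper around the same system call used in the question.

```python
import os

def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def thread_siblings(cpu):
    """Linux only: logical CPUs that share a physical core with `cpu`."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())

def pin_to(cpu):
    """Linux only: restrict the calling process to one logical CPU."""
    os.sched_setaffinity(0, {cpu})
```

Calling thread_siblings(n) for each bit n of the mask shows which logical CPUs are two threads of the same core, which is the mapping the question is after, without relying on how APIC IDs happen to be assigned.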

Resources