Linux - Need to limit a 16-core system to 4 cores to test performance running multiple programs

I have a couple of programs that will be deployed on a system with a limited number of cores.
I want to test the performance of these programs on my current system, which is a lot more powerful than the one they will run on.
Is the only way to properly limit the resources through a virtual machine on my system, or can I just restrict my system to meet the same core limit as the system my programs will run on?

taskset might help you.
Start your application your_command as follows:
taskset -ac 0-3 your_command
# -c 0-3 : your_command may only run on cores 0 to 3
# -a     : apply the affinity to all threads of the task, not just the main one
If the application is already running:
taskset -acp 0-3 PID
# PID = process ID
See this answer to 'Limit process to one cpu core' at Unix & Linux for further details.
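To confirm the restriction took effect, you can combine taskset with nproc, which reports the number of processing units available to the calling process rather than the number of online CPUs (a quick sanity check, assuming GNU coreutils' nproc):
taskset -c 0-3 nproc
# prints 4 on your 16-core machine, because only cores 0-3 are allowed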

Related

openMp and the number of cores vs cpus

I'm wondering how OpenMP figures out how many threads it can run via the omp_get_max_threads() library call. I'm running on a CentOS Linux machine using gcc -fopenmp. My machine has 16 AMD Opteron(tm) Processor 6136 CPUs, each with 8 cores, all according to /proc/cpuinfo. If I run omp_get_num_procs() it returns 16. But omp_get_max_threads() also returns 16. Why isn't the max thread count 16*8?
When I run a program that uses 16 threads I see the program in top running at ~1600% of CPU and if I toggle 'Last used cpu (SMP)' that number moves around a bit. So the 1600% makes sense but is there any way to know which cores of which CPUs the threads are running on?
I'm pretty new to openmp so sorry if these questions seem naive.
You can use the hwloc tool set to find out how the threads of any application are bound to hardware threads/cores. You only need the name or the PID of the target running process. Here is an example:
$ hwloc-ps --pid 2038168 --threads --get-last-cpu-location
2038168 Machine:0 ./a.out
2038168 Core:5 a.out
2038169 Core:3 a.out
2038170 Core:1 a.out
2038171 Core:4 a.out
2038172 Core:0 a.out
2038173 Core:2 a.out
Here we can see that the process a.out (with PID 2038168) uses 6 threads, each mapped to a different core.
However, the mapping of threads on cores over time can change if you do not configure OpenMP properly (a starting point is to set the environment variables OMP_PROC_BIND and OMP_PLACES).
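For example, a common baseline (a sketch; the right placement policy depends on your application) is to pin one thread per physical core before launching:
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # pack threads onto places near the master thread; they will not migrate
./a.out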
Additionally, you can use hwloc-ls to see the topology of your machine (how many cores there are, how many hardware threads, how they are connected, etc.).
I am very surprised you can have 16 "AMD Opteron(tm) Processor 6136" CPUs. Indeed, this processor uses the G34 socket, which is available in up to 4-socket arrangements (and 8 dies). So please check this with hwloc-ls!
An alternative way is to use a profiling tool (such as Intel VTune).

Netlogo HPC CPU Percentage Use Increase

I submit jobs using headless NetLogo to an HPC server with the following script:
#!/bin/bash
#$ -N r20p
#$ -q all.q
#$ -pe mpi 24
/home/abhishekb/netlogo/netlogo-5.1.0/netlogo-headless.sh \
--model /home/abhishekb/models/corrected-rk4-20presults.nlogo \
--experiment test \
--table /home/abhishekb/csvresults/corrected-rk4-20presults.csv
Below is a snapshot of the cluster queue obtained with:
qstat -g c
I wish to know whether I can increase the CQLOAD for my simulations, and what it signifies. I couldn't find a clear explanation online.
CPU usage check:
qhost -u abhishekb
When I run the BehaviorSpace experiment on my PC through the GUI, assigning high priority to the task makes it use nearly 99% of the CPU, which makes it run faster because it uses a greater percentage of the processor. I wish to accomplish the same here.
A typical HPC environment is designed to run only one MPI process (or OpenMP thread) per CPU core, which therefore has access to 100% of the CPU time, and this cannot be increased further. In contrast, on a classical desktop/server machine, a number of processes compete for CPU time, and it is indeed possible to increase the performance of one of them by setting the appropriate priority with nice.
It appears that CQLOAD is the mean load average for that computing queue. If you are not using all the CPU cores in it, it is not a useful indicator. Besides, even the load average per core for your runs merely reflects the efficiency of the code on this HPC cluster. For instance, a value of 0.7 per core would mean that the code spends 70% of its time doing calculations, while the remaining 30% is probably spent waiting to communicate with the other computing nodes (which is also necessary).
Bottom line, the only way you can increase the CPU percentage use on an HPC cluster is by optimising your code. Normally though, people are more concerned about the parallel scaling (i.e. how the time to solution decreases with the number of CPU cores) than with the CPU percentage use.
1. CPU percentage load
I agree with rth's answer regarding trying to use Linux job priority / renice to increase the CPU percentage - it's almost certain not to work, and (as you've found) you're unlikely to be able to do it anyway, as you won't have superuser privileges on the nodes (it's pretty unlikely you can even log into the worker nodes - probably only the head node).
The CPU usage of your model as it runs is mainly a function of your code structure - if it runs at 100% CPU locally, it will probably run like that on the node for the time it's running.
Here are some answers to the more specific parts of your question:
2. CQLOAD
You ask
CQLOAD (what does it mean too?)
The docs for this are hard to find, but you link to the spec of your cluster, which tells us that its scheduling engine is Sun's Grid Engine. The man pages are here (you can access them locally too - in particular by typing man qstat).
If you search them for qstat -g c, you will see the output described. In particular, the second column (CQLOAD) is described as:
OUTPUT FORMATS
...
an average of the normalized load average of all queue hosts. In order to reflect each hosts different significance the number of configured slots is used as a weighting factor when determining cluster queue load. Please note that only hosts with a np_load_value are considered for this value. When queue selection is applied only data about selected queues is considered in this formula. If the load value is not available at any of the hosts '-NA-' is printed instead of the value from the complex attribute definition.
This means that CQLOAD gives an indication of how utilized the processors in the queue are. Your output screenshot above shows 0.84, so the average load on the (in-use) processors in all.q is 84%. This doesn't seem too low.
3. Number of nodes reserved
In a related question, you state that colleagues are complaining that your processes are not using enough CPU. I'm not sure what that's based on, but I wonder if the real problem here is that you're reserving a lot of nodes (even if just for a short time) for a job that they can see could work with fewer.
You might want to experiment with using fewer nodes (unless your runs become intolerably slow) - that is achieved by altering the line #$ -pe mpi 24 - maybe take the number 24 down. You can work out (roughly) how many nodes you need by timing how long one model run takes on your computer and then using
N = ((time to run 1 job) * number of runs in experiment) / (time you want the run to take)
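For example (an illustrative calculation, not from the original post): if one run takes 10 minutes, the experiment has 96 runs, and you want the whole experiment done in about 240 minutes, then N = (10 * 96) / 240 = 4 nodes.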
So you want to make your program run faster on Linux by giving it a higher priority than all other processes?
In that case you have to modify something called the program's niceness. This is normally done by invoking the command nice when you first start the program, or the command renice while the program is already running. A process can have a niceness from -20 to 19 (inclusive), where lower values give the process a higher priority. For security reasons, you can only decrease a process's niceness if you are the superuser (root).
So if you want to make a process run with a higher priority, then from within bash do:
[abhishekb@hpc ~]$ start_process &
[abhishekb@hpc ~]$ jobs -x sudo renice -n -20 -p %+
Or just use the last command and replace the %+ with the process id of the process you want to increase the priority for.

Could posix-threads created in one process run parallel across two physical processors?

Could posix threads created in one program (process) run on two physical processors?
I have some multi-threaded code that needs to run on a dual eight-core AMD server node (eight real cores each, no hyperthreading). I'm not sure whether these threads can be mapped across the two physical processors.
Also, it would be very helpful if someone could suggest some Linux commands for monitoring CPU usage.
Thank you in advance.
You can use the default commands that come with any Linux distro:
1) top
2) ps
top is interactive and displays different parameters, updating them over time.
ps is useful with the aux argument:
ps aux
It will display various parameters about the active programs.
You can consult the man pages for these commands to make them display the info you need.
Yes, the threads may run on different CPUs unless you do something at the OS layer to bind them to a specific CPU.
top is one of the commands to monitor CPU usage
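To see which core each thread of a multi-threaded process last ran on, one option (a sketch; replace <PID> with your process id) is:
ps -L -o pid,tid,psr,pcpu -p <PID>
# -L lists one row per thread; the PSR column is the processor the thread last ran on
You can get a live view with top -H -p <PID> (the -H flag shows individual threads), and pressing 1 inside top toggles a per-CPU usage breakdown.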

Node.js - targeting a cpu core

How do you start a node process targeting a specific CPU core?
I've seen node cluster, but I'm interested in starting two different processes, on different cores.
I was assuming there was a way of doing this when starting node from the command line, i.e.:
node myapp.js
I'd be interested to know how to do this in both windows and linux, if there is a difference.
In general, the scheduler will do a pretty good job of this without any help. In the wild, I've only seen one situation where this mattered....
We were deploying a node service on an 8-core box, and during load testing we made the strangest observation... the service actually performed better with 7 workers than with 8. A bit of debugging later, we figured out that all network interrupts were being handled by core 0. I played with using 15 workers so that core 0 would have a 1/2 share of the load compared to the other cores. Ultimately, I think we ended up going with 7 workers because it was simpler and more predictable, and the complexity just wasn't worth it for ~7% more theoretical throughput.
That being said, to set CPU affinity on linux:
$ taskset -pc 3089
pid 3089's current affinity list: 0,1
$ taskset -p 3089
pid 3089's current affinity mask: 3 # core 0 = 0x1, core 1 = 0x2
$ taskset -pc 1,2,3 3089
pid 3089's current affinity list: 0,1
pid 3089's new affinity list: 1
On Linux you can use taskset to run node with a given CPU affinity. See this post for information on using the start command in Windows to do the same.
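You can also set the affinity at launch time rather than retroactively - a minimal sketch, reusing the myapp.js entry point from the question:
taskset -c 2 node myapp.js
# node (and any children it forks) may only be scheduled on core 2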

assign two MPI processes per core

How do I assign 2 MPI processes per core?
For example, if I do mpirun -np 4 ./application, then it should use 2 physical cores to run the 4 MPI processes (2 processes per core). I am using Open MPI 1.6. I tried mpirun -np 4 -nc 2 ./application but wasn't able to run it: it complains that mpirun was unable to launch the specified application as it could not find an executable.
orterun (the Open MPI SPMD/MPMD launcher; mpirun/mpiexec are just symlinks to it) has some support for process binding but it is not flexible enough to allow you to bind two processes per core. You can try with -bycore -bind-to-core but it will err when all cores already have one process assigned to them.
But there is a workaround - you can use a rankfile where you explicitly specify which slot to bind each rank to. Here is an example: in order to run 4 processes on a dual-core CPU with 2 processes per core, you would do the following:
mpiexec -np 4 -H localhost -rf rankfile ./application
where rankfile is a text file with the following content:
rank 0=localhost slot=0:0
rank 1=localhost slot=0:0
rank 2=localhost slot=0:1
rank 3=localhost slot=0:1
This will place ranks 0 and 1 on core 0 of processor 0 and ranks 2 and 3 on core 1 of processor 0. Ugly but works:
$ mpiexec -np 4 -H localhost -rf rankfile -tag-output cat /proc/self/status | grep Cpus_allowed_list
[1,0]<stdout>:Cpus_allowed_list: 0
[1,1]<stdout>:Cpus_allowed_list: 0
[1,2]<stdout>:Cpus_allowed_list: 1
[1,3]<stdout>:Cpus_allowed_list: 1
Edit: From your other question it becomes clear that you are actually running on a hyperthreaded CPU. Then you would have to figure out the physical numbering of your logical processors (it's a bit confusing, but the physical numbering corresponds to the value of processor: as reported in /proc/cpuinfo). The easiest way to obtain it is to install the hwloc library. It provides the hwloc-ls tool, which you can use like this:
$ hwloc-ls --of console
...
NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0
PU L#0 (P#0) <-- Physical ID 0
PU L#1 (P#12) <-- Physical ID 12
...
Physical IDs are listed after P# in the brackets. In your 8-core case the second hyperthread of the first core (core 0) would most likely have ID 8 and hence your rankfile would look something like:
rank 0=localhost slot=p0
rank 1=localhost slot=p8
rank 2=localhost slot=p1
rank 3=localhost slot=p9
(note the p prefix - don't omit it)
If you don't have hwloc or you cannot install it, then you would have to parse /proc/cpuinfo on your own. Hyperthreads would have the same values of physical id and core id but different processor and apicid. The physical ID is equal to the value of processor.
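For example, one way to pull out just those fields (a sketch using plain grep; the field names are exactly as they appear in /proc/cpuinfo) is:
grep -E '^(processor|physical id|core id|apicid)' /proc/cpuinfo
# entries sharing 'physical id' and 'core id' but differing in 'processor'/'apicid' are hyperthread siblings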
I'm not sure whether you have multiple machines, or exactly how you want the processes distributed, but I'd consider reading up on:
mpirun man page
The manual indicates that it has ways of binding processes to different things, including nodes, sockets, and cpu cores.
It's important to note that you may achieve this simply by running twice as many processes as you have CPU cores, since they will tend to distribute evenly over the cores to share the load.
I'd try something like the following, though the manual is somewhat ambiguous and I'm not 100% sure it will behave as intended, as long as you have a dual core:
mpirun -np 4 -npersocket 4 ./application
If you use PBS, or something like that, I would suggest this kind of submission:
qsub -l select=128:ncpus=40:mpiprocs=16 -v NPROC=2048 ./pbs_script.csh
In the present submission I select 128 computational nodes that have 40 CPUs each, and use 16 of them. In my case, I have 20 physical cores per node.
In this submission I block all 40 CPUs of each node so that nobody else can use these resources; this avoids other people using the same node and competing with your job.
Using Open MPI 4.0, the two commands:
mpirun --oversubscribe -c 8 ./a.out
and
mpirun -map-by hwthread:OVERSUBSCRIBE -c 8 ./a.out
worked for me (I have a Ryzen 5 processor with 4 cores and 8 logical cores).
I tested with a do loop that includes operations on real numbers. All logical threads are used, though there seems to be no speedup benefit, since the computation takes double the amount of time compared to the -c 4 option (with no oversubscription).
You can run
mpirun --use-hwthread-cpus ./application
In this case, Open MPI will consider that a processor is a hardware thread provided by Hyper-Threading, in contrast with the default behaviour, where it considers a processor to be a CPU core.
Open MPI refers to the threads provided by Hyper-Threading as "hardware threads" when you use this option, and allocates one Open MPI process per "hardware thread".
