Parallel computing: how to share computing resources among users? - linux

I am running a simulation on a Linux machine with the following specs.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 3099.902
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
This is the run command line script for my solver.
/path/to/meshfree/installation/folder/meshfree_run.sh # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N # on N parallel MPI processes
I share the system with another colleague of mine. He uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?
I am a Mechanical Engineer with very little knowledge of parallel computing, so please excuse me if the question is too stupid.

Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."
Salutes to Aachen. Were it not for your remark that you are already in the middle of a simulation, the fastest option would have been to pre-configure the computing ecosystem beforehand:
check the full details of your NUMA topology using lstopo (or lstopo-no-graphics -.ascii for a text rendering), not just lscpu
launch your jobs with as many MPI worker processes as possible mapped onto physical CPU cores, ideally each worker exclusively pinned to its own private core, since these workers carry the core FEM / meshing processing workload
if your FH policy does not forbid it, ask the system administrator to introduce CPU-affinity mapping, which protects your in-cache data from eviction and expensive re-fetches: 10 CPUs would be mapped exclusively for your colleague's use, the said 30 CPUs exclusively for your application runs, and the remaining ~40 CPUs left shared for use by both of you via the respective CPU-affinity masks.
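To make the affinity idea above concrete, here is a minimal sketch of such a split using taskset. The core numbering (logical CPUs 0-39 being one hardware thread per physical core) is an assumption; verify the real numbering with lstopo before pinning anything, and note the pinned commands are shown but not executed here.

```shell
# Hypothetical CPU-affinity split for this 2 x 20-core machine.
# Assumption: logical CPUs 0-39 map to one hardware thread per physical core.
COLLEAGUE_CPUS="0-9"    # 10 cores reserved for the colleague's run
MY_CPUS="10-39"         # 30 cores reserved for your MPI workers
echo "colleague: $COLLEAGUE_CPUS  mine: $MY_CPUS"
# Pinning would then look like (PID and paths as in the question; not run here):
#   taskset -cp "$COLLEAGUE_CPUS" <colleague_PID>
#   taskset -c "$MY_CPUS" /path/to/meshfree/installation/folder/meshfree_run.sh 30
```

With such masks in place, neither job's working set gets evicted from the other's caches.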
Q : "Using 30 MPI processes?"
No, this is not the best choice for getting results ASAP. Use as many CPUs for workers as possible: an already MPI-parallelised FEM simulation has a high degree of parallelism and most often a by-nature "narrow" locality (whether represented as a sparse-matrix or N-band-matrix solver), so the parallel portion is very high compared to other numerical problems. Amdahl's Law explains why.
Sure, there might be academic objections that some slight difference is possible in cases where communication overheads get slightly reduced with one fewer worker, yet the need for brute-force processing rules in FEM/meshed solvers: communication costs (sending only a small amount of the neighbouring blocks' "boundary"-node state data) are typically far less expensive than the large-scale, FEM-segmented numerical computing part.
htop will show you the actual state (you may note processes wandering from CPU core to CPU core, due to HT / CPU-core thermal-balancing tricks, which decreases the resulting performance).
And do consult meshfree Support and their Knowledge Base sources on best practices.
Next time, the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads (given business-critical conditions, consider this a risk to smooth BAU, the more so if it impacts your business continuity).


Where to find ipc (or cpi) value of Intel processors (say skylake) when diff no of physical and logical cores are used?

I am very new to this field and my question might be too stupid, but please help me understand the fundamentals here.
I want to know the instructions per cycle (IPC) or cycles per instruction (CPI) of recent Intel processors such as Skylake or Cascade Lake. And I am also looking for these values when different numbers of physical cores are used and when hyper-threading is used.
I thought the SPEC CPU2017 benchmark results could help me here, but I could not find my answer there. They just compare the total execution time with the time taken by some reference machine and give the ratio. Am I missing something here?
I thought this is one of the very first performance parameters and should be calculated and published by some standard benchmark, but I could not find any. Am I missing something here?
Another related question which comes to my mind (and I think everybody might want to know) is: what is the best it can provide using all the cores and threads (lowest CPI and highest IPC)?
Please help me find the IPC / CPI value of Skylake (or any Intel processor) when, say, the maximum (28) cores are used and hyper-threading is also enabled.
The IPC cost of hyperthreading (or SMT in general on non-Intel CPUs) totally depends on the workload.
If you're already bottlenecked on branch mispredicts, cache misses, or long dependency chains (low ILP), having 2 threads running on the same core leads to minimal interference.
(Partitioning the ROB reduces the ability to find ILP in either thread, though, so again it depends on the details.)
Competitive sharing of uop cache and L1d/L1i / L2 caches also might or might not be a problem, depending on cache footprint.
There is no general answer independent of workload.
Some workloads get a major speedup from using HT to double the number of logical cores. Some high-ILP workloads actually do worse because of cache conflicts. (Workloads that can already come close to saturating the front-end at 4 uops per clock on Intel before Icelake, for example).
Agner Fog's microarch guide says a bit about this for some microarchitectures that support hyperthreading. https://agner.org/optimize/
IIRC, some AMD CPUs have higher front-end throughput with hyperthreading, but I think only Bulldozer-family.
Max throughput is not affected by HT, and each core is independent. e.g. 4 uops per clock for a Skylake core. Doubling the number of physical cores always doubles theoretical uops / clock. Obviously not all workloads parallelize efficiently, so running more threads might need more total instructions/uops, and/or create more memory stalls for communication.
Hyperthreading just helps you come closer to that more of the time by letting 2 threads fill each other's "bubbles" from stalls.
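Since published benchmarks won't give you a workload-independent IPC, you can measure it for your own workload with Linux perf. A sketch, where the binary name and the counter values in the arithmetic example are made-up placeholders:

```shell
# Measure instructions and cycles for your own workload (binary name is a placeholder):
#   perf stat -e instructions,cycles ./your_benchmark
# IPC is simply instructions / cycles; e.g. 8e9 instructions over 4e9 cycles:
awk 'BEGIN { printf "IPC = %.1f\n", 8.0e9 / 4.0e9 }'
# To see the SMT effect, run it pinned to one hardware thread of a core, then to
# both siblings of the same core (e.g. taskset -c 0 vs taskset -c 0,28), and
# compare per-thread IPC between the two runs.
```

perf stat also prints the "insn per cycle" ratio directly on recent versions, so the awk step is only there to show the definition.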

Understanding output of lscpu

You can see the output from the lscpu command:
jack@042:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5201.37
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
I can see that there are 2 sockets (which is like a processor??) and inside each socket we have 14 cores, so in total 2x14=28 physical cores. Normally, a CPU can contain multiple cores, so the number of CPUs can never be smaller than the number of cores. But, as shown in the output, CPU(s): 56, and this is what is confusing me.
I can see that Thread(s) per core: 2, so these 28 cores can behave like 2x28=56 logical cores.
Question 1: What does CPU(s): 56 denote? Does CPU(s) denote the number of virtual/logical cores, since it cannot be the number of physical cores?
Question 2: What does this NUMA node mean? Does it represent the socket?
(Copied at the OP’s request.)
“CPU(s): 56” represents the number of logical cores, which equals “Thread(s) per core” × “Core(s) per socket” × “Socket(s)”. One socket is one physical CPU package (which occupies one socket on the motherboard); each socket hosts a number of physical cores, and each core can run one or more threads. In your case, you have two sockets, each containing a 14-core Xeon E5-2690 v4 CPU, and since that CPU supports hyper-threading, each core can run two threads.
“NUMA node” represents the memory architecture; “NUMA” stands for “non-uniform memory access”. In your system, each socket is attached to certain DIMM slots, and each physical CPU package contains a memory controller which handles part of the total RAM. As a result, not all physical memory is equally accessible from all CPUs: one physical CPU can directly access the memory it controls, but has to go through the other physical CPU to access the rest of memory. In your system, logical cores 0–13 and 28–41 are in one NUMA node, the rest in the other. So yes, one NUMA node equals one socket, at least in typical multi-socket Xeon systems.
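The interleaved CPU lists can be sanity-checked arithmetically: node0 holds ranges 0-13 and 28-41, i.e. cores 0-13 plus their sibling hardware threads. A small sketch (the live-inspection command in the comment assumes numactl is installed):

```shell
# Count the logical CPUs in NUMA node0's list from the lscpu output above.
NODE0="0-13,28-41"
count=0
for range in ${NODE0//,/ }; do
  lo=${range%-*}; hi=${range#*-}          # split each "lo-hi" range
  count=$(( count + hi - lo + 1 ))
done
echo "node0 logical CPUs: $count"          # 28 = 14 cores x 2 threads
# Live inspection of node sizes, free memory and inter-node distances:
#   numactl --hardware
```

28 logical CPUs per node times 2 nodes gives the 56 from CPU(s).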
NUMA stands for Non-Uniform Memory Access. The value of NUMA nodes has to do with performance in terms of accessing the memory, and it's not involved in calculating the number of CPU's you have.
The calculation of 56 CPUs you are getting is based on
CPU's = number of sockets x number of cores per socket x number of threads per core
Here, 2 threads per core indicate that hyper-threading is enabled.
So, you don't have 56 physical processors, but rather a combination of sockets, cores and hyper-threading. The bottom line is that you can run 56 threads in parallel. You can think of a socket as the equivalent of a physical processor.
-- edited based on the excellent comment by Margaret Bloom.
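The formula above with this machine's numbers, plus the live cross-check, can be sketched as:

```shell
# sockets x cores-per-socket x threads-per-core, with the values from lscpu:
SOCKETS=2; CORES_PER_SOCKET=14; THREADS_PER_CORE=2
echo "CPU(s): $(( SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE ))"   # CPU(s): 56
# The same figure straight from the kernel on a live system:
#   nproc        (or: getconf _NPROCESSORS_ONLN)
```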
Threads per core: A hardware thread is a sufficient set of registers to represent the current state of one software thread. A core with two hardware threads can execute instructions on behalf of two different software threads without incurring the overhead of context switches between them. The amount of real parallelism that it can achieve will vary depending on what the threads are doing and on the processor make and model.
Cores per Socket: A core is what we traditionally think of as a processor or a CPU, and a socket is the interface between one or more cores and the memory system. A socket also is the physical connection between a chip or a multi-chip module and the main board. In addition to the cores, a chip/module typically will have at least two levels of memory cache. Each core typically will have its own L1 cache, and then all of the cores on the chip/module will have to share (i.e., compete for) access to at least one higher-level cache and to the main memory.
Socket(s): see above. Big systems (e.g., rack servers) often have more than one. Personal computers, less often.
NUMA...: I can't tell you much about NUMA except to say that communication between threads running on different NUMA nodes works differently from, and is more expensive than, communication between threads running on the same node.

Dual processor vs single processor for a LAMP server

I have a question regarding server performance of a dual processor socket server vs a single processor socket server. My two choices are as follows. Assume that the rest like RAM and HD are identical.
1 Processor:
Xeon E5-2630 v3 @ 2.40GHz
Socket: LGA2011-v3
Clockspeed: 2.4 GHz
Turbo Speed: 3.2 GHz
No of Cores: 8 (2 logical cores per physical)
VS.
2 Processors:
Xeon E3-1270 v3 @ 3.50GHz
Socket: LGA1150
Clockspeed: 3.5 GHz
Turbo Speed: 3.9 GHz
No of Cores: 4 (2 logical cores per physical)
Combined, the number of cores is the same, so in theory the server performance should be the same right?
The server is a LAMP box with a huge database constantly being queried with select queries.
Combined, the number of cores is the same, so in theory the server performance should be the same right?
The stats you listed yourself differ; why would you expect the same level of performance?
Most bottlenecks in real-world non-optimized codebases stem from cache misses, so you want the machine with more CPU cache. That's the first one.
Similarly, more than one socket is typically detrimental to performance unless you have a workload which knows how to use them.
But the real question is what kind of traffic is expected here. Perhaps there is 0 reason to go with such a box in the first place. You really want to consult someone versed in the area and not just buy/rent a box because you can.

Curious about how to specify the number of core for MPI in order to get the fastest scientific computation

I have been running several scientific program package in conjunction with MPI by using the following command
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the value of N and M depend on the physical CPU cores of my machine. For example, my machine has the specification like this
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried setting N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that the latter clearly provided the fastest computation.
I was wondering why setting N and M to 12 and 2 gives higher performance than the first case.
there is no absolute rule on how to run an MPI+OpenMP application.
the only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today, most systems are NUMA)
then just experiment.
some apps run best in flat MPI (e.g. one thread per task), while some other work best with one MPI task per socket, and all available cores for OpenMP.
last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
for example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up doing time sharing and performance will be horrible.
or if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket)
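The binding advice above can be sketched with Open MPI's flags (the flag spellings are Open MPI's; other MPI implementations use different options, so check your library's documentation). The commands are shown as comments since they need an MPI installation; the small check at the end is just the oversubscription arithmetic:

```shell
# One MPI task per socket, each given a whole socket for its OpenMP threads,
# on a 2-socket x 6-core machine like the one in the question:
#   mpirun -np 2 --map-by socket --bind-to socket -x OMP_NUM_THREADS=6 program.exe
# Verify what the launcher actually did:
#   mpirun -np 2 --map-by socket --bind-to socket --report-bindings true
# Sanity check: tasks x threads should not exceed the physical cores.
TASKS=2; THREADS=6; PHYS_CORES=12
echo "oversubscribed: $(( TASKS * THREADS > PHYS_CORES ))"   # 0 = no
```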
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
indeed, there is no absolute rule on how to run an MPI+OpenMP application.
I agree with all Gilles said.
so I want to talk about the CPU in your case.
the specification you give shows that the system has hyper-threading enabled.
but this does not always help: your computer has 12 physical cores in fact.
so I advise you to try some combinations that make N * M between 12 and 24,
like 12*1, 6*2, 6*3
which one is best depends on how well your application scales.
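A sketch of how one might enumerate and time those candidate splits; program.exe and its input file are the placeholders from the question, so the actual mpirun line is left commented:

```shell
# Try each N x M split of the 12 physical cores and compare wall times.
for pair in "12 1" "6 2" "6 3" "4 3" "2 6"; do
  set -- $pair; N=$1; M=$2
  echo "N=$N M=$M (total threads: $(( N * M )))"
  # /usr/bin/time -v mpirun -np $N -x OMP_NUM_THREADS=$M program.exe \
  #     < input > out_${N}x${M}
done
```

Comparing the elapsed times across runs shows directly which combination suits the application.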

Benchmarks for linux/apache performance on 1000 concurrent requests

I have a linux server of following configuration:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Stepping: 4
CPU MHz: 2500.046
BogoMIPS: 5000.09
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0
RAM = 2 GB
I am doing a test in JMeter with 500 concurrent threads and a ramp-up period of 10 sec. I get a throughput of 3/sec. How do I know whether this throughput is good or bad? Are there any benchmarks to which I can compare my results, i.e. throughput and deviation?
There's no "good" or "bad" throughput per se. Depending on the goals of the application, and expectations of your stakeholders, 3/sec throughput can be perfect or terrible. So first question would be what are your performance goals?
Secondly, a conclusion about throughput can only be reached when you have a graph, not a single point. A typical throughput graph grows with the number of users and then flattens into a plateau, so without points at lower and higher numbers of users you don't really know whether you are in the "growing" or "plateau" phase. A throughput graph may also deteriorate after the plateau, so with only one point you also don't know whether it is growing, on a plateau or deteriorating.
Finally, the number of users is only one parameter that can influence throughput. Others may include: types of requests, size of data, duration of the test (e.g. a long test may show throughput tendencies to increase or degrade that are invisible in a shorter test), etc.
So my suggestion would be:
Check with your stakeholders what the goals of your application are. For example what do developers expect throughput to be.
Establish a solid benchmark in terms of operations. Something that represents the "majority of users". Make sure your test takes into account operations that may drastically change the throughput (or at least know which of them you excluded on purpose).
Establish the duration of the test, which will likely show a stable throughput (for some applications it can be 10 min, for others it might be 24 hours)
Without changing the test, run it multiple times with incrementing number of users (e.g. 100, 200, 300, ...) to obtain a graph. Continue to increase number of users until either application breaks or performance deteriorates to unacceptable rates.
That will give you the full picture and a good understanding if 3/sec throughput is the best your app can do, or it can do much better. And whether the best your app can do is good enough for your stakeholders.
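The incrementing-users step can be scripted around JMeter's non-GUI mode. A sketch: plan.jmx and the users property are placeholders, and the plan would need to read its thread count via __P(users) for the -J flag to take effect; the loop here only prints the commands it would run.

```shell
# Dry-run of stepped load tests; remove the echo to actually launch JMeter.
for users in 100 200 300 400 500; do
  echo "jmeter -n -t plan.jmx -Jusers=$users -l results_${users}.jtl"
done
```

Plotting throughput from each results_*.jtl file against its user count produces exactly the graph described above.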
