Dual processor vs single processor for a LAMP server - multithreading

I have a question regarding server performance of a dual processor socket server vs a single processor socket server. My two choices are as follows. Assume that the rest like RAM and HD are identical.
1 Processor:
Xeon E5-2630 v3 # 2.40GHz
Socket: LGA2011-v3
Clockspeed: 2.4 GHz
Turbo Speed: 3.2 GHz
No of Cores: 8 (2 logical cores per physical)
VS.
2 Processors:
Xeon E3-1270 v3 # 3.50GHz
Socket: LGA1150
Clockspeed: 3.5 GHz
Turbo Speed: 3.9 GHz
No of Cores: 4 (2 logical cores per physical)
Combined, the number of cores is the same, so in theory the server performance should be the same right?
The server is a LAMP box with a huge database constantly being queried with select queries.

Combined, the number of cores is the same, so in theory the server performance should be the same right?
The stats you listed yourself differ, why would you expect the same level of performance?
Most bottlenecks in real-world non-optimized codebases stem from cache misses, so you want a machine with more cpu cache. That's the first one.
Similarly, more than one socket is typically detrimental to performance unless you have a workload which knows how to use them.
But the real question is what kind of traffic is expected here. Perhaps there is 0 reason to go with such a box in the first place. You really want to consult someone versed in the area and not just buy/rent a box because you can.

Related

Parallel computing: how to share computing resources among users?

I am running a simulation on a Linux machine with the following specs.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU # 2.40GHz
Stepping: 4
CPU MHz: 3099.902
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
This is the run command line script for my solver.
/path/to/meshfree/installation/folder/meshfree_run.sh # on 1 (serial) worker
/path/to/meshfree/installation/folder/meshfree_run.sh N # on N parallel MPI processes
I share the system with another colleague of mine. He uses 10 cores for his solution. What would be the fastest option for me in this case? Using 30 MPI processes?
I am a Mechanical Engineer with very little knowledge on parallel computing. So please excuse me if the question is too stupid.
Q : "What would be the fastest option for me in this case? ...running short on time. I am already in the middle of a simulation."
Salutes to Aachen. If it were not for the ex-post remark, the fastest option would be to pre-configure the computing eco-system so that:
check full details of your NUMA device - using lstopo, or lstopo-no-graphics -.ascii not the lscpu
initiate your jobs having as many as possible MPI-worker processes mapped on physical (and best each one exclusively mapped onto its private) CPU-core ( as these deserve this as they carry the core FEM / meshing processing workload )
if your FH policy does not forbid one doing so, you may ask system administrator to introduce CPU-affinity mapping ( that will protect your in-cache data from eviction and expensive re-fetches, that would make 10-CPUs mapped exclusively for use by your colleague and the said 30-CPUs mapped exclusively for your application runs and the rest of the listed resources ~ the 40-CPUs ~ being "shared"-for-use by both, by your respective CPU-affinity masks.
Q : "Using 30 MPI processes?"
No, this is not a reasonable assumption for ASAP processing - use as many CPUs for workers, as possible for an already MPI-parallelised FEM-simulations ( they have high degree of parallelism and most often a by-nature "narrow"-locality ( be it represented as a sparse-matrix / N-band-matrix ) solvers, so the parallel-portion is often very high, compared to other numerical problems ) - the Amdahl's Law explains why.
Sure, there might be some academic-objections about some slight difference possible, for cases, where the communication overheads might got slightly reduced on one-less worker(s), yet the need for a brute-force processing rules in FEM/meshed-solvers ( communication costs are typically way less expensive, than the large-scale, FEM-segmented numerical computing part, sending but a small amount of neighbouring blocks' "boundary"-node's state data )
The htop will show you the actual state ( may note process:CPU-core wandering around, due to HT / CPU-core Thermal-balancing tricks, that decrease the resulting performance )
And do consult the meshfree Support for their Knowledge Base sources on Best Practices.
Next time - the best option would be to acquire a less restrictive computing infrastructure for processing critical workloads ( given a business-critical conditions consider this to be the risk of smooth BAU, the more if impacting your business-continuity ).

Understanding output of lscpu

You can see the output from lscpu command -
jack#042:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 # 2.60GHz
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5201.37
Virtualization: VT-x
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
I can see that there are 2 sockets (which is like a processor ??) and inside each of the socket we have 14 cores. So, in total 2x14=28 physical cores. Normally, a CPU can contain multiple cores, so number of CPUs can never be smaller than number of Cores. But, as shown in the output CPUs(s): 56 and this is what is confusing me.
I can see that Thread(s) per core: 2, so these 28 cores can behave like 2x28=56 logical cores.
Question 1: What does this CPUs(s): 56 denote? Does CPU(s) denote number of Virtual/Logical cores, as it cannot be a Physical core atleast?
Question 2: What does this NUMA node mean? Does it represent the socket?
(Copied at the OP’s request.)
“CPU(s): 56” represents the number of logical cores, which equals “Thread(s) per core” × “Core(s) per socket” × “Socket(s)”. One socket is one physical CPU package (which occupies one socket on the motherboard); each socket hosts a number of physical cores, and each core can run one or more threads. In your case, you have two sockets, each containing a 14-core Xeon E5-2690 v4 CPU, and since that supports hyper-threading with two threads, each core can run two threads.
“NUMA node” represents the memory architecture; “NUMA” stands for “non-uniform memory architecture”. In your system, each socket is attached to certain DIMM slots, and each physical CPU package contains a memory controller which handles part of the total RAM. As a result, not all physical memory is equally accessible from all CPUs: one physical CPU can directly access the memory it controls, but has to go through the other physical CPU to access the rest of memory. In your system, logical cores 0–13 and 28–41 are in one NUMA node, the rest in the other. So yes, one NUMA node equals one socket, at least in typical multi-socket Xeon systems.
NUMA stands for Non-Uniform Memory Access. The value of NUMA nodes has to do with performance in terms of accessing the memory, and it's not involved in calculating the number of CPU's you have.
The calculation of 56 CPUs you are getting is based on
CPU's = number of sockets x number of cores per socket x number of threads per core
Here, 2 threads per core indicate that hyper-threading is enabled.
So, you don't have 56 physical processors, but rather a combination of sockets, cores and hyper-threading. The bottom line is that you can run 56 threads in parallel. You can think of sockets to be equivalent of a physical processor.
-- edited based on the excellent comment by Margaret Bloom.
Threads per core: A hardware thread is a sufficient set of registers to represent the current state of one software thread. A core with two hardware threads can execute instructions on behalf of two different software threads without incurring the overhead of context switches between them. The amount of real parallelism that it can achieve will vary depending on what the threads are doing and, on the processor make and model.
Cores per Socket: A core is what we traditionally think of as a processor or a CPU, and a socket is the interface between one or more cores and the memory system. A socket also is the physical connection between a chip or a multi-chip module and the main board. In addition to the cores, a chip/module typically will have at least two levels of memory cache. Each core typically will have its own L1 cache, and then all of the cores on the chip/module will have to share (i.e., compete for) access to at least one higher level cache and, to the main memory.
Socket(s): see above. Big systems (e.g., rack servers) often have more than one. Personal computers, less often.
NUMA...: I can't tell you much about NUMA except to say that communication between threads running on different NUMA nodes works differently from, and is more expensive than, communication between threads running on the same node.

c++ std::async : faster on 4 cores compared to 8 cores

I have 16000 jobs to perform.
Each job is independent. There is no shared memory, no interprocess communication, no lock or mutex.
I am on ubuntu 16.06. c++11. Intel® Core™ i7-8550U CPU # 1.80GHz × 8
I use std::async to split jobs between cores.
If I split the jobs into 8 (2000 per core), computation time is 145.
If I split the jobs into 4 (4000 per core), computation time is 60.
Output after reduce is the same in both case.
If I monitor the CPU during computation (just using htop), things happen as expected (8 cores are used at 100% in first case, only 4 cores are used 100% in second case).
I am very confused why 4 cores would process much faster than 8.
The i7-8550U has 4 cores and 8 threads.
What is the difference? Quoting How-To Geek:
Hyper-threading was Intel’s first attempt to bring parallel
computation to consumer PCs. It debuted on desktop CPUs with the
Pentium 4 HT back in 2002. The Pentium 4’s of the day featured just a
single CPU core, so it could really only perform one task at a
time—even if it was able to switch between tasks quickly enough that
it seemed like multitasking. Hyper-threading attempted to make up for
that.
A single physical CPU core with hyper-threading appears as two logical
CPUs to an operating system. The CPU is still a single CPU, so it’s a
little bit of a cheat. While the operating system sees two CPUs for
each core, the actual CPU hardware only has a single set of execution
resources for each core. The CPU pretends it has more cores than it
does, and it uses its own logic to speed up program execution. In
other words, the operating system is tricked into seeing two CPUs for
each actual CPU core.
Hyper-threading allows the two logical CPU cores to share physical
execution resources. This can speed things up somewhat—if one virtual
CPU is stalled and waiting, the other virtual CPU can borrow its
execution resources. Hyper-threading can help speed your system up,
but it’s nowhere near as good as having actual additional cores.
By splitting the jobs to more cores than available - you are paying a big penalty.

Is 1 vCPU on Google Compute Engine basically half of 1 physical CPU core?

Google's Machine types page states that:
For the n1 series of machine types, a virtual CPU is implemented as a
single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy
Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge)...etc
Assuming that a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system, then if the n1-standard-2 machine that is described as 2 virtual CPUs and 7.5 GB of memory, then this essentially means 1 CPU core, right?
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
Typically when software tells you how many cores they need they don't take hyper-threading into account. Remember, AMD didn't even have that (Hyper-Threading) until very recently. So 2 cores means 2 vCPUs. Yes, a single HT CPU core shows up as 2 CPUs to the OS, but does NOT quite perform as 2 truly independent CPU cores.
That's correct, you should aim for a GCE machine-type that has 4vCPUs... When you're migrating from an on-premises world, you're used to physical cores which have hyperthreading. In GCP, these are called vCPUs or virtual CPUs. A vCPU is equivalent to one hyperthread core. Therefore, if you have a single-core hyperthreaded CPU on premises, that would essentially be two virtual CPUs to one physical core. So always keep that in mind as oftentimes people will immediately do a test. They'll say, "I have a four-cores physical machine and I'm going to run four cores in the cloud" and ask "why their performance isn't the same?!!!"
if the n1-standard-2 machine that is described as 2 virtual CPUs and 7.5 GB of memory, then this essentially means 1 CPU core, right?
I believe, yes.
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
I think, they means 2 physical cores regardless of hyper threading (HT) because the performance of HT is not a stable reference.
But IMO, the recommendation should also contains speed of each physical core.
If the software recommends 2 CPU cores, you need 4 vCPUs on GCP.
https://cloud.google.com/compute/docs/cpu-platforms says:
On Compute Engine, each virtual CPU (vCPU) is implemented as a single hardware multithread on one of the available CPU processors. On Intel Xeon processors, Intel Hyper-Threading Technology supports multiple app threads running on each physical processor core. You configure your Compute Engine VM instances with one or more of these multithreads as vCPUs. The specific size and shape of your VM instance determines the number of its vCPUs.
Long ago and far away, there was a 1 to 1 equivalence between a 'CPU' (such as what one sees in the output of "top"), a socket, a core, and a thread. (And "processor" and/or "chip" too if you like.)
So, many folks got into the habit of using two or more of those terms interchangeably. Particularly "CPU" and "core."
Then CPU designers started putting multiple cores on a single die/chip. So a "socket" or "processor" or "chip" was no longer a single core, but a "CPU" was still 1 to 1 with a "core." So, interchanging those two terms was still "ok."
Then CPU designers started putting multiple "threads" (eg hyperthreads) in a single core. The operating systems would present each hyperthread as a "CPU" so there was no longer a 1 to 1 correspondence between "CPU" and "thread" and "core."
And, different CPU families can have different numbers of threads per core.
But referring to "cores" when one means "CPUs" persists.

Why does the performance become bad after enabling hyperthread?

I port Linux kernel 2.6.32 to Intel(R) Xeon(R) CPU E31275 # 3.40GHz. If I enable hyperthread in BIOS, I can see 8 CPU cores (CPU0 ~ CPU7). Most of interrupts occur in CPU 4, and the CPU usage of this core is much higher than others (almost twice than others). I don't understand it very well, because I think I didn't set any IRQ binding operations.
If I disable hyperthread in BIOS, then everything is OK. The IRQs have been balanced, and the CPU usage of all cores (CPU0 ~ CPU3) are nearly balanced, too.
Can someone explain it? Is it BIOS related? Should I do some special settings in kernel?
Some programs get a negative effect from HT (Hyper Threading), to explain this you have to understand what HT is.
As you said you saw 7 (0-7 is altough 8) cpu cores, this is not true, u have 4 cores in your CPU, the 8 cores are virtual cores, so one core has 2 threads (and acts like he is 2 cores).
Usually HT helps for running programs faster due to the fact the CPU/OS is able to run (doing what ever these programs do) 8 programs at the same time, without HT you can only run 4 at the same time.
You dont have to set any settings since you cant change this appearance, if you are the developer of this program you should recheck the code and optimize it for HT if you want, or you can just disable HT.
Another information due to some bullshit peoples are talking: HT is increasing the power of the CPU
this is NOT true! even when u see 8 cores with lets say 4GHz (GHz says nothing, should be measured in flops) you got the same power as when u turn HT of and got 4 cores with 4GHz.
If you got HT on the 2 virtual cores are sharing 1 physical core from ur CPU.
Here some more informations about HT:
http://www.makeuseof.com/tag/hyperthreading-technology-explained/
I couldnt find my old link to a very nice site where there are code snippets who shows bad code for HT and good code (in the meaning bad of being slower than without HT and the opposite).
TL;DR: not every program benefetis from HT due to its development.

Resources