Why does performance degrade after enabling hyper-threading? - linux

I ported Linux kernel 2.6.32 to an Intel(R) Xeon(R) CPU E31275 @ 3.40GHz. If I enable hyper-threading in the BIOS, I can see 8 CPU cores (CPU0 ~ CPU7). Most interrupts occur on CPU4, and the CPU usage of that core is much higher than the others (almost twice as high). I don't understand this, because I don't think I set up any IRQ binding.
If I disable hyper-threading in the BIOS, then everything is OK: the IRQs are balanced, and the CPU usage of all cores (CPU0 ~ CPU3) is nearly balanced, too.
Can someone explain this? Is it BIOS related? Do I need any special kernel settings?

Some programs are actually hurt by HT (Hyper-Threading). To explain this, you have to understand what HT is.
As you said, you see 8 CPU cores (CPU0-CPU7), but that is not quite true: your CPU has 4 physical cores. The 8 cores are logical cores, so each physical core runs 2 hardware threads (and acts as if it were 2 cores).
Usually HT helps programs run faster, because the CPU/OS is able to run 8 threads of work at the same time; without HT you can only run 4 at the same time.
You don't have to change any settings, since you can't change how the cores are presented. If you are the developer of the program, you could revisit the code and optimize it for HT if you want, or you can simply disable HT.
One more point, because of a common misconception: HT does NOT increase the raw power of the CPU. Even when you see 8 cores at, say, 4GHz (GHz alone says little; compute power is better measured in FLOPS), you have the same power as when you turn HT off and have 4 cores at 4GHz.
With HT on, each pair of logical cores shares 1 physical core of your CPU.
Here is some more information about HT:
http://www.makeuseof.com/tag/hyperthreading-technology-explained/
I couldn't find my old link to a very nice site with code snippets showing code that is bad for HT and code that is good for it (bad meaning slower with HT than without, and vice versa).
TL;DR: not every program benefits from HT; it depends on how the program was written.
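
As a rough illustration of that TL;DR, here is a minimal hedged C++ sketch (an assumed workload, not from the original answer): a purely compute-bound loop split across 1, 2, 4, and 8 threads. On a 4-core/8-thread CPU, the step from 4 to 8 threads typically helps far less than the step from 2 to 4, because the two hyper-threads of a core contend for the same execution units.

// Minimal sketch (assumed workload): time a compute-bound task split
// across N threads to see how hyper-threading scales.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Serial floating-point work keeps a core's execution units busy,
// leaving little idle time for the sibling hyper-thread to exploit.
static double burn(long iters) {
    double x = 1.0;
    for (long i = 0; i < iters; ++i)
        x = x * 1.000000001 + 0.000000001;
    return x;
}

int main() {
    const long total = 800000000L;  // arbitrary total workload
    for (unsigned n : {1u, 2u, 4u, 8u}) {
        std::vector<double> results(n);
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([&results, i, total, n] {
                results[i] = burn(total / n);
            });
        for (auto& w : workers) w.join();
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("%2u threads: %.2f s (checksum %.3f)\n",
                    n, secs, results[0]);
    }
}

On a typical 4C/8T part you would expect near-linear speedup up to 4 threads, then only a modest extra gain (or none) at 8; exact numbers are machine-dependent.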

Related

c++ std::async : faster on 4 cores compared to 8 cores

I have 16000 jobs to perform.
Each job is independent. There is no shared memory, no interprocess communication, no lock or mutex.
I am on Ubuntu 16.04, using C++11, with an Intel® Core™ i7-8550U CPU @ 1.80GHz × 8.
I use std::async to split the jobs between cores.
If I split the jobs into 8 batches (2000 per core), the computation time is 145.
If I split the jobs into 4 batches (4000 per core), the computation time is 60.
The output after the reduce is the same in both cases.
If I monitor the CPU during the computation (using htop), things happen as expected (8 cores are used at 100% in the first case, only 4 cores at 100% in the second).
I am very confused as to why 4 cores process so much faster than 8.
The i7-8550U has 4 cores and 8 threads.
What is the difference? Quoting How-To Geek:
Hyper-threading was Intel’s first attempt to bring parallel computation to consumer PCs. It debuted on desktop CPUs with the Pentium 4 HT back in 2002. The Pentium 4’s of the day featured just a single CPU core, so it could really only perform one task at a time, even if it was able to switch between tasks quickly enough that it seemed like multitasking. Hyper-threading attempted to make up for that.
A single physical CPU core with hyper-threading appears as two logical CPUs to an operating system. The CPU is still a single CPU, so it’s a little bit of a cheat. While the operating system sees two CPUs for each core, the actual CPU hardware only has a single set of execution resources for each core. The CPU pretends it has more cores than it does, and it uses its own logic to speed up program execution. In other words, the operating system is tricked into seeing two CPUs for each actual CPU core.
Hyper-threading allows the two logical CPU cores to share physical execution resources. This can speed things up somewhat: if one virtual CPU is stalled and waiting, the other virtual CPU can borrow its execution resources. Hyper-threading can help speed your system up, but it’s nowhere near as good as having actual additional cores.
By splitting the jobs across more cores than are physically available, you are paying a big penalty.
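
A minimal sketch of the setup being discussed (the job body and all names are placeholders, not the asker's actual code): splitting 16000 independent jobs into 4 versus 8 std::async tasks and timing both.

// Hedged sketch of the asker's setup: 16000 independent jobs split
// into K async tasks. On a 4-core/8-thread CPU, K=4 can beat K=8
// because 8 compute-bound tasks contend for 4 cores' execution units.
#include <chrono>
#include <cstdio>
#include <future>
#include <vector>

// Placeholder job: any independent, CPU-bound computation.
static double do_job(int job_id) {
    double x = job_id;
    for (int i = 0; i < 200000; ++i) x = x * 1.0000001 + 1e-7;
    return x;
}

static double run_batch(int first, int last) {
    double acc = 0;
    for (int j = first; j < last; ++j) acc += do_job(j);
    return acc;
}

int main() {
    const int total_jobs = 16000;
    for (int k : {4, 8}) {  // number of async tasks
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::future<double>> futs;
        for (int i = 0; i < k; ++i)
            futs.push_back(std::async(std::launch::async, run_batch,
                                      i * total_jobs / k,
                                      (i + 1) * total_jobs / k));
        double sum = 0;
        for (auto& f : futs) sum += f.get();  // the "reduce" step
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("%d tasks: %.2f s (sum=%.3f)\n", k, secs, sum);
    }
}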

Is 1 vCPU on Google Compute Engine basically half of 1 physical CPU core?

Google's Machine types page states that:
For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge)...etc
Assuming that a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system, the n1-standard-2 machine, described as 2 virtual CPUs and 7.5 GB of memory, essentially has 1 physical CPU core, right?
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
Typically, when software tells you how many cores it needs, it doesn't take hyper-threading into account. Remember, AMD didn't even have an equivalent of Hyper-Threading until fairly recently. So 2 cores means 2 vCPUs. Yes, a single HT CPU core shows up as 2 CPUs to the OS, but it does NOT quite perform like 2 truly independent CPU cores.
That's correct, you should aim for a GCE machine type that has 4 vCPUs... When you're migrating from an on-premises world, you're used to physical cores, which have hyper-threading. In GCP, these are called vCPUs or virtual CPUs. A vCPU is equivalent to one hyper-thread. Therefore, if you have a single-core hyper-threaded CPU on premises, that would essentially be two virtual CPUs to one physical core. Always keep that in mind, as oftentimes people will immediately run a test, say "I have a four-core physical machine and I'm going to run four cores in the cloud," and then ask why the performance isn't the same.
if the n1-standard-2 machine that is described as 2 virtual CPUs and 7.5 GB of memory, then this essentially means 1 CPU core, right?
I believe, yes.
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
I think they mean 2 physical cores regardless of hyper-threading (HT), because the performance of HT is not a stable reference.
But IMO, the recommendation should also include the speed of each physical core.
If the software recommends 2 CPU cores, you need 4 vCPUs on GCP.
https://cloud.google.com/compute/docs/cpu-platforms says:
On Compute Engine, each virtual CPU (vCPU) is implemented as a single hardware multithread on one of the available CPU processors. On Intel Xeon processors, Intel Hyper-Threading Technology supports multiple app threads running on each physical processor core. You configure your Compute Engine VM instances with one or more of these multithreads as vCPUs. The specific size and shape of your VM instance determines the number of its vCPUs.
Long ago and far away, there was a 1 to 1 equivalence between a 'CPU' (such as what one sees in the output of "top"), a socket, a core, and a thread. (And "processor" and/or "chip" too if you like.)
So, many folks got into the habit of using two or more of those terms interchangeably. Particularly "CPU" and "core."
Then CPU designers started putting multiple cores on a single die/chip. So a "socket" or "processor" or "chip" was no longer a single core, but a "CPU" was still 1 to 1 with a "core." So, interchanging those two terms was still "ok."
Then CPU designers started putting multiple "threads" (e.g. hyper-threads) in a single core. The operating system presents each hyper-thread as a "CPU", so there is no longer a 1 to 1 correspondence between "CPU", "thread", and "core".
And, different CPU families can have different numbers of threads per core.
But referring to "cores" when one means "CPUs" persists.
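
To see the core/CPU/thread distinction on a live Linux system, here is a small hedged C++ sketch (the sysfs path is the standard kernel topology interface): it reports the logical CPU count next to the hyper-thread siblings of cpu0.

// Hedged Linux-only sketch: logical CPUs vs. hyper-thread siblings.
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    // Number of "CPUs" the OS presents (logical processors).
    std::cout << "logical CPUs: "
              << std::thread::hardware_concurrency() << "\n";

    // Standard sysfs topology file: which logical CPUs share the
    // physical core of cpu0 (e.g. "0,4" or "0-1" when HT is on).
    std::ifstream f("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
    std::string siblings;
    if (std::getline(f, siblings))
        std::cout << "cpu0 core siblings: " << siblings << "\n";
}

On typical two-way SMT parts, a two-entry siblings list means physical cores = logical CPUs / 2.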

Curious about how to specify the number of cores for MPI in order to get the fastest scientific computation

I have been running several scientific program packages in conjunction with MPI, using the following command:
nohup mpirun -np N -x OMP_NUM_THREADS=M program.exe < input > output &
where the values of N and M depend on the physical CPU cores of my machine. For example, my machine has the following specification:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz
Stepping: 7
In this case, I first tried N = 24 and M = 1, and the calculation ran very slowly. Then I changed N and M to 12 and 2 respectively, and found that this clearly gave the fastest computation.
Why does setting N and M to 12 and 2 give so much higher performance than the first case?
There is no absolute rule on how to run an MPI+OpenMP application.
The only advice is not to run an OpenMP process on more than one socket
(OpenMP was designed for SMP machines with flat memory access, but today most systems are NUMA).
Then just experiment.
Some apps run best in flat MPI (e.g. one thread per task), while some others work best with one MPI task per socket and all available cores for OpenMP.
Last but not least, if you run more than one OpenMP thread per MPI task, make sure your MPI library binds the MPI tasks as expected.
For example, if you run with 12 OpenMP threads but MPI binds each task to one core, you will end up time-sharing and performance will be horrible.
Or if you run with 12 OpenMP threads and the MPI task is bound to 12 cores, make sure the 12 cores are on the same socket (and not 6 on each socket).
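
To verify that last point in practice, here is a hedged sketch (using OpenMP plus the Linux-specific sched_getcpu(), not part of the original answer) that prints which logical CPU each thread actually lands on:

// Hedged sketch: print where each OpenMP thread runs, to verify
// MPI/OpenMP binding. Build: g++ -fopenmp check_binding.cpp
// (check_binding.cpp is a placeholder name; sched_getcpu is Linux-only)
#include <omp.h>
#include <sched.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // Each thread reports the logical CPU it is currently on.
        std::printf("OpenMP thread %d of %d on logical CPU %d\n",
                    omp_get_thread_num(), omp_get_num_threads(),
                    sched_getcpu());
    }
}

Run one copy per MPI task and check that the reported CPUs all belong to one socket in the lscpu output.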
There is no general rule about this because, most of the time, this performance is dependent on the computation properties of the application itself.
Applications with coarse synchronization granularity may scale well using plain MPI code (no multithreading).
If the synchronization granularity is fine, then using shared memory multithreading (such as OpenMP) and placing all the threads in a process close to each other (in the same socket) becomes more important: synchronization is cheaper and memory access latency is critical.
Finally, compute-bound applications (performance is limited by the processor) are likely not to benefit from hyper-threading at all, since two threads sharing a core contend for the functional units it contains. In this case, you may find applications that perform better using N=2 and M=6 than using N=2 and M=12.
Indeed, there is no absolute rule on how to run an MPI+OpenMP application.
I agree with everything Gilles said.
So I want to talk about the CPU in your case.
The specification you give shows that the system has hyper-threading enabled.
But this does not always help; your computer in fact has 12 physical cores.
So I advise you to try some combinations where M * N is between 12 and 24,
like 12*1, 6*2, 6*3.
Which one is best depends on your application.

Matlabpool number of threads vs cores

I have a laptop running Ubuntu on an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz. According to the Intel website for this processor (located here), it has two cores and can run 4 threads at a time in parallel (because although it has 2 physical cores it has 4 logical cores).
When I start matlabpool it starts with local configuration and says it has connected to 2 labs. I suppose this means that it can run 2 threads in parallel. Does it not know that the CPU can actually run 4 threads in parallel?
In my experience, the local configuration of matlabpool uses, by default, the number of physical cores a machine possesses, rather than the number of logical cores. Hence on your machine, matlabpool only connects to two labs.
However, this is just a setting and can be overwritten with the following command:
matlabpool poolsize n
where n is an integer between 1 and 12 denoting the number of labs you want Matlab to use.
Now we get to the interesting bit that I'm a bit better equipped to answer thanks to a quick lesson from @RodyOldenhuis in the comments.
Hyper-threading implies a given physical core can have two threads run through it at the same time. Of course, they can't literally be processed simultaneously. The idea goes more like this: If one of the threads is inefficient in allocating tasks to the core, then the core may exhibit some "down-time". A second thread can take advantage of this "down-time" to get some work done.
In my experience, Matlab is often efficient in its allocation of threads to cores, so with one Matlab thread (i.e. one lab) running through it, a core may have very little "down-time", and hence there will be very little advantage to hyper-threading. My desktop is a core-i7 with 4 physical cores but 8 logical cores. However, I notice very little difference between running a parfor loop with 4 labs versus 8 labs. In fact, 8 labs is often slower due to the start-up costs associated with initializing the extra labs.
Of course, this is probably all complicated by other external factors such as what other programs you might be running simultaneously to Matlab too.
In summary, my suspicion is that even though you could force Matlab to initialize 4 labs (or even 12 labs), you won't see much of a speed-up over 2 labs, since Matlab is generally fairly efficient at allocating tasks to the processor.
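
For contrast with the compute-bound cases above, here is a hedged C++ sketch of the opposite situation: a memory-latency-bound pointer chase stalls the core on cache misses, creating exactly the "down-time" a second hyper-thread can use, so 8 threads may hold their own against 4. The array size and step counts are arbitrary assumptions.

// Hedged sketch: a latency-bound workload where hyper-threading can
// help, because each thread spends most of its time stalled on cache
// misses rather than occupying the core's execution units.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 24;  // ~128 MB, larger than L3
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t(0));
    // Sattolo's algorithm: single-cycle permutation, so the chase
    // below never falls into a short loop.
    std::mt19937_64 rng(42);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    for (unsigned t : {4u, 8u}) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::size_t> last(t);
        std::vector<std::thread> workers;
        for (unsigned k = 0; k < t; ++k)
            workers.emplace_back([&next, &last, k, t] {
                std::size_t i = k;                // distinct start points
                for (long s = 0; s < 10000000L / t; ++s)
                    i = next[i];                  // serial cache misses
                last[k] = i;                      // keep result observable
            });
        for (auto& w : workers) w.join();
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        std::printf("%u threads: %.2f s\n", t, secs);
    }
}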

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores. I can run it with 1, 2, 3, etc. threads and get linear speedup, up to about 5.5x speed on a 6-core CPU on an Ubuntu Linux box.
I had an opportunity to run the program on a very high-end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads.
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads run about 1.7x faster than 1, but 3, 4, 8, 10, and 16 threads all run at just 1.9x! I can see all the threads are running (not stalled or sleeping), they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
My program does not access the disk or network. It's CPU-limited. Its speed scales linearly on a single-CPU box in Ubuntu Linux with a hexacore i7 for 1-6 threads; 6 threads is effectively a 6x speedup.
My program never runs faster than a 2x speedup on this 16-core Sunfire Xeon box, for any number of threads from 2-16.
Running 16 copies of my program single-threaded runs perfectly, all 16 running at once at full speed.
top shows 1600% of CPUs allocated. /proc/cpuinfo shows all 16 cores running at full 2.9GHz speed (not the low-frequency idle speed of 1.6GHz).
There's 48GB of RAM free; it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!
My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Red Hat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64-bit SMP kernels across both tests.
It seems unlikely that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyper-threading turned on with the new machine? (And how does that answer compare to the old machine?) You're not, by chance, running in a virtualized environment?
Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the software. Test one by changing the other, and you'll narrow down your possibilities quickly.
Do some research on rlimit - it's quite possible the shell/user account you're running in has some Red Hat default or admin-set resource limits in place.
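
A quick hedged way to check for such limits from inside the process (standard POSIX getrlimit; the three limits shown are just plausible suspects, and `ulimit -a` in the shell gives the same picture):

// Hedged sketch: dump a few POSIX resource limits for this process,
// to rule out an admin-set rlimit (compare with `ulimit -a`).
#include <sys/resource.h>
#include <cstdio>

static void show(const char* name, int resource) {
    rlimit rl{};
    if (getrlimit(resource, &rl) == 0)
        std::printf("%-12s soft=%llu hard=%llu\n", name,
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
}

int main() {
    show("RLIMIT_CPU", RLIMIT_CPU);     // CPU time, in seconds
    show("RLIMIT_NPROC", RLIMIT_NPROC); // processes/threads per user
    show("RLIMIT_AS", RLIMIT_AS);       // virtual address space
}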
When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see whether the scaling up to 4 threads improves if the process affinity is set to lock it to a single physical CPU. Even better would be to profile the application itself to see where it is spending its time. As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html
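
The affinity experiment suggested here could look like the hedged sketch below (Linux-specific sched_setaffinity; the assumption that logical CPUs 0-3 form one physical processor is exactly that, an assumption; check the real topology with lscpu first):

// Hedged sketch: pin this process (and threads it spawns) to logical
// CPUs 0-3, which are ASSUMED here to be one physical processor;
// verify the real numbering with lscpu before relying on it.
// Build with g++ on Linux (sched_setaffinity is Linux-specific).
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < 4; ++cpu)
        CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) { // 0 = this process
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("Pinned to CPUs 0-3; now re-run the 1-4 thread scaling test.\n");
    // ... start the multithreaded workload here ...
}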
