Intel 12th Gen only efficient core were used for the computation

Intel 12th Gen only efficient core were used for the computation - python-3.x

Using an Intel 12th gen laptop on Windows 11 with parallelization for a computational task.
In the first run both the power cores and the efficient cores were activated, with the power core running at 100% and the efficient core at 80%.
After a while the utilization of the efficient core slowly went up to 100%. But on the second run, the power cores were never used and only the efficient cores were at 100%.
It's not a displacement mistake from the windows 11 because the computer temperature cooled and the fan stopped buzzering. But strangely with the power core at rest the computer still became stuck and froze.
The third run both the power cores and the efficient cores were actived.
How was this possible and how to tell python to use all the cores?

Related

How to utilize the High Performance cores on Apple Silicon

I have developed a macOS app which is heavily relying on multithreading (a call center simulator). It runs fine on my iMac 2019 and fills up all cores nicely. In my test scenario it simulates app. 1.4 mio. telephone calls in total in 100 iterations, each iteration as a dispatch item on a parallel dispatch queue.
Now I have bought a new Mac mini with M1 Apple Silicon and I was eager to see how the performance develops on that test machine. Well, it’s not bad but not as good as I expected:
System
Duration
iMac 2019, Intel 6-core i5, 3.0 GHz, Catalina macOS 10.15.7
19.95 s
Mac mini, M1 8-core, Big Sur macOS 11.2, Rosetta2
26.85 s
Mac mini, M1 8-core, Big Sur macOS 11.2, native ARM
17.07 s
Investigating a little bit further I noticed that at the start of the simulation all 8 cores of the M1 Mac are filled up properly but after a few seconds only the 4 high efficiency cores are used any more.
I have read the Apple docs „Optimize for Apple Silicon with performance and efficiency cores“ and double checked that the dispatch queue for the iterations is set up properly:
let simQueue = DispatchQueue.global(qos: .userInitiated)
But no success. After a few seconds of running the high performance cores are obviously not utilized any more. I even tried to set up the queue with qos set to .userInteracive up that didn’t help either. I also flagged the dispatch items with proper qos but that didn’t change anything. It looks to me that other apps (e.g. XCode) do utilize the high performance cores even for a longer time.
Does anybody know how to force a M1 Mac to utilize the high performance cores?

"M1 8 core" is really "M1 4 performance + 4 power saving cores". I expect it to have be a bit more performance than an Intel 6 core, but not much. Exactly has you see, 15% faster than six Intel cores or about as fast as 7 Intel cores would be. The current M1 chips are low end processors. "A bit better than Intel six cores" is quite good.
Your code must be running on the performance cores, otherwise there would be no chance at all to come close to the Intel performance. In that graph, nothing tells you which cores are used.
What happens most likely is that all cores start running, each trying to do one eighth of the work, and after about 8 seconds the performance cores have their work done. Then the power saving cores move their work to the performance cores. And you are just misinterpreting the image as only low performance cores doing the work.

I would guess that Apple has put a preference on using efficiency cores over performance for many reasons. Battery life being one, and most likely thermal reasons as well. This is the big question mark with a SoC that originally was designed for smartphones and tablets. MacOS is a much heavier OS then IOS or iPad OS. Apple most likely felt that performance cores were only needed in the cases where maximum throughput was needed. No doubt, I think some including myself with a M1 Mac Mini would like a way to adjust this balance between efficiency and performance cores. Personally overall, I would prefer all cores be capable of switching between efficiency and performance such as in Intel's Speed shift technology. This may come along with the M1's advancements in terms of Mac Pro models and other Pro models.

Using all physical or logical cores for multiprocessing?

I think this question isn't new, but I couldn't find a clear answer for this: I have a Fortran-code and a Intel core i7 with 6 physical and 12 logical cores. At this moment my code is running on the 6 physical cores, i.e. on each core is running the same code. I saw at the inter power gadget that the utilization is ca. 50%, therefore I want to know if I get my results (nearly) twice as fast if I run the code on the 12 logical cores. I have also heard that I get in certain circumstances my results slower if I use the logical cores for running the code, so I'm not sure about what to do.
Thank you for your help!

GPU vs CPU? Number of cores/threads in a GPU for program calculation acceleration?

I need some help understanding the concept of cores on a GPU vs. cores in a CPU for the purpose of doing parallel calculations.
When it comes to cores in a CPU, it seems pretty simple. I have a super intensive "for" loop that iterates four times. I have four cores in my Intel i5 2.26GHz CPU. I give one loop to each core. Each of the four loops is independent of the other. Boom - I now have four threads created and 100% CPU usage (instead of 25% CPU usage with only one core). My "for" loop now runs almost four times faster than it would have if I did not parallelize it. By the way, for the "for" loop, I was using the auto-parallelization available on Microsoft Visual Studio 2012, as in this online example:(http://msdn.microsoft.com/en-us/library/hh872235.aspx).
In contrast, I don't even know the number of cores in my laptop's GPU (Intel Graphics Media Accelerator HD, or Intel HD Graphics, with 1696MB shared memory) that I can use for parallel calculations. I don't even know a valid way of comparing the GPU to the CPU. When I see "12#500MHz" next to my graphics card description, I wonder if that means the graphics card has 12 cores for parallelization that can work kinda like the 4 cores in a CPU, except that the GPU cores run at 500MHz [slow] instead of 2.26GHz [fast]? Is there a GPU usage comparable to the CPU usage in Windows task manager? I'm an utter novice trying to use the C++ library in visual studio 2012, if that makes any difference. When I write the actual GPU software, the parallelization code looks like this:(http://msdn.microsoft.com/en-us/library/hh265137.aspx).
So, would you please fill some of the gaps or mistakes in my knowledge or help me compare the two? I don't need a super complicated answer, something as simple as "You can't compare a CPU core with a GPU core because of blankity blank" or "a GPU core isn't really a core like a CPU core is" would be very much appreciated.

First, the OS initiate more cores only if you ask for them in your code. Try using OpenMP or Win32 threads to achieve parallelism on your i5.
Second, the CPU clocking is more than GPU clocking. If the clocking of GPU is same as CPU, you can use it as a stove to cook. The cores in the GPU are more than CPU. There is a difference between a thread and core.
Third, I recommend you to read specifications and reference manuals for your CPU and GPU. Also, dont forget PCI-e. It is the bottleneck for Parallel Programming implementation.
Hope this clarifies your doubts. Any more questions, feel free to ask.

How would a multithreaded program be more energy efficient?

In its Energy-Efficient Software Guidelines Intel suggests that programs are designed multithreaded for better energy efficiency.
I don't get it. Suppose I have a quad core processor that can switch off unused cores. Suppose my code is perfectly parallelizeable (synchronization overhead is negligible).
If I use only one core I burn one core for one hour, if I use four cores I burn four cores for 15 minutes - the same amount of core-hours either way. Where's the saving?

I suspect it has to do with a non-linear relation between CPU utilization and power consumption. So if you can spread 100% CPU utilization over 4 CPUs each will have 25% utilization - and say 12% consumption.
This is especially true when dynamic CPU scaling is used according to Wikipedia the power drain of a CPU is P = C(V^2)F. When a CPU is running faster it requires higher voltages - and that 'to the power of 2' becomes crucial. Furthermore the voltage will be a function of F (which means F can be solved for V) giving something like P = C(F^2)F. Thus by spreading the load over 4 CPUs (running at 100% capacity at that frequency) you can mitigate the cost for the same work.
We can make F a function of L (load) at 100% of one core (as it would be in your OS), so:
F = 1000 + L/100 * 500 = 1000 + 5L
p = C((1000 + 5L)^2)(1000 + 5L) = C(1000 + 5L)^3
Now that we can relate load (L) to the power consumption we can see the characteristics of the power consumption given everything on one core:
p = C(1000 + 5L)^3
p = 1000000000 + 15000000L + 75000L^2 + 125L^3
Or spread over 4 cores:
p = 4C(1000 + (5/4)L)^3
p = 4000000000 + 15000000L + 18750.4L^2 + 7.5L^3
Notice the factors in front of the L^2 and L^3.

During that one hour, the one core isn't the only thing you keep running.

You burn 4 times energy with 4 cores but you do 4 times more work too! If, as you said, the synchro is negligible and the work is parallelizable, you'll spend 4 times less time.
Using multiple threads can save energy when you have i/o waits. One thread can wait while other threads can perform other computations; instead of having your application idle.

A CPU is one part of a computer. It has fans, a motherboard, hard drives, graphics card, RAM etc, lets call this the BASE. If your doing scientific computing (i.e., a compute cluster) you are powering many computers. If you are powering 100's of BASE's anyway, why not allow those BASES to have multiple physical CPU's on them so those CPU's can share the resources of the BASE, physical and logical.
Now INTEL's marketing blurb probably also depends on the fact that these days, each CPU wafer contains multiple cores. Powering multiple physical CPU's is different to powering a single physical cpu with multiple cores.
So if amount of work done per unit of power is the benchmark in question, then modern CPU's performing highly parallel tasks then yes you get more bang for your buck, compared with the previous generation of processors. As not only can you get more cores / cpu, it is also common to get BASE's which can take multiple CPU's.
One may easily assert that one top-end system can now house the processing power of 8-16 singl-cpu single-core CPU's of the past (assuming that in this hypothetical case, that on the new system and the older generation system, each core has the same processing power ).

If a program is multithreaded that doesn't mean that it would use more cores. It just means that more tasks are dealt with in the same time so the overall processor time is shorter.

There are 3 reasons, two of which have already been pointed out:
More overall time means that other (non-CPU) components need to run longer, even if the net calculation for the CPU remains the same
More threads mean more things are done at the same time (because stalls are used for something useful), again the overall real time is reduced.
The CPU power consumption for running the same calculations on one core is not the same. Intel CPUs have a built-in clock boosting for single-core usage (I forgot the marketing buzzword for it). A higher clock means dysproportionally more power consumption and dysproportionally more heat, which again requires the fan to spin faster, too.
So in summary, you consume more power with the CPU and more power for cooling the CPU for a longer time, and you run other components for a longer time, too.
As a 4th reason, one could allege (note that this is only an assumption!) that Intel CPUs are hyperthreaded, and since hyperthreaded cores share some resources, running two threads at once is more efficient than running one thread twice as long.

Linux per-process resource limits - a deep Red Hat Mystery

I have my own multithreaded C program which scales in speed smoothly with the number of CPU cores.. I can run it with 1, 2, 3, etc threads and get linear speedup.. up to about 5.5x speed on a 6-core CPU on a Ubuntu Linux box.
I had an opportunity to run the program on a very high end Sunfire x4450 with 4 quad-core Xeon processors, running Red Hat Enterprise Linux. I was eagerly anticipating seeing how fast the 16 cores could run my program with 16 threads..
But it runs at the same speed as just TWO threads!
Much hair-pulling and debugging later, I see that my program really is creating all the threads, they really are running simultaneously, but the threads themselves are slower than they should be. 2 threads runs about 1.7x faster than 1, but 3, 4, 8, 10, 16 threads all run at just net 1.9x! I can see all the threads are running (not stalled or sleeping), they're just slow.
To check that the HARDWARE wasn't at fault, I ran SIXTEEN copies of my program independently, simultaneously. They all ran at full speed. There really are 16 cores and they really do run at full speed and there really is enough RAM (in fact this machine has 64GB, and I only use 1GB per process).
So, my question is if there's some OPERATING SYSTEM explanation, perhaps some per-process resource limit which automatically scales back thread scheduling to keep one process from hogging the machine.
Clues are:
My program does not access the disk or network. It's CPU limited. Its speed scales linearly on a
single CPU box in Ubuntu Linux with
a hexacore i7 for 1-6 threads. 6
threads is effectively 6x speedup.
My program never runs faster than
2x speedup on this 16 core Sunfire
Xeon box, for any number of threads
from 2-16.
Running 16 copies of
my program single threaded runs
perfectly, all 16 running at once at
full speed.
top shows 1600% of
CPUs allocated. /proc/cpuinfo shows
all 16 cores running at full 2.9GHz
speed (not low frequency idle speed
of 1.6GHz)
There's 48GB of RAM free, it is not swapping.
What's happening? Is there some process CPU limit policy? How could I measure it if so?
What else could explain this behavior?
Thanks for your ideas to solve this, the Great Xeon Slowdown Mystery of 2010!

My initial guess would be shared memory bottlenecks. From what you say, your performance pretty much flatlines after 2 CPUs. You initially blame Redhat, but I'd be curious to see what happens if you install Ubuntu on the same hardware. I assume, of course, that you're running 64 bit SMP kernels across both tests.
It's probably not possible that the motherboard would peak at utilizing 2 CPUs. You have another machine with multiple cores that has provided better performance. Do you have hyperthreading turned on with the new machine? (and how does that answer compare to the old machine?). You're not, by chance, running in a virtualized environment?
Overall, your evidence is pointing to a ludicrously slow bottleneck somewhere. As you said, you're not I/O bound, so that leaves the CPU and memory. Either something is wrong with the hardware, or something is wrong with the hardware. Test one by changing the other, and you'll narrow down your possibilities quickly.

Do some research on rlimit - it's quite possible the shell/user acct you're running in has some RH-default or admin-set resource limits in place.

When you see this kind of odd scaling behaviour, especially if problems are seen with multiple threads, but not multiple processes, one thing to start looking at is the impacts of lock contention and other synchronisation primitives, which can cause threads running on different processors to have to wait for each other, potentially forcing multiple cores to flush their cache to main memory.
This means memory architecture starts to come into play, and that's going to be substantially faster when you have 6 cores on a single piece of silicon than when you're coordinating across 4 separate processors. Specifically, the single CPU case likely isn't needing to hit main memory for locking operations at all - everything is likely being handled at the L3 cache level, allowing the CPU to get on with things while data is flushed to main memory in the background.
While I expect the OP has lost interest in the question after all this time (or may not even have access to the hardware any more), one way to check this would be to see if the scaling up to 4 threads improves if the process affinity is set to lock it to a single physical CPU. Even better though would be to profile the application itself to see where it is spending it's time.As you change architectures and increase the number of cores, it gets harder and harder to guess where the bottlenecks are, so you really need to start measuring things directly, as in this example: http://postgresql.1045698.n5.nabble.com/Sun-Donated-a-Sun-Fire-T2000-to-the-PostgreSQL-community-td2057445.html

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string