Inconsistent use of multiple threads by numpy/LAPACK eigvalsh on different machines - python-3.x

I need to diagonalize complex (Hermitian) matrices of dimension > 2000 using numpy.linalg.eigvalsh. On one computer, top shows that numpy is multithreading, while on the other it shows a single thread. Both computers run essentially identical OSes (Arch Linux, Python 3.10), and the output of numpy.show_config() is absolutely identical on both machines. The machine on which I see multithreading is a laptop with 16 GB RAM and an i7-8550U @ 1.80GHz CPU (4 physical cores). The one on which I don't has an Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (48 physical cores) and 180 GB RAM. Is this behaviour expected? What am I missing? Thanks!
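One way to check what is actually happening at runtime is the third-party threadpoolctl package (an assumption on my part, it is not mentioned in the question): it reports which BLAS/LAPACK library numpy actually loaded and how many threads each pool is allowed to use, which numpy.show_config() does not tell you.

# Minimal sketch using the third-party threadpoolctl package to inspect
# the BLAS/LAPACK thread pools that numpy.linalg.eigvalsh dispatches to.
import numpy as np  # importing numpy loads its BLAS/LAPACK backend
from threadpoolctl import threadpool_info

# Print each loaded threading backend (OpenBLAS, MKL, OpenMP, ...),
# the shared library it comes from, and the thread count it will use.
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"], pool["filepath"])

If the two machines report different num_threads, look for environment variables such as OMP_NUM_THREADS or OPENBLAS_NUM_THREADS capping the pool on the Xeon box.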

Related

What could make a program running on two identical computers show a big difference in IPC (instructions per cycle)?

OS: Linux 4.4.198-1.el7.elrepo.x86_64
CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz × 4
MEM: 376 GB
I have a program (LSTM model inference based on TensorFlow 1.14) that runs on two machines with the same hardware; one performs badly while the other performs well (about a 10x difference).
Using Intel's pqos tool I measured very different IPC numbers for the two processes (0.07 on one machine vs 2.5 on the other). Both processes are pinned to specific CPU cores, and neither machine is heavily loaded. The problem appeared two weeks ago; before that the bad machine worked fine, and the shell history shows no configuration changes.
I checked a lot of environment details, including the kernel, filesystem, process scheduler, I/O scheduler, and the MD5 checksums of the program and its libraries: all identical. The bad machine's iLO shows no errors, and the program mainly burns CPU.
I used sysbench to test both machines (CPU and memory); it shows about a 25% performance difference, with the bad machine slower at prime calculation. Could it be a hardware problem?
I don't know the root cause of the IPC difference (which tracks the performance difference). How can I dig into this?
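Since sysbench already shows the bad machine losing about 25% on raw prime calculation, one quick thing to rule out (my suggestion, not from the thread) is clock throttling on the pinned cores. On Linux the current per-core frequency is exposed through sysfs:

# Sketch: print the current clock of each online CPU (Linux cpufreq).
# Assumes the cpufreq driver is loaded; paths may differ on other kernels.
from pathlib import Path

for cpu in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    freq_file = cpu / "cpufreq" / "scaling_cur_freq"
    if freq_file.exists():
        print(f"{cpu.name}: {int(freq_file.read_text()) / 1e6:.2f} GHz")

A lower clock would explain slower wall-time results like the sysbench gap, even though it does not by itself change instructions per cycle.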

Dual processor vs single processor for a LAMP server

I have a question regarding the performance of a dual-socket server vs a single-socket server. My two choices are as follows; assume that the rest (RAM, disks) is identical.
1 Processor:
Xeon E5-2630 v3 @ 2.40GHz
Socket: LGA2011-v3
Clockspeed: 2.4 GHz
Turbo Speed: 3.2 GHz
No of Cores: 8 (2 logical cores per physical)
VS.
2 Processors:
Xeon E3-1270 v3 @ 3.50GHz
Socket: LGA1150
Clockspeed: 3.5 GHz
Turbo Speed: 3.9 GHz
No of Cores: 4 (2 logical cores per physical)
Combined, the number of cores is the same, so in theory the server performance should be the same right?
The server is a LAMP box with a huge database constantly being queried with select queries.
Combined, the number of cores is the same, so in theory the server performance should be the same right?
The stats you listed yourself differ, so why would you expect the same level of performance?
Most bottlenecks in real-world, non-optimized codebases stem from cache misses, so you want the machine with more CPU cache: that's the first one.
Similarly, more than one socket is typically detrimental to performance unless your workload knows how to use them (i.e., is NUMA-aware).
But the real question is what kind of traffic is expected here. Perhaps there is zero reason to go with such a box in the first place. You really want to consult someone versed in the area rather than just buy/rent a box because you can.

Mac OS only 4 out of 8 cores being utilized

I have a very CPU-intensive program that I run on macOS. It doesn't do any blocking I/O, just matrix multiplications.
I noticed that only 4 out of the 8 cores of my MacBook Pro are being used, based on the CPU histogram below. Any idea why?
[Screenshots: macOS CPU utilization histogram; Mac hardware configuration]
Spawning 8 processes gets 100% utilization on all 8 logical cores.
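A minimal sketch of that answer's approach, assuming the workload is Python (the payload function here is a hypothetical stand-in for the asker's matrix math):

# Sketch: saturate every logical core by spawning one process per core.
import multiprocessing as mp

def burn(n):
    # Pure-CPU payload, no blocking I/O; stands in for the real work.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    procs = mp.cpu_count()  # 8 logical cores on the asker's MacBook Pro
    with mp.Pool(processes=procs) as pool:
        pool.map(burn, [50_000_000] * procs)

Threads alone would not get there for pure-Python work because of CPython's GIL; separate processes sidestep it.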

Is 1 vCPU on Google Compute Engine basically half of 1 physical CPU core?

Google's Machine types page states that:
For the n1 series of machine types, a virtual CPU is implemented as a
single hardware hyper-thread on a 2.6 GHz Intel Xeon E5 (Sandy
Bridge), 2.5 GHz Intel Xeon E5 v2 (Ivy Bridge)...etc
Assuming that a single physical CPU core with hyper-threading appears as two logical CPUs to an operating system, the n1-standard-2 machine described as 2 virtual CPUs and 7.5 GB of memory essentially means 1 physical CPU core, right?
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
Typically, when software tells you how many cores it needs, it doesn't take hyper-threading into account; remember, AMD didn't even have an equivalent until very recently. So a recommendation of 2 cores means 2 physical cores, not 2 vCPUs. Yes, a single hyper-threaded CPU core shows up as 2 CPUs to the OS, but it does NOT quite perform like 2 truly independent CPU cores.
That's correct, you should aim for a GCE machine type that has 4 vCPUs. When you're migrating from an on-premises world, you're used to physical cores with hyper-threading; in GCP these are called vCPUs, or virtual CPUs, and a vCPU is equivalent to one hyper-thread. So a single-core hyper-threaded CPU on premises corresponds to two virtual CPUs in the cloud. Keep that in mind, because people will often do a quick test, saying "I have a four-core physical machine, so I'll run four cores in the cloud," and then ask why the performance isn't the same.
the n1-standard-2 machine described as 2 virtual CPUs and 7.5 GB of memory essentially means 1 CPU core, right?
I believe, yes.
So if I'm trying to follow hardware recommendations for an InfluxDB instance that recommends 2 CPU cores, then I should aim for a Google Compute Engine machine that has 4vCPUs, correct?
I think they mean 2 physical cores regardless of hyper-threading (HT), because HT performance is not a stable reference.
But IMO the recommendation should also specify the speed of each physical core.
If the software recommends 2 CPU cores, you need 4 vCPUs on GCP.
https://cloud.google.com/compute/docs/cpu-platforms says:
On Compute Engine, each virtual CPU (vCPU) is implemented as a single hardware multithread on one of the available CPU processors. On Intel Xeon processors, Intel Hyper-Threading Technology supports multiple app threads running on each physical processor core. You configure your Compute Engine VM instances with one or more of these multithreads as vCPUs. The specific size and shape of your VM instance determines the number of its vCPUs.
Long ago and far away, there was a 1 to 1 equivalence between a 'CPU' (such as what one sees in the output of "top"), a socket, a core, and a thread. (And "processor" and/or "chip" too if you like.)
So, many folks got into the habit of using two or more of those terms interchangeably. Particularly "CPU" and "core."
Then CPU designers started putting multiple cores on a single die/chip. So a "socket" or "processor" or "chip" was no longer a single core, but a "CPU" was still 1 to 1 with a "core." So, interchanging those two terms was still "ok."
Then CPU designers started putting multiple "threads" (eg hyperthreads) in a single core. The operating systems would present each hyperthread as a "CPU" so there was no longer a 1 to 1 correspondence between "CPU" and "thread" and "core."
And, different CPU families can have different numbers of threads per core.
But referring to "cores" when one means "CPUs" persists.
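To make the terminology concrete, a short sketch (psutil is a third-party package, not something the thread mentions):

# Sketch: the OS counts logical threads as "CPUs", not physical cores.
import os
import psutil  # third-party: pip install psutil

print("logical CPUs (hardware threads):", os.cpu_count())
print("physical cores:", psutil.cpu_count(logical=False))
# With hyper-threading the first number is typically twice the second;
# on Compute Engine each hardware thread is sold as one vCPU.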

CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN not supported in the Intel/AMD OpenCL CPU runtime

Conditions:
I installed the AMD OpenCL SDK, version AMD-APP-SDK-v2.8-lnx64, and the Intel OpenCL SDK, version intel_sdk_for_ocl_applications_xe_2013_r2_sdk_3.1.1.11385_x64 (version identification couldn't be more complex), following their documentation, on an HPC server with dual-socket Xeon E5-2650 CPUs, a Xeon Phi coprocessor, 64 GB host memory, and Red Hat Enterprise Linux Server 6.4.
Problem description:
I would like to do device fission with OpenCL to get around the NUMA issue. Unfortunately the device (Intel CPU), or maybe the Linux kernel, doesn't seem to support CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN. I tried both Intel OpenCL and AMD OpenCL. Although the AMD OpenCL device query says it supports the affinity-domain option, it actually doesn't: when I run code using CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN, clCreateSubDevices() returns error code -30 (CL_INVALID_VALUE). I guess this is a bug in the current Intel OpenCL driver, according to a forum post.
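For reference, here is a minimal sketch of the failing call through the third-party PyOpenCL bindings (the post doesn't say which language was used; the C call clCreateSubDevices is equivalent):

# Sketch (PyOpenCL): request device fission along NUMA affinity domains.
import pyopencl as cl

platform = cl.get_platforms()[0]
device = platform.get_devices(device_type=cl.device_type.CPU)[0]

try:
    subs = device.create_sub_devices(
        [cl.device_partition_property.BY_AFFINITY_DOMAIN,
         cl.device_affinity_domain.NUMA])
    print("sub-devices:", [d.max_compute_units for d in subs])
except cl.Error as err:
    # On the runtimes described above this fails, matching the -30
    # (CL_INVALID_VALUE) returned by clCreateSubDevices.
    print("device fission failed:", err)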
Potential solution:
I thought that if I could select the first 16 parallel compute cores (8 cores + 8 hyper-threads) out of the total 32, those would map to the first socket. Unfortunately Intel OpenCL randomly distributes the 16 parallel compute cores across the 32 hardware threads. AMD OpenCL, on the other hand, does select the first 16 parallel compute cores, but its OpenCL compiler does a poor job on the kernel I'm running. So the no-free-lunch theorem applies here as well.
Questions:
Is there any way to specify which parallel compute cores OpenCL should use for its computations?
Is there any way to overcome this NUMA issue with OpenCL?
Any comments regarding experiences with NUMA affinity are welcome.
Thank you!
UPDATE
A partial workaround, only applicable to single-socket testing:
(In Linux) Disable all cores of one NUMA node, so the OpenCL ICD can only choose from the hardware threads of the other NUMA node. E.g. on a 2-socket system with 32 hardware threads:
sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu31/online"
....
sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu16/online"
I'm not sure this hack is free of side effects, but so far it seems to work (at least for testing).
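The same workaround, scripted (assuming, as in the example above, that CPUs 16-31 belong to the second NUMA node; must run as root, and writing "1" brings a core back online):

# Sketch: take CPUs 16-31 offline so only NUMA node 0's threads remain.
from pathlib import Path

for n in range(16, 32):
    # Equivalent to: echo 0 > /sys/devices/system/cpu/cpuN/online
    Path(f"/sys/devices/system/cpu/cpu{n}/online").write_text("0")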
