CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN not supported in the Intel/AMD OpenCL CPU runtime - multithreading

Conditions:
I installed AMD OpenCL (AMD-APP-SDK-v2.8-lnx64) and Intel OpenCL (intel_sdk_for_ocl_applications_xe_2013_r2_sdk_3.1.1.11385_x64; version identification couldn't be more complex) according to their instructions, on an HPC server with dual-socket Xeon E5-2650 CPUs, a Xeon Phi coprocessor, 64 GB of host memory, and Red Hat Enterprise Linux Server 6.4.
Problem description:
I would like to use device fission in OpenCL to work around the NUMA issue. Unfortunately, the device (the Intel CPU) or perhaps the Linux kernel doesn't seem to support CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN. I tried both the Intel and the AMD OpenCL runtimes. Although the AMD OpenCL device query claims to support the affinity domain option, it actually doesn't: when I run code using CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN, the clCreateSubDevices() function returns error code -30 (CL_INVALID_VALUE). According to a forum post, this is likely a bug in the current Intel OpenCL driver.
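For reference, the failing call looks roughly like this (a sketch, not the exact application code; 'device' is assumed to be a valid cl_device_id for the CPU). It is already this counting query that comes back with -30:

#include <CL/cl.h>
#include <stdio.h>

/* Sketch: try to split a CPU device along NUMA boundaries. */
void try_numa_fission(cl_device_id device)
{
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
        CL_DEVICE_AFFINITY_DOMAIN_NUMA,
        0
    };
    cl_uint num_subdevices = 0;
    /* The first call only asks how many sub-devices would be created. */
    cl_int err = clCreateSubDevices(device, props, 0, NULL, &num_subdevices);
    if (err != CL_SUCCESS)   /* returns -30 (CL_INVALID_VALUE) here */
        printf("clCreateSubDevices failed: %d\n", err);
    else
        printf("device splits into %u NUMA sub-devices\n", num_subdevices);
}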
Potential solution:
I thought that if I could select the first 16 parallel compute cores (8 cores + 8 hyper-threads, out of 32 in total), those would map to the first socket. Unfortunately, Intel OpenCL distributes the 16 parallel compute cores randomly across all 32, while AMD OpenCL does select the first 16, but its OpenCL compiler does a poor job on the kernel I'm running. So the no-free-lunch theorem applies here as well. A sketch of this approach follows.
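(Sketch of what I mean by selecting 16 compute cores; 'device' is again assumed to be the CPU's cl_device_id. OpenCL 1.2's CL_DEVICE_PARTITION_BY_COUNTS lets you request a sub-device with a given number of compute units, but the spec doesn't say which physical cores the runtime picks, which is exactly the problem:)

#include <CL/cl.h>

/* Sketch: carve a 16-compute-unit sub-device out of the 32-thread CPU,
   hoping it maps to the first socket (the spec gives no such guarantee). */
cl_device_id make_16cu_subdevice(cl_device_id device)
{
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS,
        16,                                   /* compute units to take */
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
        0
    };
    cl_device_id subdevice = NULL;
    cl_uint num_returned = 0;
    clCreateSubDevices(device, props, 1, &subdevice, &num_returned);
    return subdevice;                         /* NULL on failure */
}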
Questions:
Is there any way to specify which parallel compute cores OpenCL should use for the computations?
Is there any way to overcome this NUMA issue with OpenCL?
Any comments regarding experiences with NUMA affinity are welcome.
Thank you!
UPDATE
A partial workaround, only applicable to single-socket testing:
(On Linux) Disable all cores of one NUMA node, so that the OpenCL ICD can only choose from the hardware threads of the other NUMA node. E.g., on a two-socket system with 32 hardware threads:
for i in $(seq 16 31); do
    sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu$i/online"
done
I'm not sure whether this hack has side effects, but so far it seems to work (at least for testing). Writing 1 back to the same files brings the cores online again.

Related

Emulating a heterogeneous system, like an ARM processor with P and E cores

I'm trying to emulate a processor that consists of cores with different maximum frequencies per core, like ARM processors or newer Intel processors that have a couple of Performance cores and Efficiency cores.
I tried it with QEMU, but I didn't get far: the only thing I found was qemu-system-aarch64, where you can configure the cores per die and the die count via its NUMA options, but I didn't find a possibility to change the frequency or the core architecture for a specific die. Is it even possible with QEMU, or is there an alternative? Preferably the emulation should be able to run Linux.
For clarification: I'm trying to show that on a heterogeneous system, i.e. a processor with different core speeds, a certain framework works better than another one.
Thanks to Nate, I found Intel Simics, which is able to simulate heterogeneous systems.

Will updating the Ubuntu 22.04 kernel on an Alder Lake processor resolve parallelism problems without breaking my Nvidia drivers?

I recently bought a Dell XPS 15 with an i9-12900HK and installed Ubuntu 22.04 LTS as the OS.
I wrote a simple OpenMP program that should have linear speedup in the number of threads (the code is correct, because it behaves as expected when I run it on a cluster), but on my laptop the speedup stops at 6 threads, even though my processor has 20 threads. I did some research and read that kernel 5.15 is not optimised for the latest Intel processors because it schedules P and E cores poorly.
But I also read that it may be risky to update the kernel to a newer version like 5.17 or 5.18, because my RTX 3050 Ti drivers may not be compatible with that kernel.
Can I update the kernel safely? Will it resolve my parallelism problem? What method should I use to update my kernel?
I tried looking through forums and docs, but much of the available documentation is from third parties and I don't know whether I can trust it.
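(For context, a minimal sketch of the kind of benchmark meant above; this is an illustration, not the asker's actual code. With a fixed amount of independent work, the runtime should drop roughly linearly with the thread count on identical cores; with mixed P and E cores and static scheduling, the slower E cores hold the whole loop back.)

#include <cstdio>
#include <omp.h>

// Compile with: g++ -O2 -fopenmp bench.cpp
int main() {
    for (int t = 1; t <= 16; t *= 2) {
        double start = omp_get_wtime();
        double sum = 0.0;
        // Fixed total work split across t threads: ideally time ~ 1/t.
        #pragma omp parallel for num_threads(t) reduction(+:sum)
        for (long i = 1; i <= 400000000L; ++i)
            sum += 1.0 / i;
        printf("%2d threads: %6.2f s (sum=%f)\n",
               t, omp_get_wtime() - start, sum);
    }
}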

Inconsistency in use of multiple threads numpy/LAPACK/eigvalsh in different machines

I need to diagonalize complex (Hermitian) matrices of dimension > 2000 using numpy.linalg.eigvalsh. On one computer, top shows that numpy is multithreading, while on the other it shows a single thread. Both computers have essentially identical OSes (Arch Linux, Python 3.10), and the output of numpy.show_config() is absolutely identical on both machines. The machine on which I see multithreading is a laptop with 16 GB RAM and an i7-8550U @ 1.80GHz CPU (4 physical cores). The one on which I don't has an Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz (48 physical cores) and 180 GB RAM. Is this behaviour expected? What am I missing? Thanks!

Behavior of DX11::CreatePixelShader during multithreaded calls

Calling CreatePixelShader from multiple threads results in a sharp performance drop.
The shader code passed in is pre-compiled, with no debug information included.
The bytecode is somewhat large (100 KB), but a single-threaded call takes less than 1 ms, while a multithreaded call may take more than 10 ms.
Should CreatePixelShader not be called from multiple threads?
Or does it depend on the graphics driver it is running on?
Development environment:
OS -- Windows 10 Pro
GPU -- NVIDIA GeForce RTX 2070 Super
CPU -- AMD Ryzen 7 3700X 8-Core Processor, 3.59 GHz
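(For what it's worth, the D3D11 threading model makes ID3D11Device methods free-threaded, so concurrent CreatePixelShader calls are legal; whether they scale is up to the driver. A minimal sketch of such a concurrent-creation test; shader.cso is a hypothetical pre-compiled bytecode file, and error handling is omitted:)

#include <d3d11.h>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>
#pragma comment(lib, "d3d11.lib")

int main() {
    // Create a bare device; ID3D11Device (unlike the immediate context)
    // is documented as free-threaded.
    ID3D11Device* dev = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      nullptr, 0, D3D11_SDK_VERSION, &dev, nullptr, nullptr);

    // Load pre-compiled shader bytecode (hypothetical file name).
    std::ifstream f("shader.cso", std::ios::binary);
    std::vector<char> bytecode{std::istreambuf_iterator<char>(f),
                               std::istreambuf_iterator<char>()};

    // Issue the creation calls concurrently: legal per the API contract,
    // but a driver may serialize the work internally.
    std::vector<ID3D11PixelShader*> shaders(8, nullptr);
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)
        workers.emplace_back([&, i] {
            dev->CreatePixelShader(bytecode.data(), bytecode.size(),
                                   nullptr, &shaders[i]);
        });
    for (auto& t : workers) t.join();

    for (auto* s : shaders) if (s) s->Release();
    dev->Release();
}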

Set number of cores in OpenMP

I'm running my program on an Intel® Xeon® Processor E5-1650 v3
http://ark.intel.com/products/82765/Intel-Xeon-Processor-E5-1650-v3-15M-Cache-3_50-GHz
The processor has 6 CPUs (6 cores), and I'm trying to set the number of CPUs my program uses. My application uses OpenMP.
I'm not trying to set the number of threads, but the number of CPUs. How can I do that?
Have you tried the environment variables that control thread affinity?
If you're compiling your code with gcc, you may want to use GOMP_CPU_AFFINITY or OMP_PLACES.
For the Intel compilers, there are KMP_AFFINITY and KMP_PLACE_THREADS; see the Intel documentation.
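(A minimal sketch for verifying the pinning; Linux-specific via glibc's sched_getcpu(), and the environment variable values below are just illustrative for a 6-core machine:)

#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), glibc/Linux only

// Compile with: g++ -O2 -fopenmp where.cpp
// Run e.g. as:
//   GOMP_CPU_AFFINITY="0-5" OMP_NUM_THREADS=6 ./a.out    (gcc/libgomp)
//   KMP_AFFINITY="granularity=core,compact" ./a.out      (Intel compiler)
// and check that every thread reports a CPU from the intended set.
int main() {
    #pragma omp parallel
    printf("thread %d runs on CPU %d\n", omp_get_thread_num(), sched_getcpu());
}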
