Behavior of ID3D11Device::CreatePixelShader during multithreaded calls

Calling CreatePixelShader from multiple threads results in a sharp performance drop.
The shader bytecode passed in is pre-compiled, with no debug information included.
The bytecode is somewhat large (about 100 KB), but a single-threaded call takes less than 1 ms, while the same call made from multiple threads can take more than 10 ms.
Should CreatePixelShader not be called from multiple threads?
Or does it depend on the graphics driver it is running on?
Development environment
OS -- Windows 10 Pro
GPU -- NVIDIA GeForce RTX 2070 Super
CPU -- AMD Ryzen 7 3700X 8-Core Processor, 3.59 GHz
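For reference, a minimal sketch of the kind of multithreaded creation described above (assuming device is a valid ID3D11Device* and each entry in blobs holds pre-compiled bytecode; ID3D11Device creation methods are documented as free-threaded, although a driver may still serialize them internally):

#include <d3d11.h>
#include <thread>
#include <vector>

struct ShaderBlob { const void* data; size_t size; };   // pre-compiled bytecode

// Hypothetical helper: creates one pixel shader per blob, one thread per blob.
std::vector<ID3D11PixelShader*> CreateShadersParallel(
    ID3D11Device* device, const std::vector<ShaderBlob>& blobs)
{
    std::vector<ID3D11PixelShader*> shaders(blobs.size(), nullptr);
    std::vector<std::thread> workers;
    for (size_t i = 0; i < blobs.size(); ++i) {
        workers.emplace_back([&, i] {
            // No external lock: ID3D11Device is free-threaded for Create* calls.
            device->CreatePixelShader(blobs[i].data, blobs[i].size,
                                      nullptr, &shaders[i]);
        });
    }
    for (auto& t : workers) t.join();
    return shaders;
}

Timing this helper against a simple single-threaded loop over the same blobs is one way to see whether the slowdown comes from the driver serializing the calls.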

Related

Will updating the Ubuntu 22.04 kernel on an Alder Lake processor resolve parallelism problems without breaking my Nvidia drivers?

I recently bought a Dell XPS 15 with an i9-12900HK and installed Ubuntu 22.04 LTS as the OS.
I wrote a simple OpenMP program that should show linear speedup in the number of threads (the code is correct, since it behaves as expected when I run it on a cluster), but on my laptop the speedup stops at 6 threads, even though my processor has 20 threads. I did some research and read that kernel 5.15 is not optimised for the latest Intel processors because it makes poor use of P and E cores.
But I also read that it may be risky to update the kernel to a newer version like 5.17 or 5.18 because my RTX 3050 Ti drivers may not be compatible with it.
Can I update the kernel safely? Will it resolve my parallelism problem? What method should I use to update the kernel?
I tried looking at forums and docs, but much of the available documentation is from third parties and I don't know whether I can trust it.
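A scaling test of the kind described above would look roughly like this (a hypothetical sketch, not the original program; compile with g++ -O2 -fopenmp):

#include <cstdio>
#include <omp.h>

int main() {
    const long long N = 200000000;   // arbitrary amount of independent work
    for (int threads = 1; threads <= omp_get_max_threads(); ++threads) {
        double sum = 0.0;
        double t0 = omp_get_wtime();
        // Each thread gets an equal share of the iterations.
        #pragma omp parallel for num_threads(threads) reduction(+:sum)
        for (long long i = 0; i < N; ++i)
            sum += 1.0 / (1.0 + i);
        double t1 = omp_get_wtime();
        std::printf("%2d threads: %.3f s (sum=%f)\n", threads, t1 - t0, sum);
    }
    return 0;
}

If the scheduler places the extra threads poorly, the reported times stop improving well before the hardware thread count, which matches the symptom described above.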

How to utilize the High Performance cores on Apple Silicon

I have developed a macOS app which relies heavily on multithreading (a call center simulator). It runs fine on my iMac 2019 and fills up all cores nicely. In my test scenario it simulates approximately 1.4 million telephone calls in total over 100 iterations, each iteration submitted as a dispatch item on a concurrent dispatch queue.
Now I have bought a new Mac mini with M1 Apple Silicon and I was eager to see how the performance develops on that test machine. Well, it’s not bad but not as good as I expected:
System -- Duration
iMac 2019, Intel 6-core i5, 3.0 GHz, macOS Catalina 10.15.7 -- 19.95 s
Mac mini, M1 8-core, macOS Big Sur 11.2, Rosetta 2 -- 26.85 s
Mac mini, M1 8-core, macOS Big Sur 11.2, native ARM -- 17.07 s
Investigating a little further, I noticed that at the start of the simulation all 8 cores of the M1 Mac are loaded properly, but after a few seconds only the 4 high-efficiency cores are still in use.
I have read the Apple doc "Optimize for Apple Silicon with performance and efficiency cores" and double-checked that the dispatch queue for the iterations is set up properly:
let simQueue = DispatchQueue.global(qos: .userInitiated)
But no success. After a few seconds of running, the high-performance cores are apparently no longer utilized. I even tried setting up the queue with qos set to .userInteractive, but that didn't help either. I also flagged the dispatch items with the proper qos, but that didn't change anything. It looks to me like other apps (e.g. Xcode) do utilize the high-performance cores even for a longer time.
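For reference, GCD's QoS classes correspond to the Darwin pthread QoS API; requesting the same class directly from C++ looks roughly like this (a sketch, not the app's actual code; macOS only):

#include <pthread.h>
#include <pthread/qos.h>

static void* worker(void*) {
    // ... one iteration of the simulation would run here ...
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // Request the highest user-facing QoS class for this thread.
    pthread_attr_set_qos_class_np(&attr, QOS_CLASS_USER_INTERACTIVE, 0);

    pthread_t t;
    pthread_create(&t, &attr, worker, nullptr);
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}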
Does anybody know how to force an M1 Mac to utilize the high-performance cores?
"M1 8-core" is really "M1 4 performance + 4 power-saving cores". I would expect it to deliver a bit more performance than an Intel 6-core, but not much. That is exactly what you see: about 15% faster than six Intel cores, or roughly as fast as seven Intel cores would be. The current M1 chips are low-end processors. "A bit better than an Intel six-core" is quite good.
Your code must be running on the performance cores, otherwise there would be no chance at all of coming close to the Intel performance. Nothing in that CPU graph tells you which cores are actually being used.
What most likely happens is that all cores start running, each trying to do one eighth of the work; after about 8 seconds the performance cores have finished their share, and the work remaining on the power-saving cores then moves over to the performance cores. You are simply misinterpreting the image as only the low-performance cores doing the work.
I would guess that Apple has put a preference on using efficiency cores over performance cores for several reasons, battery life being one and most likely thermals as well. This is the big question mark with an SoC that was originally designed for smartphones and tablets: macOS is a much heavier OS than iOS or iPadOS. Apple most likely felt that the performance cores were only needed where maximum throughput is required. No doubt some of us with an M1 Mac mini, myself included, would like a way to adjust this balance between efficiency and performance cores. Personally, I would prefer that all cores be capable of switching between efficiency and performance, as with Intel's Speed Shift technology. This may come along as the M1 design advances into the Mac Pro and other Pro models.

Performance comparison between a shared cluster and a laptop with an Intel Core i7

I am not really familiar with shared clusters, but I am assuming that, for completing a single task, performance should not differ much from a laptop processor. I have a C++ code which I ran on my laptop with an Intel Core i7-4558U 2.80 GHz CPU and 16.0 GB RAM, running 64-bit Windows 10. On the other hand, I have results for the same code from a publication, based on tests conducted on a shared cluster with an Intel Xeon 2.3 GHz CPU and a 4 GB memory limit under Linux. The program uses CPLEX as the solver: my laptop has IBM CPLEX 12.7 whereas the previous runs used IBM CPLEX 12.4 (CPLEX, 2012). My runs seem to take about 300 times longer than the reported results of the previous run. Does this much difference make sense? If so, what could be driving it?
This could be attributed to performance variability (see, for example, section 5 of the MIPLIB 2010 paper here). In a nutshell, minor differences in problem formulation (e.g., order of constraints, input format, etc.), or running on different platforms, can have a large effect on the time to solve. With CPLEX 12.7, you can use the interactive optimizer to help you evaluate variability.
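If you want to probe the variability yourself from C++, one way is to re-solve the same model with different random seeds through the Concert API (a hedged sketch; "model.lp" is a hypothetical exported model file):

#include <ilcplex/ilocplex.h>
ILOSTLBEGIN

int main() {
    IloEnv env;
    try {
        IloModel model(env);
        IloCplex cplex(env);
        IloObjective obj(env);
        IloNumVarArray vars(env);
        IloRangeArray rngs(env);
        cplex.importModel(model, "model.lp", obj, vars, rngs);
        cplex.extract(model);

        // Re-solve with several seeds; a wide spread in times indicates
        // high performance variability for this instance.
        for (int seed = 1; seed <= 5; ++seed) {
            cplex.setParam(IloCplex::Param::RandomSeed, seed);
            IloNum t0 = cplex.getCplexTime();
            cplex.solve();
            env.out() << "seed " << seed << ": "
                      << cplex.getCplexTime() - t0 << " s" << endl;
        }
    } catch (IloException& e) {
        env.out() << "Concert exception: " << e << endl;
    }
    env.end();
    return 0;
}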

CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN not supported in the Intel/AMD OpenCL CPU runtime

Conditions:
I installed AMD OpenCL version AMD-APP-SDK-v2.8-lnx64 and Intel OpenCL version intel_sdk_for_ocl_applications_xe_2013_r2_sdk_3.1.1.11385_x64 (version identification couldn't be more complex) according to the description, on an HPC server with dual-socket Xeon E5-2650 CPUs, a Xeon Phi coprocessor, 64 GB host memory and Red Hat Enterprise Linux Server 6.4.
Problem description:
I would like to do device fission with OpenCL to get around the NUMA issue. Unfortunately the device (Intel CPU), or maybe the Linux kernel, doesn't seem to support CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN. I tried both Intel OpenCL and AMD OpenCL. Although the AMD OpenCL device query says it supports the affinity-domain option, it actually doesn't: when I try to run code with CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN, the clCreateSubDevices() function returns error code -30 (CL_INVALID_VALUE). I guess this is a bug in the current Intel OpenCL driver, according to a forum post.
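For reference, the failing call looks roughly like this (a sketch, assuming cpu_dev is the CPU cl_device_id obtained from clGetDeviceIDs; error -30 is CL_INVALID_VALUE):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

void try_numa_fission(cl_device_id cpu_dev) {
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN,
        CL_DEVICE_AFFINITY_DOMAIN_NUMA,
        0
    };
    // First query how many sub-devices the partition would produce.
    cl_uint count = 0;
    cl_int err = clCreateSubDevices(cpu_dev, props, 0, nullptr, &count);
    if (err != CL_SUCCESS) {
        std::printf("clCreateSubDevices (query) failed: %d\n", err);
        return;
    }
    // Then actually create one sub-device per NUMA node.
    std::vector<cl_device_id> subs(count);
    err = clCreateSubDevices(cpu_dev, props, count, subs.data(), nullptr);
    std::printf("created %u sub-devices, err = %d\n", count, err);
}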
Potential solution:
I thought that if I could select the first 16 parallel compute cores (8 cores + 8 hyper-threads) out of the total 32, those would map to the first socket. Unfortunately Intel OpenCL randomly distributes the 16 parallel compute cores across the 32 hardware threads. AMD OpenCL, on the other hand, does select the first 16 parallel compute cores, but its OpenCL compiler does a poor job on the kernel I'm running. So the no-free-lunch theorem applies here as well.
Questions:
Is there any way to specify which parallel compute cores OpenCL should use for the computations?
Is there any way to overcome this NUMA issue with OpenCL?
Any comments regarding experiences with NUMA affinity are welcome.
Thank you!
UPDATE
A partial workaround, only applicable to single-socket testing:
(On Linux) Disable all cores of one NUMA node, so the OpenCL ICD can only choose from the hardware threads of the other NUMA node. E.g. on a 2-socket system with 32 hardware threads:
sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu31/online"
....
sudo sh -c "echo 0 > /sys/devices/system/cpu/cpu16/online"
I'm not sure whether this hack has side effects, but so far it seems to work (at least for testing).

GPU vs CPU? Number of cores/threads in a GPU for program calculation acceleration?

I need some help understanding the concept of cores on a GPU vs. cores in a CPU for the purpose of doing parallel calculations.
When it comes to cores in a CPU, it seems pretty simple. I have a super intensive "for" loop that iterates four times. I have four cores in my Intel i5 2.26 GHz CPU. I give one loop to each core. Each of the four loops is independent of the others. Boom - I now have four threads created and 100% CPU usage (instead of 25% CPU usage with only one core). My "for" loop now runs almost four times faster than it would have if I had not parallelized it. By the way, for the "for" loop, I was using the auto-parallelization available in Microsoft Visual Studio 2012, as in this online example: (http://msdn.microsoft.com/en-us/library/hh872235.aspx).
In contrast, I don't even know the number of cores in my laptop's GPU (Intel Graphics Media Accelerator HD, or Intel HD Graphics, with 1696 MB shared memory) that I can use for parallel calculations. I don't even know a valid way of comparing the GPU to the CPU. When I see "12#500MHz" next to my graphics card description, I wonder if that means the graphics card has 12 cores for parallelization that can work kind of like the 4 cores in a CPU, except that the GPU cores run at 500 MHz [slow] instead of 2.26 GHz [fast]? Is there a GPU usage figure comparable to the CPU usage in Windows Task Manager? I'm an utter novice trying to use the C++ library in Visual Studio 2012, if that makes any difference. When I write the actual GPU software, the parallelization code looks like this: (http://msdn.microsoft.com/en-us/library/hh265137.aspx).
So, would you please fill in some of the gaps or correct mistakes in my knowledge, or help me compare the two? I don't need a super complicated answer; something as simple as "You can't compare a CPU core with a GPU core because of blankity blank" or "a GPU core isn't really a core like a CPU core is" would be very much appreciated.
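If the GPU code is written with C++ AMP (the technology used in the Visual Studio 2012 GPU parallelization examples), a minimal sketch looks like this (hypothetical data; needs Visual C++'s <amp.h> and a DirectX 11 capable accelerator):

#include <amp.h>
#include <vector>

// Doubles every element of v on the default accelerator (the GPU, if available).
void double_on_gpu(std::vector<int>& v) {
    concurrency::array_view<int, 1> av(static_cast<int>(v.size()), v);
    concurrency::parallel_for_each(av.extent,
        [=](concurrency::index<1> idx) restrict(amp) {
            av[idx] *= 2;   // each GPU thread handles one element
        });
    av.synchronize();       // copy results back to host memory
}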
First, the OS uses more cores only if you ask for them in your code. Try using OpenMP or Win32 threads to achieve parallelism on your i5 (a minimal OpenMP sketch is shown below).
Second, CPU clock speeds are higher than GPU clock speeds. If a GPU were clocked as high as a CPU, you could use it as a stove to cook on. On the other hand, a GPU has many more cores than a CPU. Also keep in mind that a thread and a core are not the same thing.
Third, I recommend reading the specifications and reference manuals for your CPU and GPU. And don't forget PCI-e: moving data across it is often the bottleneck in GPU parallel-programming implementations.
Hope this clarifies your doubts. Any more questions, feel free to ask.
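As a minimal illustration of the OpenMP suggestion above (a sketch with a made-up loop body; enable with /openmp in Visual Studio or -fopenmp with GCC/Clang):

#include <vector>
#include <omp.h>

int main() {
    const int n = 1 << 24;
    std::vector<double> data(n, 1.0);

    // The four i5 cores each get a chunk of the iterations automatically.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] = data[i] * 2.0 + 1.0;   // independent per-iteration work

    return 0;
}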
