CUDA performance penalty when running in Windows (versus Linux)

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: For whatever reason, the Windows Nvidia driver (version 331.65) does not immediately dispatch a CUDA kernel when invoked via the runtime API.
To illustrate the problem I profiled the mergeSort application (from the examples that ship with CUDA 5.5).
Compare the kernel launch times in the profiler output, first when running in Linux and then when running in Windows: the Windows launches take dramatically longer.
This post suggests the problem might have something to do with the Windows driver batching the kernel launches. Is there any way I can disable this batching?
I am running with a GTX 690 GPU, Windows 7, and version 331.65 of the Nvidia driver.

There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.
As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.
The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but it is only supported on Tesla GPUs and certain members of the Quadro family of GPUs -- i.e., not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)
AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver. Unofficially, however, according to Greg@NV in this link, the command to issue after the CUDA kernel call is cudaEventQuery(0);, which may/should cause the WDDM batch queue to "flush" to the GPU.
As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.
EDIT: Moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);.
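A minimal sketch of what this looks like in practice (the kernel name, launch configuration, and stream here are hypothetical placeholders, not part of the original question):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... work ... */ }

void launchAndFlush(float *d_data, cudaStream_t stream)
{
    // Under WDDM the driver may hold this launch in a batch queue
    // rather than dispatching it to the GPU immediately.
    myKernel<<<256, 128, 0, stream>>>(d_data);

    // Unofficial ways to nudge the driver into flushing the batch queue:
    cudaEventQuery(0);        // older recommendation
    cudaStreamQuery(stream);  // newer (2016), lower-impact recommendation
}
```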
EDIT2: Using recent drivers on Windows, you should be able to place Titan-family GPUs in TCC mode, assuming you have some other GPU set up for the primary display. The nvidia-smi tool will allow you to switch modes (run nvidia-smi --help for more info).
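As a sketch, the mode switch itself looks like this (GPU index 0 assumed; it requires administrator rights, a TCC-capable GPU, and a reboot to take effect):

```shell
# Switch GPU 0 from WDDM to TCC (run from an elevated prompt, then reboot)
nvidia-smi -i 0 -dm 1

# Switch back to WDDM
nvidia-smi -i 0 -dm 0
```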
Additional info about the TCC driver model can be found in the Windows installation guide, including the fact that it may reduce the latency of kernel launches.
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.

Even though it's been almost 3 years since this issue was active, I still consider it necessary to share my findings.
I've been in the same situation: the same CUDA program took 5 ms under Ubuntu (CUDA 8.0) but over 30 ms under Windows 10 (CUDA 10.1), both on a GTX 1080 Ti.
However, on Windows, when I compiled with nvcc from the command line instead of building in Visual Studio, the program suddenly ran at the same speed as the Linux one.
This suggests that the problem may come from Visual Studio.

Related

HoloLens 2 Emulator visual updates extremely slow

I installed the latest version of the HoloLens 2 Emulator (10.0.20348.1501) on my Windows 10 Pro machine. I have 32GB of RAM, 11th Gen Intel 8 Core CPU, Nvidia 3080 (mobile) graphics card.
Initially I thought that the HoloLens emulator was super slow (an input such as trying to move the pointer can take 10, 20, 30 seconds to show up and sometimes doesn't even show up).
But upon testing some more, I've realized that my inputs are going through immediately (as I can tell from the sound feedback), it's just the visual feedback which is not updating. This testing is just inside the OS (without trying to launch an app I developed).
Any ideas what could be going on? In the performance monitoring tool, everything looks fine.
In the end, the only way to fix it was to disable graphics switching in the BIOS and set it to Discrete only, despite the fact that the Nvidia GPU activity indicator shows the GPU turning on when I launch the emulator.
If the emulator takes 10 seconds to update the graphics, there are likely configuration issues. Based on my test, though I cannot say it works fluently on my PC, the HoloLens 2 emulator runs at around 15 fps. There is some delay, but it should work fine for testing. (I am running it with an Nvidia 1080 (mobile) and a much older CPU than yours.)
Please check the document on Using the HoloLens Emulator - Mixed Reality | Microsoft Docs and make sure you have configured your computer properly.
In BIOS
Intel VT -> enabled
Intel VT-d -> disabled
Hardware-based Data Execution Prevention (DEP) (or any Intel data protection related feature, display name could be varied) -> disabled
In Windows
After the BIOS configuration is done, completely shut down your PC, then boot. (A direct reboot may not apply the changes.)
Run dxdiag to check:
DirectX 11.0 or later (12.0 in my PC)
WDDM 2.5 graphics driver or later (3.0 in my PC)
Hyper-V check
Enable Hyper-V if it is not enabled. A reboot is required.
If it is already enabled: disable it -> reboot the PC -> enable it again -> reboot.
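On recent Windows versions the Hyper-V toggle can also be done from an elevated command prompt; a sketch using DISM (reboot between the disable and enable steps):

```shell
# Check whether the Hyper-V feature is currently enabled
dism /online /get-featureinfo /featurename:Microsoft-Hyper-V

# Disable it, reboot, then re-enable it and reboot again
dism /online /disable-feature /featurename:Microsoft-Hyper-V-All
dism /online /enable-feature /featurename:Microsoft-Hyper-V-All /all
```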
Others
For a laptop, make sure the power supply is plugged in and the machine is not in power-save mode. Check the GPU load (around 36% on my Nvidia 1080 mobile).
Then you may run the emulator again to see if this issue still exists.

Nvidia display driver stops working frequently

I have dual-booted Windows 7 and Ubuntu 14.04 on my PC.
I have a recurring problem with Windows.
The screen frequently goes blank for a few seconds, showing an error message in a popup notification:
"Display driver stopped responding and has recovered. Display driver NVIDIA windows kernel mode driver version 266.58 stopped responding and has successfully recovered."
Here are my computer specifications:
Intel Core i5 processor,
4 GB RAM,
Nvidia GeForce 210 graphics card.
I updated the drivers on my computer.
I also formatted my PC, but the problem still persists.
Now the problem is worse, and Windows shuts down within a few minutes of starting.
Today, Ubuntu also started randomly freezing, a symptom which had not presented itself until now.
As Astor139 said:
Honestly, this particular question doesn't fit Stack Overflow, since it isn't strictly programming related. (As far as I can tell, you have a hardware issue.) Since it persists across two different OSes with very different architectures, I would say you need a new GPU. An Nvidia GT 730 is under $50 USD and would be a suitable replacement/upgrade for your GeForce 210.
Posted because his comment really is a suitable answer.

Is IIS blocking calls to CUDA from my web app?

I have an ASP.NET MVC 4 x64 web app that does some calculations in the background and returns some numbers to be rendered in the browser. All works fine in Visual Studio, but when it is called from the project folder in the browser via IIS, I get a CudaErrorNoDevice. This is error number 38, so it does look like the app is referencing all the external CUDA DLLs correctly, making the call, and getting the error back.
For testing I'm using the GetDeviceProperties() method.
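For reference, the equivalent check via the native runtime API looks roughly like this (a sketch; the GetDeviceProperties() in the question is presumably a managed wrapper around the same calls):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Probe GPU visibility from whatever process/session hosts this code.
// If the hosting service has no access to the (WDDM) device, the very
// first runtime call comes back as cudaErrorNoDevice (38).
int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount: error %d (%s)\n",
                    (int)err, cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}
```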
I even plugged the GPU into a display, just in case the browser got confused about the CUDA call being for graphics. No luck, though.
Can anyone confirm that calling the GPU from a web app is a perfectly doable thing? And if so, is there any specific configuration needed in IIS for GPUs?
Thanks
IIS 8 Express, VS2012, CUDA 5.0, GTX Titan (this is a second GPU; a GTX 660 handles the display).
It's possible that IIS is running at a service level that does not have access to the GPU (which is a WDDM device in this scenario.)
The usual suggestion would be to switch the GPU device to be in TCC mode (possible with most Quadro and Tesla GPUs), but that is not possible with a GeForce GPU (both of yours are GeForce GPUs).
As an alternative workaround, you may wish to try the method described here.
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.

Linux stuck in CPU soft lockup?

My system is a CentOS 6.3 (running Kernel version 2.6.32-279.el6.x86_64).
I have a loadable kernel module which is a driver that manages a PCIe card.
If I manually insert the driver using insmod while the OS is up and running, the driver loads successfully and is operational.
However, if I try to install the driver using rpm and then reboot the system, during startup the OS gets stuck spitting out the following "soft lockup" message for ALL the CPU cores, except for one core that is in "soft lockup" in one of the threads created by my driver.
BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).
I searched the net quite a bit for info on this kernel message/bug; there are quite a few posts about it, but none on what causes it or how to debug it. Any help with the following questions would really be appreciated:
I am not able to log into the system; I think it's because all the cores are in a "soft lockup" state, and hence I cannot trigger a kernel dump from a shell prompt. I enabled SysRq and tried to trigger a kernel dump with a SysRq key combo, but no luck. The system does not seem to respond to the keyboard at all (not even to the CapsLock key). Any suggestions on how I can trigger a kernel dump in this situation?
I can imagine the possibility of my driver thread causing a "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?
From browsing the net, the "migration" thread is used to move tasks from one CPU to another. Can someone please help me understand what this thread exactly does, and how it can be affected by other threads, if at all?
I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.
It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.
Check if CentOS offers an intel-microcode package, and install it. Make sure you configure grub to load it as the initial ramdisk before it loads initramfs.
Personally, I upgraded my microcode by booting into Windows and running a BIOS update. You can check whether the microcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo before and after the update.
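As a sketch, the check itself looks like this (Linux, x86; the field can be absent on some virtual machines, hence the fallback message):

```shell
# Print the CPU microcode revision reported by the kernel.
grep -m1 '^microcode' /proc/cpuinfo || echo "microcode field not reported"
```

Run it once before and once after the update; a higher revision number confirms the new microcode was loaded.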

Can I run CUDA on Intel's integrated graphics processor?

I have a very simple Toshiba laptop with an i3 processor, and no expensive graphics card. In the display settings, I see Intel(HD) Graphics as the display adapter. I am planning to learn some CUDA programming, but I am not sure I can do that on this laptop, as it does not have an Nvidia CUDA-enabled GPU.
In fact, I doubt I even have a GPU o_o
So, I would appreciate it if someone could tell me whether I can do CUDA programming with my current configuration, and also what "Intel(HD) Graphics" means.
At the present time, Intel graphics chips do not support CUDA. It is possible that, in the nearest future, these chips will support OpenCL (which is a standard that is very similar to CUDA), but this is not guaranteed and their current drivers do not support OpenCL either. (There is an Intel OpenCL SDK available, but, at the present time, it does not give you access to the GPU.)
The newest Intel processors (Sandy Bridge) have a GPU integrated into the CPU die. Your processor may be a previous-generation version, in which case "Intel(HD) Graphics" is an independent chip.
The Portland Group has a commercial product called CUDA-x86: a hybrid compiler that takes CUDA C/C++ code and runs it either on the GPU or via SIMD on the CPU, fully automatically, without any intervention from the developer. Hope this helps.
Link: http://www.pgroup.com/products/pgiworkstation.htm
If you're interested in learning a language which supports massive parallelism, you had better go for OpenCL, since you don't have an Nvidia GPU. You can run OpenCL on Intel CPUs, but at best you can learn to program SIMD.
Optimization on the CPU and on the GPU are different. I really don't think you can use an Intel card for GPGPU.
Intel HD Graphics is usually the on-CPU graphics chip in newer Core i3/i5/i7 processors.
As far as I know it doesn't support CUDA (which is a proprietary Nvidia technology), but OpenCL is supported by Nvidia, ATI, and Intel.
In 2020, ZLUDA was created, which provides a CUDA API on top of Intel GPUs. It is not production-ready yet, though.
