OpenGL: render time limit on Linux

I'm implementing a computation algorithm with OpenGL and Qt. All computations are executed in a fragment shader.
Sometimes when I try to run some heavy computations (ones that take more than 5 seconds on the GPU), OpenGL breaks off the computation before it finishes. I suppose this is something like TDR on Windows.
I think I should split the input data into several parts, but I need to know how long a computation is allowed to take.
How can I obtain the render time limit on Linux (a cross-platform solution would be even better)?

I'm afraid this is not possible. After a lot of scouring through the documentation of both X and Wayland, I could not find anything mentioning GPU watchdog timer settings, so I believe this is driver-specific and likely inaccessible to the user (that or I am terrible at searching).
It is however possible to disable this watchdog under X on NVIDIA hardware by adding a line to your xorg.conf, which is then passed on to the graphics driver.
Option "Interactive" "boolean"
This option controls the behavior of the driver's watchdog, which attempts to detect and terminate GPU programs that get stuck, in order to ensure that the GPU remains available for other processes. GPU compute applications, however, often have long-running GPU programs, and killing them would be undesirable. If you are using GPU compute applications and they are getting prematurely terminated, try turning this option off.
Note that even the NVIDIA docs don't mention a numeric quantity for the timeout.
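For reference, a minimal sketch of how that option would sit in xorg.conf (the Identifier value is a placeholder; only the Option line matters, and you need to restart X for it to take effect):
Section "Device"
    Identifier "NvidiaGPU"               # placeholder name
    Driver     "nvidia"
    Option     "Interactive" "False"     # disable the driver's GPU watchdog for this device
EndSection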

Related

What causes dma_map_page/dma_unmap_page to take a longer time on some hardware?

I've been programming a Linux kernel module for a PCIe device for several years. One of its main features is transferring data from the PCIe card to host memory using DMA.
I'm using streaming DMA, i.e. it's the user program that allocates the memory, and my kernel module has to do the job of locking the pages and creating the scatter-gather structure. It works correctly.
However, when used on some more recent hardware with Intel processors, the calls to dma_map_page and dma_unmap_page take much longer to execute.
I've tried using dma_map_sg and dma_unmap_sg; they take approximately the same (longer) time.
I've tried splitting dma_unmap_sg into a first call to dma_sync_sg_for_cpu, followed by a call to dma_unmap_sg_attrs with the DMA_ATTR_SKIP_CPU_SYNC attribute. It works correctly, and I can see the additional time is spent on the unmap operation, not on the sync.
I've tried playing with the Linux command-line parameters relating to the IOMMU (on, force, strict=0), and also intel_iommu, with no change in behavior.
Some other hardware shows a decent transfer rate, i.e. more than 6 GB/s on PCIe 3.0 x8 (max 8 GB/s).
The issue on some recent hardware is limiting the transfer rate to ~3 GB/s. (I've checked that the card is correctly configured for PCIe 3.0 x8, and the programmer of the Windows device driver manages to achieve 6 GB/s on the same system. Things are more hidden behind the curtain on Windows and I cannot get much information from him.)
On some hardware, the behavior is either normal or slowed down, depending on the Linux distribution (and the Linux kernel version, I guess). On some other hardware, the roles are reversed, i.e. the slow one becomes the fast one and vice versa.
I cannot figure out the cause of this. Any clue?
The trouble was the bounce buffers; I didn't know about those. When the DMA mapping layer decides the device cannot reach a buffer directly (for example because of swiotlb or IOMMU settings), it copies the data through an intermediate bounce buffer on map/unmap, and that copy is where the extra time goes.

How can a large number of assignments to the same array cause a pyopencl.LogicError when run on GPU?

I'm using pyOpenCL to do some complex calculations.
It runs fine on CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB).
I'm working on Mac OS X Lion (10.7.5)
The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in larger iterations) but only when run on GPU.
I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.
I simplified my OpenCL code as much as possible, and from what was left I created some very simple code with extremely weird behavior that causes the pyopencl.LogicError. It consists of two nested loops in which a couple of assignments are made to the result array. These assignments need not even depend on the state of the loop.
This is run on a single thread (or work item, shape = (1,)) on the GPU.
__kernel void weirdError(__global unsigned int* result){
    unsigned int outer = (1<<30)-1;
    for(int i=20; i--; ){
        unsigned int inner = 0;
        while(inner != outer){
            result[0] = 1248;
            result[1] = 1337;
            inner++;
        }
        outer++;
    }
}
The strange part is that removing either one of the assignments to the result array removes the error. Also, decreasing the initial value for outer (down to (1<<20)-1 for example) also removes the error. In these cases, the code returns normally, with the correct result available in the corresponding buffer.
On CPU, it never raises an error.
The OpenCL code is run from Python using PyOpenCL.
Nothing fancy in the setup:
import numpy
import pyopencl as cl
from pyopencl import mem_flags

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)
In this Python code I set the result_buf to 0, then I run the calculation in OpenCL that will set its values in a large iteration. Afterwards I try to collect this value from the device memory, but that's where it goes wrong:
result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)
shape = (1,)
program.weirdError(queue, shape, None, result_buf)
cl.enqueue_copy(queue, result, result_buf)
The last line gives me:
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
How can this repeated assignment cause an error?
And more importantly: how can it be avoided?
I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.
DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.
My advice is to avoid such long loops inside a kernel. The work item is performing over a billion iterations, and that's a long shot: most likely the driver kills your kernel because it takes too long to execute. Reduce the number of iterations to the largest value that doesn't lead to the error and look at the execution time; if it takes on the order of seconds, that's too much.
As you said, reducing the number of iterations makes the problem go away, which in my opinion is the giveaway. Reducing the number of assignment operations also makes the kernel run faster, since memory operations are usually the slowest.
The CPU doesn't face such difficulties, for obvious reasons.
This timeout problem can be fixed on Windows and Linux, but apparently not on Mac.
Windows
This answer to a similar question (explaining the symptoms in Windows) tells both what is going on and how to fix it:
This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).
Your options are:
1. You need to make sure that your kernels are not too time-consuming (best).
2. You can turn off the watchdog timer: Timeout Detection and Recovery of GPUs.
3. You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.
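As an illustration of option 1, here is a minimal sketch (my own, not from the original question) of how the long loop could be split across several short kernel launches, each staying well under the watchdog limit; the kernel name chunked, the chunk size, and the argument layout are assumptions:
import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void chunked(__global unsigned int* result,
                      unsigned int start,
                      unsigned int count){
    // same repeated writes as before, but only over a bounded range
    for (unsigned int inner = start; inner < start + count; inner++){
        result[0] = 1248;
        result[1] = 1337;
    }
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, kernel_src).build()

result = np.zeros(2, np.uint32)
result_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                       hostbuf=result)

total = (1 << 30) - 1   # the full iteration count from the original kernel
chunk = 1 << 24         # per-launch work; an arbitrary value, tune so one launch stays well under a second
start = 0
while start < total:
    count = min(chunk, total - start)
    prog.chunked(queue, (1,), None, result_buf,
                 np.uint32(start), np.uint32(count))
    queue.finish()      # each launch completes before the next one starts
    start += count

cl.enqueue_copy(queue, result, result_buf)
print(result)           # should contain [1248, 1337]
Because the queue is drained between launches, no single kernel runs long enough to trip the watchdog, at the cost of some launch overhead.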
This answer explains how to actually do (2) on Windows 7. But the MSDN page for these registry keys mentions that they should not be manipulated by any applications outside targeted testing or debugging. So it might not be the best option, but it is an option.
Linux
(From Cuda Release Notes, but also applicable to OpenCL)
GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display.
While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.
Mac
Apple apparently does not allow fiddling with this watchdog, so the only option seems to be using a second GPU (without a screen attached to it).

ImageMagick's display GPU "memory leak"?

I'm testing a CUDA app and I have run into a strange memory issue:
My program performs some image operations and displays them using ImageMagick's display program.
The problem is that every time I run IM's display, GPU memory usage grows, leaving less memory for GPU computation.
I'm using IM's display because I couldn't find anything else that displays an image from pipe input. Any suggestions?
Anyway, why does IM's display take so much GPU memory, and why is it not freed?
Based on your question, you're attempting to display a series of files in sequence using a shell not unlike Bash after performing a set of GPU-intensive operations. You're curious why more GPU memory is being consumed with every subsequent invocation of ImageMagick display, which appears to be closing out successfully after the conclusion of each operation.
We may further theorize that you're using ImageMagick's OpenCL support for at least some of your processing. While we don't have enough information to determine what your GPU's texture buffers look like at the completion of each rendering via display, I speculate your GPU isn't freeing textures expediently, causing memory to slowly creep up.
Instead of continuing to build conjecture around this hypothesis, I will instead recommend a tool to debug your issue: gDEBugger. This should allow you to interrogate your video card to determine exactly why things are slowing down.
Best of luck with your application.
I know it's old, but we figured out that opening a program through a pipe (popen()) effectively makes a copy of the calling process in memory, which also duplicates its end-of-program cleanup state. So when I close a program opened with popen(), it also tears down the CUDA-related contexts that are normally freed in the background when the program ends. Cleaning up CUDA memory after closing the popen()'ed application therefore didn't work, and I think this was both my memory leak and the major error in my program.
I hope someone will find this useful.

Is there a way to independently task and use heterogeneous multiple GPUs in a Windows 7 system?

Can I have two mixed chipset/generation AMD GPUs in my desktop, a 6950 and a 4870, and dedicate one GPU (the 4870) to OpenCL/GPGPU purposes only, eliminating the device from video output or display-driving consideration by the OS, allowing the 4870 to essentially remain in a deep sleep or appear ejected/disabled until its stream processors are called upon?
Compared to the 4870, the 6950 is a heavyweight in OpenCL calculations; enough so that it can crunch numbers and still allow an active user session, and even web browsing. However, as soon as I navigate to a webpage with embedded Flash video, or forget what I have running and open Media Player or Media Center (basically any GPU-accelerated video task that requires the 6950 to initialize UVD), the display system hangs.
I'm looking for a way to plug my 4870 into an open PCIe slot, have it sit in a dormant state with near-zero heat production and power consumption (essentially only maintaining the interface signalling, like an Ethernet card in a powered-off desktop holding the line and waiting for a WOL command), and then attain a D0 state (I don't even care if the latency of this wake event is on the scale of seconds) to run OpenCL calculations on its own. I do not wish to achieve a non-CF heterogeneous GPU teaming setup! In my example of a UVD hang, the desired outcome would be to manually stop the OpenCL calculations on the 6950 and start them on the 4870, freeing the 6950 for multimedia usage/gaming (granted, with a hit to the calculation rate). Even better if the two GPUs could independently run similar calculations while no one is using the desktop. I don't even mind if I have to initiate the power-state transitions of the 4870 into/out of an 'OFF' state myself (say, by a shortcut on the desktop), as long as it doesn't require a system restart, ending the user session, or logging off, and as long as the manual ON/OFF 'switch' for the 4870 is something any proficient Windows end user could operate, like clicking a shortcut to run a script, or even going into Device Manager and toggling enable/disable. As long as the 4870 isn't wastefully idling for one sole use that may occur only sporadically.
The only solution I could think of to facilitate this would be writing a new INI for the 4870 to override the typical power-management characteristics written for use of the device as a regular graphics card (say, to drop in and out of the powered state without relinquishing IRQs or other allocated resources, to 'hold the door open' on interface availability and addressing). But that is an endeavour well above my abilities, and I can easily see additional licensing involvement being required to achieve it.
Windows 7 (and maybe Windows 10) doesn't define a "selected device". It is software's own responsibility to pick the right device. For example, Google Chrome's video-decode component will pick whatever GPU comes first in its list of targets; if it is written to pick the first-indexed device, then switching the cards' roles takes a PCIe re-plug of both of them.
This OS is written to fit the majority (99%) of users, not multi-GPU users (1%?). It simply picks one of the GPUs, unless the software takes explicit control over devices, benchmarks all GPUs, and picks the fastest. So you should look at the software's abilities rather than the OS.
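To illustrate what "explicit control over devices" can look like on the application side, here is a minimal PyOpenCL sketch (my own assumption-laden example, not from the answer; device indices and names will differ on your system) that enumerates the GPUs and deliberately builds its context on the second one:
import pyopencl as cl

platform = cl.get_platforms()[0]                     # e.g. the AMD platform
gpus = platform.get_devices(cl.device_type.GPU)
for i, dev in enumerate(gpus):
    print(i, dev.name, dev.global_mem_size // (1024 ** 2), "MiB")

# Pick the "spare" GPU (assumed here to be index 1, e.g. the 4870)
compute_dev = gpus[1] if len(gpus) > 1 else gpus[0]
ctx = cl.Context([compute_dev])
queue = cl.CommandQueue(ctx)
# ... build programs and enqueue kernels on this queue only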
The same goes for games. When I play Dota 2 on the Vulkan API, it uses the HD 7870 for compute (textures, particles, etc.) but the R7 240 for graphics, whereas I would prefer the opposite because the R7 240 can't draw fast. The game is written for the majority of people, who don't have more than one GPU.
Money controls development, I'm sorry to say, and market penetration is needed for money. 99% market penetration means writing code for the general public, not for scientists or people with exotic rigs. The public simply has one GPU, and a cheap one at that.
I wish for this:
select one GPU for: unzipping files, watching videos, compressing internet uploads and caching for the file system (up to 2 GB)
select another GPU for: gaming, OpenCL applications, mining, ...
select all GPUs for: games, benchmarks, being seen as a single device by my applications, ...
but it is not guaranteed to come true, because money still talks.
If I were a driver developer, I would add a "rename" option (and become poor in return) to expose N virtual devices to the OS, so that the OS and other software could use just 1/N of the whole system's power, or all N/N of it, simply by addressing those renamed devices or the main ones. A rename could be as small as a single compute unit of a GPU. When the OS tells the driver "give me 25% of all cores that share the same memory", the driver would pick a device and hand over 25% of the system's total cores. That way even users could create renames for their own work.
I even sent Microsoft a message about a "file system cache on my 2nd graphics card", but they never replied!

CUDA/PyCUDA: Which GPU is running X11?

In a Linux system with multiple GPUs, how can you determine which GPU is running X11 and which is completely free to run CUDA kernels? In a system that has a low powered GPU to run X11 and a higher powered GPU to run kernels, this can be determined with some heuristics to use the faster card. But on a system with two equal cards, this method cannot be used. Is there a CUDA and/or X11 API to determine this?
UPDATE: The command 'nvidia-smi -a' shows whether a "display" is connected or not. I have yet to determine whether this means physically connected, logically connected (running X11), or both. Running strace on this command shows lots of ioctls being invoked and no calls to X11, so I assume the card is reporting that a display is physically connected.
There is a device property kernelExecTimeoutEnabled in the cudaDeviceProp structure which will indicate whether the device is subject to a display watchdog timer. That is the best indicator of whether a given CUDA device is running X11 (or the Windows/Mac OS equivalent).
In PyCUDA you can query the device status like this:
In [1]: from pycuda import driver as drv
In [2]: drv.init()
In [3]: print drv.Device(0).get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT)
1
In [4]: print drv.Device(1).get_attribute(drv.device_attribute.KERNEL_EXEC_TIMEOUT)
0
Here device 0 has a display attached, and device 1 is a dedicated compute device.
I don't know of any library function that could check that. However, one "hack" comes to mind:
X11, or any other system component that manages a connected monitor, must consume some of the GPU's memory.
So check whether both devices report the same amount of total global memory through cudaGetDeviceProperties (the totalGlobalMem field).
If it is the same, try allocating that (or a slightly smaller) amount of memory on each GPU and see which one fails to do so (cudaMalloc returning an error flag).
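A rough PyCUDA sketch of that hack (untested; the 90% fraction is an arbitrary assumption and may need tuning) might look like this:
from pycuda import driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    total = dev.total_memory()
    ctx = dev.make_context()
    try:
        try:
            # Try to grab almost all of the reported memory; the GPU driving a
            # display is more likely to fail here, since X/the window manager
            # already holds part of it.
            buf = drv.mem_alloc(int(total * 0.9))
            buf.free()
            verdict = "large allocation succeeded -> probably no display attached"
        except drv.Error:
            verdict = "large allocation failed -> probably the display GPU"
    finally:
        ctx.pop()
    print("device %d (%s, %d MiB total): %s"
          % (i, dev.name(), total // 2**20, verdict))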
Some time ago I read somewhere (I don't remember where) that if you increase your monitor resolution while there is an active CUDA context on the GPU, the context may get invalidated. That hints that the above suggestion might work. Note, however, that I never actually tried it; it's just my wild guess.
If you manage to confirm that it works, or that it doesn't, let us know!
