How can a large number of assignments to the same array cause a pyopencl.LogicError when run on GPU? - python-3.x

I'm using pyOpenCL to do some complex calculations.
It runs fine on CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB).
I'm working on Mac OS X Lion (10.7.5)
The strange thing is that this error does not always show up. It seems to occur when my calculations use larger numbers (resulting in more loop iterations), but only when run on the GPU.
I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.
I simplified my OpenCL code as much as possible, and from what was left I created some very simple code with extremely weird behavior that causes the pyopencl.LogicError. It consists of 2 nested loops in which a couple of assignments are made to the result array. These assignments need not even depend on the state of the loops.
This is run on a single thread (or work item, shape = (1,)) on the GPU.
__kernel void weirdError(__global unsigned int* result){
    unsigned int outer = (1<<30)-1;
    for(int i=20; i--; ){
        // Roughly 2^30 iterations of the inner loop per outer pass.
        unsigned int inner = 0;
        while(inner != outer){
            // The stored values do not depend on the loop state at all.
            result[0] = 1248;
            result[1] = 1337;
            inner++;
        }
        outer++;
    }
}
The strange part is that removing either one of the assignments to the result array removes the error. Also, decreasing the initial value for outer (down to (1<<20)-1 for example) also removes the error. In these cases, the code returns normally, with the correct result available in the corresponding buffer.
On CPU, it never raises an error.
The OpenCL code is run from Python using PyOpenCL.
Nothing fancy in the setup:
import numpy
import pyopencl as cl

mem_flags = cl.mem_flags

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)
In this Python code I initialize result_buf to zeros, run the OpenCL kernel that sets its values inside the large loops, and then try to read the result back from device memory; that is where it goes wrong:
result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)
shape = (1,)
program.weirdError(queue, shape, None, result_buf)
cl.enqueue_copy(queue, result, result_buf)
The last line gives me:
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
How can this repeated assignment cause an error?
And more importantly: how can it be avoided?
I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.
DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.

My advice is to avoid such long loops inside a kernel. A single work item is performing over a billion iterations, and that's a long shot. Most likely the driver kills your kernel because it takes too long to execute. Reduce the number of iterations to the largest value that doesn't lead to the error and look at the execution time. If it takes something like seconds, that's too much.
As you said, reducing the number of iterations solves the problem, and in my opinion that's the evidence. Reducing the number of assignment operations also makes the kernel run faster, as memory operations are usually the slowest.
The CPU doesn't face such difficulties for obvious reasons.
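As an illustration of that advice, here is a minimal PyOpenCL sketch of splitting the work across many short kernel launches, so that no single launch runs long enough to trip the driver's watchdog. The chunked kernel (weirdErrorChunk) and the chunk size are my own assumptions, not code from the question:
import numpy
import pyopencl as cl

# Hypothetical chunked variant of the question's kernel: each launch performs
# only `chunk` iterations, so every launch finishes quickly.
chunked_src = """
__kernel void weirdErrorChunk(__global unsigned int* result,
                              unsigned int chunk){
    for(unsigned int inner = 0; inner != chunk; inner++){
        result[0] = 1248;
        result[1] = 1337;
    }
}
"""

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
queue = cl.CommandQueue(context)
program = cl.Program(context, chunked_src).build()

result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                       hostbuf=result)

total = (1 << 30) - 1          # total iterations you actually need
chunk = 1 << 20                # per-launch work, small enough to stay under the watchdog
done = 0
while done < total:
    this_chunk = min(chunk, total - done)
    program.weirdErrorChunk(queue, (1,), None, result_buf, numpy.uint32(this_chunk))
    queue.finish()             # wait for each launch so they don't pile up
    done += this_chunk

cl.enqueue_copy(queue, result, result_buf)
print(result)                  # expected: [1248 1337]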

This timeout problem can be fixed on Windows and Linux, but apparently not on a Mac.
Windows
This answer to a similar question (explaining the symptoms in Windows) tells both what is going on and how to fix it:
This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).
Your options are:
1. You need to make sure that your kernels are not too time-consuming (best).
2. You can turn off the watchdog timer: Timeout Detection and Recovery of GPUs.
3. You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.
This answer explains how to actually do (2) in Windows 7. But the MSDN page for these registry keys mentions that they should not be manipulated by any applications outside targeted testing or debugging. So it might not be the best option, but it is an option.
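For reference only (and keeping that warning in mind), the TDR keys documented on that MSDN page live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. A minimal .reg sketch that raises the watchdog delay from the default 2 seconds to 60 seconds could look like the following; a reboot is required for it to take effect:
Windows Registry Editor Version 5.00

; TdrDelay is specified in seconds; 0x3c = 60.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c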
Linux
(From the CUDA Release Notes, but also applicable to OpenCL)
GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to an X display.
While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.
Mac
Apple apparently does not allow fiddling with this watchdog, so the only option seems to be to use a second GPU (without a screen attached to it).
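If your machine does have a second GPU, a minimal PyOpenCL sketch for steering the work to a device other than the one driving the display might look like the following. OpenCL cannot tell you which device has a screen attached, so the index used here is an assumption you must verify against the printed device names:
import pyopencl as cl

# List every GPU the first platform exposes; on a multi-GPU machine the
# display GPU and the headless GPU show up as separate devices.
platform = cl.get_platforms()[0]
gpus = platform.get_devices(cl.device_type.GPU)
for i, dev in enumerate(gpus):
    print(i, dev.name)

# Assumption: index 1 is the GPU without a display attached on this machine;
# check the names printed above before relying on this.
device = gpus[1] if len(gpus) > 1 else gpus[0]
context = cl.Context([device])
queue = cl.CommandQueue(context)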

Related

CPU/Threads usage on M1 Pro (Apple Silicon) using openMP

Hope someone knows the answer to this...
I have code that compiles perfectly well with OpenMP (it uses libsharp). However, I am finding it impossible to make the M1 Pro chip use all of the 8 or 10 cores I have.
I am setting the threads variable correctly with export OMP_NUM_THREADS=10, and the code correctly identifies that it is supposed to be running with 10 threads (see the Activity Monitor screenshot below):
[Activity Monitor screenshot: the process is compiled for Apple Silicon and runs 10 threads, but uses only a small fraction of the available CPU.]
Does anyone know how to properly compile/set the number of threads such that all the cores will be used?
This is trivial in x86 architectures.
Not really an answer, but too long for a comment...
If both LLVM and GCC behave the same then it's not an OpenMP runtime issue. (And your monitor output shows that the correct number of threads have been created). I'm also not certain that it's really an Arm issue.
Are you comparing with an Apple x86 machine (so running the same operating system), or with a Linux x86 system?
The scheduling decisions of the two OSes are likely different, and (for instance) macOS has no interface for binding threads to logical CPUs.
As well as that, there's the issue of having some fast and some slow cores. That could mean that statically scheduled loops are inefficient.
I'm also confused by the fact that you seem to show multiple instances of your code running at the same time, so you are explicitly causing over-subscription of the logical CPUs...

Code working on windows but launch failures on Linux

First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem but a configuration problem.
I have a piece of CUDA code that does some mathematics in its kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and a Linux machine (Ubuntu 17.04, 2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. It is long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) is forced to run on a single GPU, the first one, selected with cudaSetDevice(0).
When I copy the same input data and code to the linux machine (I am using git, it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
If I change the kernel so that, instead of doing the math, it just writes a number to the output, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that the problem is outside the code, not with the code itself, nor with the general configuration of the drivers/environment variables.
I read that xorg.conf can have an effect on the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from there, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still fail.
My question is: what else should I look at? What Linux-specific configuration should I check to pinpoint the cause of the kernel failures?
The error did indeed end up being an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is machine specific: my Linux machine returns 8 while my Windows machine returns 4. Since this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines variable sizes in bits (e.g. uint32(1)), there was a size mismatch on the Linux machine when doing memcpys. It turned out this happened in a variable used as an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, producing an illegal memory error.
Too specific? yeah.
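As an aside (not the asker's actual code), the underlying size difference is easy to demonstrate from Python, and it is why fixed-width types are the safer choice when copying buffers between languages:
import ctypes

# Platform dependent: 4 on 64-bit Windows (LLP64), 8 on 64-bit Linux (LP64).
print("sizeof(unsigned long) =", ctypes.sizeof(ctypes.c_ulong))

# Fixed-width types have the same size everywhere, so a buffer declared as
# uint32 on the MATLAB/Python side should be handled as uint32_t on the C side.
print("sizeof(uint32_t)      =", ctypes.sizeof(ctypes.c_uint32))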

OpenGL: render time limit on linux

I'm implementing a computation algorithm via OpenGL and Qt. All computations are executed in a fragment shader.
Sometimes when I try to execute some heavy computations (that take more than 5 seconds on the GPU), OpenGL aborts the computation before it ends. I suppose this is a mechanism like TDR on Windows.
I think I should split the input data into several parts, but I need to know how long a computation is allowed to run.
How can I obtain the render time limit on Linux (a cross-platform solution would be even better)?
I'm afraid this is not possible. After a lot of scouring through the documentation of both X and Wayland, I could not find anything mentioning GPU watchdog timer settings, so I believe this is driver-specific and likely inaccessible to the user (that or I am terrible at searching).
It is however possible to disable this watchdog under X on NVIDIA hardware by adding a line to your xorg.conf, which is then passed on to the graphics driver.
Option "Interactive" "boolean"
This option controls the behavior of the driver's watchdog, which attempts to detect and terminate GPU programs that get stuck, in order to ensure that the GPU remains available for other processes. GPU compute applications, however, often have long-running GPU programs, and killing them would be undesirable. If you are using GPU compute applications and they are getting prematurely terminated, try turning this option off.
Note that even the NVIDIA docs don't mention a numeric quantity for the timeout.
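For concreteness, a sketch of how that option is typically placed in the Device section of xorg.conf; the Identifier is just an example, and X needs to be restarted for the change to take effect:
Section "Device"
    Identifier "nvidia-gpu"          # example identifier; match it to your setup
    Driver     "nvidia"
    Option     "Interactive" "off"   # disable the driver's watchdog for long-running compute work
EndSection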

Reducing memory usage in an extended Mathematica session

I'm doing some rather long computations, which can easily span a few days. In the course of these computations, sometimes Mathematica will run out of memory. To this end, I've ended up resorting to something along the lines of:
ParallelEvaluate[$KernelID]; (* Force the kernels to launch *)
kernels = Kernels[];
Do[
  If[Mod[iteration, n] == 0,
    CloseKernels[kernels];
    LaunchKernels[kernels];
    ClearSystemCache[]];
  (* Complicated stuff here *)
  Export[...], (* If a computation ends early I don't want to lose past results *)
  {iteration, min, max}]
This is great and all, but over time the main kernel accumulates memory. Currently, my main kernel is eating up roughly 1.4 GB of RAM. Is there any way I can force Mathematica to clear out the memory it's using? I've tried littering Share and Clear throughout the many Modules I'm using in my code, but the memory still seems to build up over time.
I've tried also to make sure I have nothing big and complicated running outside of a Module, so that something doesn't stay in scope too long. But even with this I still have my memory issues.
Is there anything I can do about this? I'm always going to have a large amount of memory being used, since most of my calculations involve several large and dense matrices (usually 1200 x 1200, but it can be more), so I'm wary about using MemoryConstrained.
Update:
The problem was exactly what Alexey Popkov stated in his answer. If you use Module, memory will leak slowly over time. It happened to be exacerbated in this case because I had multiple Module[..] statements. The "main" Module was within a ParallelTable where 8 kernels were running at once. Tack on the (relatively) large number of iterations, and this was a breeding ground for lots of memory leaks due to the bug with Module.
Since you are using Module extensively, I think you may be interested in knowing about this bug where temporary Module variables are not deleted.
Example (unreferenced temporary variables are not deleted, and neither are their definitions):
In[1]:= $HistoryLength=0;
a[b_]:=Module[{c,d},d:=9;d/;b===1];
Length@Names[$Context<>"*"]
Out[3]= 6
In[4]:= lst=Table[a[1],{1000}];
Length@Names[$Context<>"*"]
Out[5]= 1007
In[6]:= lst=.
Length@Names[$Context<>"*"]
Out[7]= 1007
In[8]:= Definition@d$999
Out[8]= Attributes[d$999]={Temporary}
d$999:=9
Note that in the above code I set $HistoryLength = 0; to stress this buggy behavior of Module. If you do not do this, temporary variables can still be linked from history variables (In and Out) and for that reason will not be removed along with their definitions in a broader set of cases (which is not a bug but a feature, as Leonid mentioned).
UPDATE: Just for the record, there is another old bug, present in v5.2 and not completely fixed even in version 7.0.1, where unreferenced Module variables are not deleted after Part assignments to them:
In[1]:= $HistoryLength=0;$Version
Module[{L=Array[0&,10^7]},L[[#]]++&/@Range[100];];
Names["L$*"]
ByteCount@Symbol@#&/@Names["L$*"]
Out[1]= 7.0 for Microsoft Windows (32-bit) (February 18, 2009)
Out[3]= {L$111}
Out[4]= {40000084}
Have you tried evaluating $HistoryLength=0; in all subkernels as well as in the master kernel? History tracking is the most common reason for running out of memory.
Have you tried not using the slow and memory-consuming Export, and using the fast and efficient Put instead?
It is not clear from your post where you evaluate ClearSystemCache[] - in the master kernel or in the subkernels? It looks like you evaluate it in the master kernel only. Try evaluating it in all subkernels too, before each iteration.

select() inside infinite loop uses significantly more CPU on RHEL 4.8 virtual machine than on a Solaris 10 machine

I have a daemon app written in C and is currently running with no known issues on a Solaris 10 machine. I am in the process of porting it over to Linux. I have had to make minimal changes. During testing it passes all test cases. There are no issues with its functionality. However, when I view its CPU usage when 'idle' on my Solaris machine it is using around .03% CPU. On the Virtual Machine running Red Hat Enterprise Linux 4.8 that same process uses all available CPU (usually somewhere in the 90%+ range).
My first thought was that something must be wrong with the event loop. The event loop is an infinite loop (while(1)) with a call to select(). The timeval is set up so that timeval.tv_sec = 0 and timeval.tv_usec = 1000. This seems reasonable enough for what the process is doing. As a test I bumped timeval.tv_sec to 1. Even after doing that I saw the same issue.
Is there something I am missing about how select works on Linux vs. Unix? Or does it work differently with an OS running on a virtual machine? Or maybe there is something else I am missing entirely?
One more thing: I am not sure which version of VMware Server is being used. It was just updated about a month ago, though.
I believe that Linux returns the remaining time by writing it into the time parameter of the select() call and Solaris does not. That means that a programmer who isn't aware of the POSIX spec might not reset the time parameter between calls to select.
This would result in the first call having 1000 usec timeout and all other calls using 0 usec timeout.
As Zan Lynx said, the timeval is modified by select on Linux, so you should reassign the correct value before each select call. I also suggest checking whether some of the file descriptors are in a particular state (e.g. end of file, peer connection closed...). Maybe the port is exposing some latent bug in the analysis of the returned values (FD_ISSET and so on). It happened to me too some years ago in a port of a select-driven loop: I was using the returned value in the wrong way, and a closed fd was added to the rd_set, causing select to fail. On the old platform the wrong fd happened to have a value higher than maxfd, so it was ignored. Because of the same bug, the program didn't recognize the select failure (select() == -1) and looped forever.
Bye!
