I have identified a memory leak in matplotlib.imshow. I am aware of similar questions (like Excessive memory usage in Matplotlib imshow) and I've read the related IPython thread (https://github.com/ipython/ipython/issues/1623/).
I believe that the code below should (in the absence of a memory leak) consume a constant amount of memory while running. Instead, it grows with each iteration.
I'm running the most recent versions I can find (matplotlib-1.2.0rc3.win32-py2.7 and numpy-1.7.0.win32-py2.7), and the problem remains. I'm not keeping the return value of imshow, and in fact I'm explicitly deleting it, so I think the note in the IPython discussion doesn't apply. The behavior is identical with and without the explicit assign-and-del inside the loop.
I see the same behavior with matplotlib-1.2.0.win32-py2.7.
Each iteration seems to hang onto whatever memory was needed for the image. I've chosen a large (1024x1024) random matrix so that each image is interestingly large.
I'm running Win7 Pro with 2 GB of physical RAM and 32-bit Python 2.7.3 (hence the memory error), with the numpy and matplotlib packages above. The code below fails with a memory error around iteration 440. The Windows Task Manager reports consumption of 1,860,232 K when it fails.
Here is code that demonstrates the leak:
IMAGE_SIZE = 1024

import random

RANDOM_MATRIX = []
for i in range(IMAGE_SIZE):
    RANDOM_MATRIX.append([random.randint(0, 100) for each in range(IMAGE_SIZE)])

def exercise(aMatrix, aCount):
    for i in range(aCount):
        # imshow comes from the pylab star-import below
        anImage = imshow(aMatrix, origin='lower', vmin=0, vmax=100)
        del(anImage)

if __name__ == '__main__':
    from pylab import *
    exercise(RANDOM_MATRIX, 4096)
I could presumably render the image with PIL instead of matplotlib. In the absence of a workaround, I do think this is a show-stopper for matplotlib.
I struggled to make it work because many posts talk about this problem, but no one seems to provide a working example.
First of all, you should never use the from ... import * syntax with a library you didn't write yourself, because you can never be sure it doesn't declare a symbol that conflicts with one of yours.
Then, calling set_data is not sufficient to solve this problem, for three reasons:

1. You didn't mention where this set_data is called from. It is not a standalone function but a method of an object. Which object?

2. set_data alone won't be sufficient if nothing "activates" the changes. Sometimes this happens transparently because another plot triggers it, but if it doesn't, you will need to call flush_events() yourself.

3. set_data won't work if you haven't called imshow() with values it can use to set up its color map.
Here is a working solution (link):
I think I found a workaround; I didn't fully realize how heavyweight imshow is.
The answer is to call imshow just once, then call set_data with RANDOM_MATRIX for each subsequent image.
Problem solved!
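For concreteness, here is a minimal sketch of that approach, assuming matplotlib's standard pyplot API; the figure setup and loop below are illustrative, not taken from the original post:

import numpy as np
import matplotlib.pyplot as plt

IMAGE_SIZE = 1024

fig, ax = plt.subplots()
first = np.random.randint(0, 101, (IMAGE_SIZE, IMAGE_SIZE))
image = ax.imshow(first, origin='lower', vmin=0, vmax=100)  # call imshow exactly once

for i in range(100):
    new_matrix = np.random.randint(0, 101, (IMAGE_SIZE, IMAGE_SIZE))
    image.set_data(new_matrix)      # reuse the existing AxesImage instead of creating a new one
    fig.canvas.draw_idle()          # request a redraw
    fig.canvas.flush_events()       # process pending GUI events so the update appears

Because only one AxesImage ever exists, memory usage should stay roughly constant across iterations.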
Related
I am evaluating tools that profile my Python program. One of the interesting tools here is memory_profiler. Before moving forward, I just want to know whether memory_profiler affects runtime. The reason I am asking is that memory_profiler outputs a lot of memory usage data, so I suspect it might affect runtime.
Thanks
Derek
It depends on how you are using memory_profiler. It can be used in two different ways:

To get memory usage line by line (run with python -m memory_profiler my_script.py). This needs to query the OS for memory information on every line executed within the profiled function. How much this affects runtime depends on the number of lines in the function: if it has many lines with fast execution times, the overhead can be significant; on the other hand, if the function to profile has few lines and each line takes a significant amount of computing time, the overhead will be negligible.

To get memory as a function of time (run with mprof run my_script.py and plot with mprof plot). In this case the function that collects the memory usage runs in a different process from the one that runs your script, so the overhead is minimal (unless you are using all CPUs).
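As a hedged illustration of the two modes (the function name and workload below are made up for the example):

from memory_profiler import profile

@profile                    # enables the line-by-line report for this function
def build_lists():
    big = [list(range(10000)) for _ in range(1000)]
    return len(big)

if __name__ == '__main__':
    build_lists()

Run python -m memory_profiler my_script.py for the line-by-line report, or mprof run my_script.py followed by mprof plot for the sampled memory-over-time view.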
I have posted on this before, but thought I had tracked it down to the NW extension; however, memory leakage still occurs in the latest version. I found this thread, which discusses a similar issue but attributes it to BehaviorSpace:
http://netlogo-users.18673.x6.nabble.com/Behaviorspace-Memory-Leak-td5003468.html
I have found the same symptoms. My model starts out at around 650 MB, but with each run the private working set memory rises, to the point where it hits the 1024 MB limit. I have sufficient memory to raise this limit, but in reality that will only delay the onset. I am using the table output, as previous discussions suggested it helps, and it does, but it only slows the rate of increase. Eventually the memory usage rises to a point where the PC starts to struggle. I am clearing all data between runs, so there should be no hangover. I noticed in the highlighted thread that they were going to run headless; I will try this, but I wondered if anyone else had noticed the issue. My other option is to break the BehaviorSpace simulation into a few batches so the issue never arises, but it would be nice to let the model run and walk away, as it takes around 2 hours to go through.
Some possible next steps:
1) Isolate the exact conditions under which the problem does or does not occur. Can you make it happen without involving the nw extension, or not? Does it still happen if you remove some of the code from your model? What if you keep removing code: when does the problem go away? What is the smallest amount of code that still causes the problem? Almost any bug can be demonstrated with only a small amount of code, and finding that smallest demonstration is exactly what is needed in order to track down the cause and fix it.
2) Use standard memory profiling tools for the JVM to see what kind of objects are using the memory. This might provide some clues to possible causes.
In general, we are not receiving other bug reports from users along these lines. It has been routine for many years now for people to use BehaviorSpace (both headless and not) to run experiments that last for hours or even days. So whatever you're experiencing almost certainly has a more specific cause, most likely in the nw extension, that could be isolated.
I'm using pyOpenCL to do some complex calculations.
It runs fine on CPU, but I get an error when trying to run it on an NVIDIA GeForce 9400M (256 MB).
I'm working on Mac OS X Lion (10.7.5)
The strange thing is that this error does not always show up. It seems to occur when my calculations involve larger numbers (resulting in more iterations), but only when run on the GPU.
I'm not writing to memory locations I'm not supposed to write to. I ruled out possible problems with concurrent modification by running the code as a single work item.
I simplified my OpenCL code as much as possible, and from what was left I created some very simple code with extremely weird behavior that causes the pyopencl.LogicError. It consists of two nested loops in which a couple of assignments are made to the result array. These assignments need not even depend on the state of the loop.
This is run on a single thread (or work item, shape = (1,)) on the GPU.
__kernel void weirdError(__global unsigned int* result){
    unsigned int outer = (1<<30)-1;
    for(int i=20; i--; ){
        unsigned int inner = 0;
        while(inner != outer){
            result[0] = 1248;
            result[1] = 1337;
            inner++;
        }
        outer++;
    }
}
The strange part is that removing either one of the assignments to the result array removes the error. Also, decreasing the initial value for outer (down to (1<<20)-1 for example) also removes the error. In these cases, the code returns normally, with the correct result available in the corresponding buffer.
On CPU, it never raises an error.
The OpenCL code is run from Python using PyOpenCL.
Nothing fancy in the setup:
import numpy
import pyopencl as cl
mem_flags = cl.mem_flags

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
program = cl.Program(context, getProgramCode()).build()
queue = cl.CommandQueue(context)
In this Python code I set the result_buf to 0, then I run the calculation in OpenCL that will set its values in a large iteration. Afterwards I try to collect this value from the device memory, but that's where it goes wrong:
result = numpy.zeros(2, numpy.uint32)
result_buf = cl.Buffer(context, mem_flags.READ_WRITE | mem_flags.COPY_HOST_PTR, hostbuf=result)
shape = (1,)
program.weirdError(queue, shape, None, result_buf)
cl.enqueue_copy(queue, result, result_buf)
The last line gives me:
pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
How can this repeated assignment cause an error?
And more importantly: how can it be avoided?
I understand that this problem is probably platform dependent, and thus perhaps hard to reproduce. But this is the only machine I have access to, so the code should work on this machine.
DISCLAIMER: I have never worked with OpenCL (or CUDA) before. I wrote the code on a machine where the GPU did not support OpenCL. I always tested it on CPU. Now that I switched to GPU, I find it frustrating that errors do not occur consistently and I have no idea why.
My advice is to avoid such long loops inside a kernel. The work item performs over a billion iterations, and that's a lot. Most likely the driver kills your kernel because it takes too long to execute. Reduce the number of iterations to the largest value that doesn't lead to the error and look at the execution time: if it takes on the order of seconds, that's too much.

As you said, reducing the number of iterations solves the problem, and in my opinion that's the evidence. Reducing the number of assignment operations also makes the kernel run faster, since memory operations are usually the slowest.

The CPU doesn't face such difficulties, for obvious reasons.
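One way to check the execution time is to enable profiling on the command queue and read the kernel event's timestamps. This is only a sketch: it assumes the context, program, and result_buf from the question's setup, and uses PyOpenCL's standard PROFILING_ENABLE queue property.

import pyopencl as cl

# A queue created with profiling enabled makes kernel events carry timestamps.
profiling_queue = cl.CommandQueue(
    context, properties=cl.command_queue_properties.PROFILING_ENABLE)

event = program.weirdError(profiling_queue, (1,), None, result_buf)
event.wait()

elapsed_s = (event.profile.end - event.profile.start) * 1e-9  # timestamps are in nanoseconds
print("kernel took %.3f s" % elapsed_s)  # more than a second or so risks hitting the watchdog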
This timeout problem can be fixed in Windows and Linux, but apparently not in Mac.
Windows
This answer to a similar question (explaining the symptoms in Windows) tells both what is going on and how to fix it:
This is a known "feature" under Windows (not sure about Linux) - if the video driver stops responding, the OS will reset it. Except that, since OpenCL (and CUDA) is implemented by the driver, a kernel that takes too long will look like a frozen driver. There is a watchdog timer that keeps track of this (5 seconds, I believe).
Your options are:
1. You need to make sure that your kernels are not too time-consuming (best).
2. You can turn off the watchdog timer: Timeout Detection and Recovery of GPUs.
3. You can run the kernel on a GPU that is not hooked up to a display.
I suggest you go with 1.
This answer explains how to actually do (2) in Windows 7. But the MSDN page for these registry keys mentions that they should not be manipulated by any application outside targeted testing or debugging. So it might not be the best option, but it is an option.
Linux
(From Cuda Release Notes, but also applicable to OpenCL)
GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display.
While X does not need to be running in order to use CUDA, X must have been initialized at least once after booting in order to properly load the NVIDIA kernel module. The NVIDIA kernel module remains loaded even after X shuts down, allowing CUDA to continue to function.
Mac
Apple apparently does not allow fiddling with this watchdog, so the only option seems to be using a second GPU (without a screen attached to it).
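To illustrate option 1 (keeping each launch short), a common pattern is to split the long loop across many small kernel enqueues from the host. The sketch below is assumption-laden: the chunked kernel, chunk size, and loop bounds are invented for illustration and only mimic the structure of the question's code.

import numpy as np
import pyopencl as cl

# Hypothetical kernel that performs only a bounded number of iterations per launch.
KERNEL_SRC = """
__kernel void chunked(__global unsigned int* result, unsigned int count){
    for(unsigned int i = 0; i < count; i++){
        result[0] = 1248;
        result[1] = 1337;
    }
}
"""

platform = cl.get_platforms()[0]
device = platform.get_devices(cl.device_type.GPU)[0]
context = cl.Context([device])
queue = cl.CommandQueue(context)
program = cl.Program(context, KERNEL_SRC).build()

result = np.zeros(2, np.uint32)
result_buf = cl.Buffer(context, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                       hostbuf=result)

TOTAL = (1 << 30) - 1
CHUNK = 1 << 20                      # small enough to finish well under the watchdog limit
done = 0
while done < TOTAL:
    count = min(CHUNK, TOTAL - done)
    program.chunked(queue, (1,), None, result_buf, np.uint32(count))
    queue.finish()                   # each launch returns quickly, so the watchdog never fires
    done += count

cl.enqueue_copy(queue, result, result_buf)

Each individual launch now finishes well within the watchdog's limit, while the host loop carries the overall iteration count.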
How can I increase OpenCV video FPS in Linux on an Intel Atom? The video seems to lag when processing with OpenCV libraries.
Furthermore, I'm trying to execute a program/file with OpenCV:
system("/home/file/image.jpg");
However, it shows Access Denied.
There are several things you can do to improve performance: using OpenGL, using the GPU, and even just disabling certain functions within OpenCV. When you capture video you can also change the default FPS, which is sometimes set low. If you are getting Access Denied on that file, I would check its permissions, but without seeing the full error it is hard to figure out.
The first line below is an example of disabling RGB conversion and the second sets the desired FPS. I think these defines were renamed in OpenCV 3, though.
cap.set(CV_CAP_PROP_CONVERT_RGB , false);
cap.set(CV_CAP_PROP_FPS , 60);
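For reference, a hedged Python equivalent using the OpenCV 3+ constant names (the camera index is illustrative):

import cv2

cap = cv2.VideoCapture(0)              # 0 = default camera; adjust as needed
cap.set(cv2.CAP_PROP_CONVERT_RGB, 0)   # skip the RGB conversion step
cap.set(cv2.CAP_PROP_FPS, 60)          # request 60 FPS if the device/driver supports it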
From your question, it seems the problem is that your frame buffer collects a lot of frames which you are not able to clear out fast enough to keep up with real time, i.e. a frame captured now is processed several seconds later. Am I correct in understanding?
In this case, I'd suggest a couple of things:
Use a separate thread to grab the frames from VideoCapture and push these frames into a queue of limited size (see the sketch after this list). Of course this will lead to missed frames, but if you are interested in real-time processing then this cost is often justified.
If you are using OOP, then I suggest using a separate thread for each object, as this significantly speeds up the processing. You can see a several-fold increase depending on the application and the functions used.
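A minimal sketch of the threaded-grabber idea, assuming OpenCV's Python bindings; the queue size, camera index, and processing placeholder are illustrative:

import cv2
import queue
import threading

frames = queue.Queue(maxsize=2)        # small bound so stale frames are dropped, not accumulated

def grab(cap):
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frames.full():
            try:
                frames.get_nowait()    # discard the oldest frame
            except queue.Empty:
                pass
        frames.put(frame)

cap = cv2.VideoCapture(0)
threading.Thread(target=grab, args=(cap,), daemon=True).start()

while True:
    frame = frames.get()               # always close to the most recent frame
    # ... heavy per-frame processing would go here ...
    cv2.imshow('frame', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()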
I'm doing some rather long computations, which can easily span a few days. In the course of these computations, sometimes Mathematica will run out of memory. To this end, I've ended up resorting to something along the lines of:
ParallelEvaluate[$KernelID]; (* Force the kernels to launch *)
kernels = Kernels[];
Do[
  If[Mod[iteration, n] == 0,
    CloseKernels[kernels];
    LaunchKernels[kernels];
    ClearSystemCache[]];
  (* Complicated stuff here *)
  Export[...], (* If a computation ends early I don't want to lose past results *)
  {iteration, min, max}]
This is great and all, but over time the main kernel accumulates memory. Currently, my main kernel is eating up roughly 1.4 GB of RAM. Is there any way I can force Mathematica to clear out the memory it's using? I've tried littering Share and Clear throughout the many Modules I'm using in my code, but the memory still seems to build up over time.
I've tried also to make sure I have nothing big and complicated running outside of a Module, so that something doesn't stay in scope too long. But even with this I still have my memory issues.
Is there anything I can do about this? I'm always going to have a large amount of memory being used, since most of my calculations involve several large and dense matrices (usually 1200 x 1200, but it can be more), so I'm wary about using MemoryConstrained.
Update:
The problem was exactly what Alexey Popkov stated in his answer. If you use Module, memory will leak slowly over time. It happened to be exacerbated in this case because I had multiple Module[..] statements. The "main" Module was within a ParallelTable where 8 kernels were running at once. Tack on the (relatively) large number of iterations, and this was a breeding ground for lots of memory leaks due to the bug with Module.
Since you are using Module extensively, I think you may be interested in knowing about this bug where temporary Module variables are not deleted.
Example (unlinked temporary variables are not deleted along with their definitions):
In[1]:= $HistoryLength=0;
a[b_]:=Module[{c,d},d:=9;d/;b===1];
Length@Names[$Context<>"*"]
Out[3]= 6
In[4]:= lst=Table[a[1],{1000}];
Length@Names[$Context<>"*"]
Out[5]= 1007
In[6]:= lst=.
Length@Names[$Context<>"*"]
Out[7]= 1007
In[8]:= Definition@d$999
Out[8]= Attributes[d$999]={Temporary}
d$999:=9
Note that in the above code I set $HistoryLength = 0; to stress this buggy behavior of Module. If you do not do this, temporary variables can still be linked from history variables (In and Out) and so will not be removed along with their definitions in a broader set of cases (which is not a bug but a feature, as Leonid mentioned).
UPDATE: Just for the record, there is another old bug, dating from v5.2, in which unreferenced Module variables are not deleted after Part assignments to them; it is not completely fixed even in version 7.0.1:
In[1]:= $HistoryLength=0;$Version
Module[{L=Array[0&,10^7]},L[[#]]++&/@Range[100];];
Names["L$*"]
ByteCount@Symbol@#&/@Names["L$*"]
Out[1]= 7.0 for Microsoft Windows (32-bit) (February 18, 2009)
Out[3]= {L$111}
Out[4]= {40000084}
Have you tried evaluating $HistoryLength=0; in all subkernels as well as in the master kernel? History tracking is the most common source of running out of memory.
Have you tried avoiding the slow and memory-consuming Export and using the fast and efficient Put instead?
It is not clear from your post where you evaluate ClearSystemCache[] - in the master kernel or in the subkernels? It looks like you evaluate it only in the master kernel. Try evaluating it in all subkernels too, before each iteration.