queue reader with tf.py_func yields memory leak

I am trying to write a queue reader that walks through a large file and runs a python function on each line before passing it to the actual operation.
I use a string_input_producer to read from a single .tsv file. Then I create a queue with tf.TextLineReader and enhance each line with tf.py_func. Doing this, I notice some memory leakage that only shows up if tf.py_func is called (yes, even as a noop).
Running the below code yields the following result:
$ python test_memory.py 2> /dev/null
run WITHOUT tf.py_func
00001/50000, 1.4260% mem
05001/50000, 1.4512% mem
10001/50000, 1.4512% mem
15001/50000, 1.4512% mem
20001/50000, 1.4512% mem
25001/50000, 1.4516% mem
30001/50000, 1.4516% mem
35001/50000, 1.4516% mem
40001/50000, 1.4516% mem
45001/50000, 1.4516% mem
50000/50000, 1.4516% mem
===========================
run WITH tf.py_func
00001/50000, 1.4975% mem
05001/50000, 1.5051% mem
10001/50000, 1.5066% mem
15001/50000, 1.5081% mem
20001/50000, 1.5110% mem
25001/50000, 1.5137% mem
30001/50000, 1.5148% mem
35001/50000, 1.5165% mem
40001/50000, 1.5195% mem
45001/50000, 1.5210% mem
50000/50000, 1.5235% mem
===========================
As you can see, running the code without tf.py_func keeps the used memory stable, whereas running it WITH the python function makes it constantly increase. This effect is much more pronounced on files with larger rows.
test_memory.py:
import os
import sys
import psutil
import tensorflow as tf

def py_funner(x, do_py=True):
    '''
    this function returns the exact input.
    if do_py==True, it passes the data through a python noop using tf.py_func
    '''
    if do_py:
        def py_func(y):
            # this is just another noop.
            return y
        # tf.py_func wraps a python function as a tensorflow op.
        return tf.py_func(py_func, [x], [tf.string], stateful=False)[0]
    else:
        return x

def get_data(do_py=True):
    # use the os module's source file as input. the effect is way more pronounced on larger files,
    # e.g., a tsv that encodes image data in base64, as for MS-Celeb-1M
    in_str = os.__file__
    # produce a queue that reads the one file row by row.
    input_queue = tf.train.string_input_producer([in_str])
    reader = tf.TextLineReader()
    ind, row = reader.read(input_queue)
    # call the wrapper to either include tf.py_func or not.
    return py_funner(row, do_py=do_py)

def main():
    # get the current process to monitor memory usage
    process = psutil.Process(os.getpid())
    # execute the same code both with a tf.py_func noop and without it
    for tt in [False, True]:
        print 'run WITH%s tf.py_func' % ('' if tt else 'OUT')
        # generate the data queue
        data = get_data(do_py=tt)
        # start the session and the queue coordinator
        sess = tf.Session()
        coord = tf.train.Coordinator()
        queue_threads = tf.train.start_queue_runners(sess, coord=coord)
        # read a lot of the file
        max_iter = 50000
        for i in range(max_iter):
            run_ops = [data]
            d = sess.run(run_ops)
            mem = process.memory_percent()
            print '\r%05d/%d, %.4f%% mem' % (i + 1, max_iter, mem),
            sys.stdout.flush()
            if i % 5000 == 0:
                print
        print '\n==========================='

if __name__ == '__main__':
    main()
I'm grateful for any pointers or ideas on how to debug this further. Maybe there is a way to see whether the python function keeps some kind of storage around?
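One way to check whether something accumulates on the Python side (just a sketch using the standard gc module; it is not TensorFlow-specific, and the 5000-iteration interval is arbitrary) would be to snapshot live object counts between batches of sess.run() calls and diff them:
import gc
from collections import Counter

def object_type_counts(top=20):
    # count live Python objects by type name
    return Counter(type(o).__name__ for o in gc.get_objects()).most_common(top)

# inside the read loop, e.g. every 5000 iterations:
#   before = dict(object_type_counts())
#   ...run another 5000 sess.run() calls...
#   after = dict(object_type_counts())
#   for name in after:
#       if after[name] > before.get(name, 0):
#           print name, after[name] - before.get(name, 0)  # types whose counts keep rising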
thanks!!

Related

Memory used value in pytorch is different

I have a question that came up while learning PyTorch.
reference: https://pytorch.org/docs/stable/cuda.html
current_memory = torch.cuda.memory_allocated(device=device)  # memory currently allocated by tensors
mem_get_info = torch.cuda.mem_get_info(device=device)  # returns (free, total) GPU memory
total = mem_get_info[1]
unused = mem_get_info[0]
used_memory = total - unused
In my code,
total : 11.77GB
used_memory: 4.63GB
current_memory: 3.1GB
I wonder whether used_memory - current_memory is leaked memory.
Why do the values used_memory and current_memory differ?
Thanks for any help.
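Part of that gap is often memory held by PyTorch's caching allocator (plus the CUDA context itself) rather than a leak. A quick sketch of how to compare the three numbers (torch.cuda.memory_reserved() reports what the caching allocator currently holds; the device name is just an example):
import torch

device = torch.device('cuda:0')  # example device

allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator (>= allocated)
free, total = torch.cuda.mem_get_info(device)    # device-wide numbers; include the CUDA context

print('allocated %.2f GB, reserved %.2f GB, used on device %.2f GB'
      % (allocated / 2**30, reserved / 2**30, (total - free) / 2**30))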

How to free memory from Gst.Buffer.extract_dup in Python using GLib.free()

I am running object detection on buffers from GStreamer and am utilizing gst_buffer_extract_dup to create an image array from a GStreamer buffer. Here is a code snippet:
gstbuffer = gstsample.get_buffer()
caps_format = gstsample.get_caps().get_structure(0) # Gst.Structure
frmt_str = caps_format.get_value('format')
video_format = GstVideo.VideoFormat.from_string(frmt_str)
p, q = caps_format.get_value('width'), caps_format.get_value('height')
buf = gstbuffer.extract_dup(0, gstbuffer.get_size())
array = np.ndarray(shape=(q, p, 3),
                   buffer=buf,
                   dtype='uint8')
svg = self.user_function(gstbuffer, array, self.src_size, self.get_box())
I have discovered a substantial memory leak that causes the program to crash within 10 minutes, and I have identified extract_dup as the likely cause, since the GStreamer documentation says the duplicated data needs to be freed with g_free. The (potential) problem is that I cannot figure out the syntax for doing this. Trying GLib.free(buf) results in the error:
ValueError: Pointer arguments are restricted to integers, capsules, and None. See: https://bugzilla.gnome.org/show_bug.cgi?id=683599
How would I free this memory? Furthermore, how can I confirm that this memory isn't being freed and is the cause of my leak?
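One alternative worth trying (a sketch, not tested against this exact pipeline): map the buffer with Gst.Buffer.map() instead of extract_dup(), copy the data into the NumPy array, and release the mapping with unmap(). That avoids the duplicated allocation entirely, so there is nothing that would need g_free afterwards.
import numpy as np
from gi.repository import Gst

success, mapinfo = gstbuffer.map(Gst.MapFlags.READ)
if success:
    try:
        # mapinfo.data exposes the raw bytes; copy them so the mapping can be released right away
        array = np.frombuffer(mapinfo.data, dtype=np.uint8).reshape(q, p, 3).copy()
    finally:
        gstbuffer.unmap(mapinfo)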

opencl speed and OUT_OF_RESOURCES

I am very new to OpenCL and trying my first program. I implemented a simple sinc filtering of waveforms. The code works; however, I have two questions:
1. Once I increase the size of the input matrix (numrows needs to go up to 100 000), I get clEnqueueReadBuffer failed: OUT_OF_RESOURCES, even though the matrix is relatively small (a few MB). This is to some extent related to the work group size, I think, but could someone elaborate on how I could fix this issue? Could it be a driver issue?
UPDATE:
Leaving the group size as None crashes.
Adjusting the group size for the GPU (1,600) and the Intel HD (1,50) lets me go up to some 6400 rows. However, for larger sizes it crashes on the GPU, and the Intel HD just freezes and does nothing (0% on the resource monitor).
2. I have an Intel HD4600 and an Nvidia K1100M GPU available; however, the Intel is ~2 times faster. I understand this is partially due to the fact that I don't need to copy my arrays to separate device memory for the Intel, unlike for the external GPU. However, I expected a marginal difference. Is this normal, or should my code be better optimized for the GPU? (resolved)
Thanks for your help!!
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import os
os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
import matplotlib.pyplot as plt

def resample_opencl(y, key='GPU'):
    #
    # selecting to run on GPU or CPU
    #
    newlen = 1200
    my_platform = cl.get_platforms()[0]
    device = my_platform.get_devices()[0]
    for found_platform in cl.get_platforms():
        if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):
            my_platform = found_platform
            device = my_platform.get_devices()[0]
            print("using GPU")
    #
    # Create context for GPU/CPU
    #
    ctx = cl.Context([device])
    #
    # Create queue for each kernel execution
    #
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    # queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, """
    __kernel void resample(
        int M,
        __global const float *y_g,
        __global float *res_g)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
        int gs = get_global_size(1);
        __private float tmp, tmp2, x;
        __private float t;
        t = (float)(col)/2 + 1;
        tmp = 0;
        tmp2 = 0;
        for (int i = 0; i < M; i++)
        {
            x = (float)(i + 1);
            tmp2 = (t - x) * 3.14159;
            if (t == x) {
                tmp += y_g[row*M + i];
            }
            else
                tmp += y_g[row*M + i] * sin(tmp2) / tmp2;
        }
        res_g[row*gs + col] = tmp;
    }
    """).build()
    mf = cl.mem_flags
    y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
    res = np.zeros((np.shape(y)[0], newlen)).astype(np.float32)
    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)
    M = np.array(600).astype(np.int32)
    prg.resample(queue, res.shape, (1, 200), M, y_g, res_g)
    event = cl.enqueue_copy(queue, res, res_g)
    print("success")
    event.wait()
    return res, event

if __name__ == "__main__":
    #
    # this is the number I need to increase (up to some 100 000)
    numrows = 2000
    Gaussian = lambda t: 10 * np.exp(-(t - 50)**2 / (2. * 2**2))
    x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
    t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
    y = np.zeros((numrows, np.size(x)))
    y[:] = Gaussian(x).astype(np.float32)
    y = y.astype(np.float32)
    res, event = resample_opencl(y, 'GPU')
    print("OpenCl GPU profiler", (event.profile.end - event.profile.start) * 1e-9)
    #
    # test plot if it worked
    #
    plt.figure()
    plt.plot(x, y[1, :], '+')
    plt.plot(t, res[1, :])
Re 1.
Your newlen has to be divisible by 200 because that is what you set as local dimensions (1,200). I increased this to 9600 and that still worked fine.
Update
After your update I would suggest not specifying local dimensions but letting the implementation decide:
prg.resample(queue, res.shape, None,M, y_g, res_g)
It may also improve performance if newlen and numrows were multiples of 16.
It is not a rule that an Nvidia GPU must perform better than an Intel GPU, especially since, according to Wikipedia, there is not a big difference in GFLOPS between them (549.89 vs 288–432). This GFLOPS comparison should be taken with a grain of salt, as one algorithm may be more suitable for one GPU than the other. In other words, judging by these numbers you may expect one GPU to typically be faster than the other, but that may vary from algorithm to algorithm.
Kernel for 100000 rows requires:
y_g: 100000 * 600 * 4 = 240000000 bytes =~ 229MB
res_g: 100000 * 1200 * 4 = 480000000 bytes =~ 457.8MB
The Quadro K1100M has 2GB of global memory, which should be sufficient for processing 100000 rows. The Intel HD 4600, from what I found, is limited by the system memory, so I suspect that shouldn't be a problem either.
Re 2.
The time is not measured correctly. Instead of measuring kernel execution time, the time of copying data back to the host is being measured. So it is no surprise that this number is lower for the CPU. To measure kernel execution time, do:
event = prg.resample(queue, res.shape, (1,200),M, y_g, res_g)
event.wait()
print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)
I don't know how to measure the whole thing, including copying data back to the host, using OpenCL profiling events in pyopencl, but plain Python timing gives similar results:
import time

start = time.time()
... # code to be measured
end = time.time()
print(end - start)
I think I figured out the issue:
Intel HD: turning off profiling fixes everything; I can run the code without any issues.
The K1100M GPU still crashes, but I suspect this might be a timeout issue, as I am using the same video card for my display.

Circle Piping to and from 2 Python Subprocesses

I need help regarding the subprocess module. This question might sound repetitive, and I have seen a number of articles related to it in various ways, but even so I am unable to solve my problem. It goes as follows:
I have a C program, 2.c; its contents are as follows:
#include <stdio.h>

int main()
{
    int a;
    scanf("%d", &a);
    while (1)
    {
        if (a == 0) // Specific case for the first input
        {
            printf("%d\n", (a + 1));
            break;
        }
        scanf("%d", &a);
        printf("%d\n", a);
    }
    return 0;
}
I need to write a python script which first compiles the code using subprocess.call() and then opens two processes using Popen to execute the respective C program. The output of the first process must be the input of the second and vice versa. So essentially, if my initial input was 0, then the first process outputs 2, which is taken by the second process. It in turn outputs 3, and so on, indefinitely.
The script below is what I had in mind, but it is flawed. If someone could help me, I would very much appreciate it.
from subprocess import *

call(["gcc", "2.c"])
a = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)  # Initiating Process
a.stdin.write('0')
temp = a.communicate()[0]
print temp
b = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)  # The 2 processes in question
c = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)
while True:
    b.stdin.write(str(temp))
    temp = b.communicate()[0]
    print temp
    c.stdin.write(str(temp))
    temp = c.communicate()[0]
    print temp
a.wait()
b.wait()
c.wait()
If you want the output of the first command a to go as the input of the second command b, and in turn b's output to be a's input (a circle, like a snake eating its tail), then you can't use .communicate() in a loop: .communicate() doesn't return until the process is dead and all the output has been consumed.
One solution is to use a named pipe (if open() doesn't block in this case on your system):
#!/usr/bin/env python3
import os
from subprocess import Popen, PIPE

path = 'fifo'
os.mkfifo(path)  # create named pipe
try:
    with open(path, 'r+b', 0) as pipe, \
         Popen(['./a.out'], stdin=PIPE, stdout=pipe) as b, \
         Popen(['./a.out'], stdout=b.stdin, stdin=pipe) as a:
        pipe.write(b'10\n')  # kick-start it
finally:
    os.remove(path)  # clean up
It emulates the a < fifo | b > fifo shell command from @alexander barakin's answer.
Here's a more complex solution that funnels the data via the Python parent process:
#!/usr/bin/env python3
import shutil
from subprocess import Popen, PIPE

with Popen(['./a.out'], stdin=PIPE, stdout=PIPE, bufsize=0) as b, \
     Popen(['./a.out'], stdout=b.stdin, stdin=PIPE, bufsize=0) as a:
    a.stdin.write(b'10\n')  # kick-start it
    shutil.copyfileobj(b.stdout, a.stdin)  # copy b's stdout to a's stdin
This code connects a's output to b's input using redirection via OS pipe (as a | b shell command does).
To complete the circle, b's output is copied to a's input in the parent Python code using shutil.copyfileobj().
This code may have buffering issues: there are multiple buffers in between the processes (C stdio buffers, and the buffers in the Python file objects wrapping the pipes, controlled by bufsize).
bufsize=0 turns off the buffering on the Python side, and the data is copied as soon as it is available. Beware: bufsize=0 may lead to partial writes, so you might need to inline copyfileobj() and call write() again until all the read data is written.
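If partial writes do show up, a minimal sketch of that inlined loop could look like this (the chunk size is arbitrary; the point is to keep calling write() until each chunk has been fully written):
def copy_all(src, dst, chunk_size=8192):
    # manual replacement for shutil.copyfileobj() that tolerates partial writes
    while True:
        chunk = src.read(chunk_size)
        if not chunk:              # EOF: the source process closed its end
            break
        view = memoryview(chunk)
        while view:                # keep writing until the whole chunk is gone
            written = dst.write(view)
            view = view[written:]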
Call setvbuf(stdout, (char *) NULL, _IOLBF, 0) to make stdout line-buffered inside your C program:
#include <stdio.h>

int main(void)
{
    int a;
    setvbuf(stdout, (char *) NULL, _IOLBF, 0); /* make line buffered stdout */
    do {
        scanf("%d", &a);
        printf("%d\n", a - 1);
        fprintf(stderr, "%d\n", a); /* for debugging */
    } while (a > 0);
    return 0;
}
Output
10
9
8
7
6
5
4
3
2
1
0
-1
The output is the same.
Due to the way the C child program is written and executed, you might also need to catch and ignore the BrokenPipeError exception at the end, on a.stdin.write() and/or a.stdin.close() (a process may already be dead while there is still uncopied data from b).
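For example (a sketch of the shutdown handling only; the names match the copyfileobj snippet above):
try:
    shutil.copyfileobj(b.stdout, a.stdin)
    a.stdin.close()
except BrokenPipeError:
    pass  # a already exited; whatever is left of b's output has nowhere to go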
The problem is here:
while True:
    b.stdin.write(str(temp))
    temp = b.communicate()[0]
    print temp
    c.stdin.write(str(temp))
    temp = c.communicate()[0]
    print temp
Once communicate has returned, it does nothing more. You have to run the process again. Also, you don't need 2 processes open at the same time.
And the init phase is not different from the running phase, except that you provide the input data.
What you could do to simplify it and make it work:
from subprocess import *

call(["gcc", "2.c"])
temp = str(0)
while True:
    b = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)
    b.stdin.write(temp)
    temp = b.communicate()[0]
    print temp
    b.wait()
Otherwise, to see 2 processes running in parallel, proving that you can do that, just fix your loop as follows (by moving the Popen calls inside the loop):
while True:
    b = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)  # The 2 processes in question
    c = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)
    b.stdin.write(str(temp))
    temp = b.communicate()[0]
    print temp
    c.stdin.write(str(temp))
    temp = c.communicate()[0]
    print temp
Better yet, have b's output feed c's input:
while True:
    b = Popen(["./a.out"], stdin=PIPE, stdout=PIPE)  # The 2 processes in question
    c = Popen(["./a.out"], stdin=b.stdout, stdout=PIPE)
    b.stdin.write(str(temp))
    temp = c.communicate()[0]
    print temp

How can I keep memory from exploding when child processes touch variable metadata?

Linux uses COW to keep memory usage low after a fork, but the way Perl 5 variables work in perl seems to defeat this optimization. For instance, for the variable:
my $s = "1";
perl is really storing:
SV = PV(0x100801068) at 0x1008272e8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x100201d50 "1"\0
CUR = 1
LEN = 16
When you use that string in a numeric context, it modifies the C struct representing the data:
SV = PVIV(0x100821610) at 0x1008272e8
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x100201d50 "1"\0
CUR = 1
LEN = 16
The string pointer itself did not change (it is still 0x100201d50), but now it is in a different C struct (a PVIV instead of a PV). I did not modify the value at all, but suddenly I am paying a COW cost. Is there any way to lock the perl representation of a Perl 5 variable so that this time saving (perl doesn't have to convert "0" to 0 a second time) hack doesn't hurt my memory usage?
Note, the representations above were generated from this code:
perl -MDevel::Peek -e '$s = "1"; Dump $s; $s + 0; Dump $s'
The only solution I have found so far is to make sure I force perl to do all of the conversions I expect in the parent process. And as you can see from the output below, even that only helps a little.
Results:
Useless use of addition (+) in void context at z.pl line 34.
Useless use of addition (+) in void context at z.pl line 45.
Useless use of addition (+) in void context at z.pl line 51.
before eating memory
used memory: 71
after eating memory
used memory: 119
after 100 forks that don't reference variable
used memory: 144
after children are reaped
used memory: 93
after 100 forks that touch the variables metadata
used memory: 707
after children are reaped
used memory: 93
after parent has updated the metadata
used memory: 109
after 100 forks that touch the variables metadata
used memory: 443
after children are reaped
used memory: 109
Code:
#!/usr/bin/perl

use strict;
use warnings;

use Parallel::ForkManager;

sub print_mem {
    print @_, "used memory: ", `free -m` =~ m{cache:\s+([0-9]+)}s, "\n";
}

print_mem("before eating memory\n");
my @big = ("1") x (1_024 * 1024);
my $pm = Parallel::ForkManager->new(100);
print_mem("after eating memory\n");

for (1 .. 100) {
    next if $pm->start;
    sleep 2;
    $pm->finish;
}
print_mem("after 100 forks that don't reference variable\n");
$pm->wait_all_children;
print_mem("after children are reaped\n");

for (1 .. 100) {
    next if $pm->start;
    $_ + 0 for @big; #force an update to the metadata
    sleep 2;
    $pm->finish;
}
print_mem("after 100 forks that touch the variables metadata\n");
$pm->wait_all_children;
print_mem("after children are reaped\n");

$_ + 0 for @big; #force an update to the metadata
print_mem("after parent has updated the metadata\n");

for (1 .. 100) {
    next if $pm->start;
    $_ + 0 for @big; #force an update to the metadata
    sleep 2;
    $pm->finish;
}
print_mem("after 100 forks that touch the variables metadata\n");
$pm->wait_all_children;
print_mem("after children are reaped\n");
Even if you avoid COW at startup and during the run, don't forget the END phase of the process lifetime. During shutdown there are two GC phases; in the first one the reference counts are updated, so it can still hurt you there. You can solve it in an ugly way:
END { kill 9, $$ }
This goes without saying, but COW doesn't happen on a per-struct basis, but on a memory page basis. So it's enough that one thing in an entire memory page be modified like this for you to pay the copying cost.
On Linux you can query the page size like this:
getconf PAGESIZE
On my system that's 4096 bytes. You can fit a lot of Perl scalar structs in that space. If any one of them gets modified, Linux has to copy the entire page.
This is why using memory arenas is a good idea in general. You should separate your mutable and immutable data so that you won't have to pay COW costs for immutable data just because it happened to reside in the same memory page as mutable data.
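For what it's worth, the same page-size number can also be read programmatically; a tiny sketch in Python using only the standard library:
import mmap
import resource

print(mmap.PAGESIZE)           # page size in bytes, via the mmap module
print(resource.getpagesize())  # same value via the resource module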
