I have a question that came up while learning PyTorch.
Reference: https://pytorch.org/docs/stable/cuda.html
current_memory = torch.cuda.memory_allocated(device=device)   # returns the GPU memory currently occupied by tensors
total_free_memory = torch.cuda.mem_get_info(device=device)    # returns (unused, total) GPU memory in bytes
total = total_free_memory[1]
unused = total_free_memory[0]
used_memory = total - unused
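For reference, here is a minimal runnable version of what I am measuring; the torch.cuda.memory_reserved line is an extra I added on the guess that the caching allocator's reserved-but-unallocated pool accounts for part of the gap:
import torch

device = torch.device("cuda:0")
x = torch.empty(1024, 1024, 512, device=device)  # allocate ~2 GiB so the numbers are non-trivial

unused, total = torch.cuda.mem_get_info(device=device)        # whole-GPU free/total, in bytes
current_memory = torch.cuda.memory_allocated(device=device)   # bytes currently held by tensors
reserved = torch.cuda.memory_reserved(device=device)          # bytes held by PyTorch's caching allocator

print(f"total         : {total / 2**30:.2f} GiB")
print(f"used_memory   : {(total - unused) / 2**30:.2f} GiB")
print(f"current_memory: {current_memory / 2**30:.2f} GiB")
print(f"reserved      : {reserved / 2**30:.2f} GiB")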
In my code,
total : 11.77GB
used_memory: 4.63GB
current_memory: 3.1GB
I wonder whether used_memory - current_memory is leaked memory.
Why do used_memory and current_memory have different values?
Thanks for any help.
I am running object detection on buffers from GStreamer and am using gst_buffer_extract_dup to create an image array from a GStreamer buffer. Here is a code snippet:
gstbuffer = gstsample.get_buffer()
caps_format = gstsample.get_caps().get_structure(0) # Gst.Structure
frmt_str = caps_format.get_value('format')
video_format = GstVideo.VideoFormat.from_string(frmt_str)
p, q = caps_format.get_value('width'), caps_format.get_value('height')
buf = gstbuffer.extract_dup(0, gstbuffer.get_size())
array = np.ndarray(shape=(q, p, 3),
                   buffer=buf,
                   dtype='uint8')
svg = self.user_function(gstbuffer, array, self.src_size, self.get_box())
I have discovered a substantial memory leak that causes the program to crash within 10 minutes, and I have identified extract_dup as the likely cause, since the GStreamer documentation says the duplicated data needs to be freed with g_free. The (potential) problem is that I cannot figure out the syntax for doing this. Trying GLib.free(buf) results in the error "GLib.free(buf)
ValueError: Pointer arguments are restricted to integers, capsules, and None. See: https://bugzilla.gnome.org/show_bug.cgi?id=683599
"
How would I free this memory? Furthermore, how can I confirm that this memory isn't being freed and is the cause of my leak?
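One pattern I have seen suggested for this situation (a sketch under the assumption that only a tightly packed RGB NumPy array of the frame is needed, not tested against this exact pipeline) is to map the buffer instead of duplicating it, so there is no separately owned copy to free:
import numpy as np
from gi.repository import Gst

def buffer_to_array(gstbuffer, width, height):
    # Map the buffer read-only instead of copying it with extract_dup.
    ok, map_info = gstbuffer.map(Gst.MapFlags.READ)
    if not ok:
        raise RuntimeError("could not map GstBuffer")
    try:
        # Copy into an array we own, then release the mapping.
        array = np.frombuffer(map_info.data, dtype=np.uint8)[:height * width * 3]
        array = array.reshape(height, width, 3).copy()
    finally:
        gstbuffer.unmap(map_info)
    return array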
I am using Ray to parallelize some computations, but it seems to keep accumulating spilled objects.
I don't mind it spilling objects to my hard drive, but I do mind if it means using over 130 GiB for processing about 1.6 GiB of simulations.
Below is a trace of what is happening:
Number of steps: 55 (9,091 simulations each)
0%
(raylet) Spilled 3702 MiB, 12 objects, write throughput 661 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 5542 MiB, 17 objects, write throughput 737 MiB/s.
2%
(raylet) Spilled 9883 MiB, 33 objects, write throughput 849 MiB/s.
5%
(raylet) Spilled 16704 MiB, 58 objects, write throughput 997 MiB/s.
13%
(raylet) Spilled 32903 MiB, 124 objects, write throughput 784 MiB/s.
29%
(raylet) Spilled 66027 MiB, 268 objects, write throughput 661 MiB/s.
53%
(raylet) Spilled 131920 MiB, 524 objects, write throughput 461 MiB/s.
60%
And here is the code I am running:
def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    results = ray.get([get_res_ray.remote(simulations=simulations)
                       for simulations in simulations_per_load])
    return np.vstack(results)
MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        results = get_res_parallel(simulations=step_simulations)
        expected_result += results.mean(axis=0)
    print(f"\r100%")
    return expected_result / num_steps
Running on a Mac M1 with 16 GiB of RAM, Ray 2.0.0 and Python 3.9.13.
Question
Given my code, is this normal behavior?
What can I do to resolve this problem? Force garbage collection?
Do you know the expected size of the array returned by get_res_ray?
Ray will spill objects returned by remote tasks as well as objects passed to remote tasks, so in this case there are two possible places that can cause memory pressure:
The ObjectRefs returned by get_res_ray.remote
The simulations passed to get_res_ray.remote. Since these are large, Ray will automatically put these in the local object store to reduce the size of the task definition.
It may be expected to spill if the size of these objects combined is greater than 30% of the RAM on your machine (this is the default size of Ray's object store). It's not suggested to increase the size of the object store, since this can cause memory pressure on the functions instead.
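To make the second point concrete: passing each large chunk directly as an argument is roughly what you would get by putting it into the object store yourself. A minimal sketch of that equivalence (using the asker's get_res_ray and a chunk like step_simulations):
# Passing the array directly: Ray implicitly stores the large argument in the object store.
ref_a = get_res_ray.remote(simulations=step_simulations)

# Roughly equivalent, with the put made explicit:
chunk_ref = ray.put(step_simulations)               # lives in the local object store
ref_b = get_res_ray.remote(simulations=chunk_ref)   # the task sees the deserialized array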
But you can try to either process fewer things in each iteration and/or you can try to release ObjectRefs sooner. In particular, you should try to release the results from the previous iteration as soon as possible, so that Ray can GC the objects for you. You can do this by calling del results once you're done using them.
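For example, a minimal sketch of the loop from the question with only an explicit del added:
for n in range(num_steps):
    step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
    results = get_res_parallel(simulations=step_simulations)
    expected_result += results.mean(axis=0)
    del results  # release the reference so Ray can evict/GC the underlying objects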
Here's a full suggestion that will do the same thing by feeding the array results into another task instead of getting them on the driver. This is usually a better approach because it avoids adding memory pressure on the driver and you're less likely to be accidentally pinning results in the driver's memory.
@ray.remote
def mean(*arrays):
    return np.vstack(arrays).mean(axis=0)

def get_res_parallel(simulations, num_loads=num_cpus):
    load_size = simulations.shape[0] / num_loads
    simulations_per_load = [simulations[round(n * load_size): round((n+1) * load_size)]
                            for n in range(num_loads)]
    # 2D numpy arrays
    # Use the * syntax in Python to unpack the ObjectRefs as function arguments.
    result = mean.remote(*[get_res_ray.remote(simulations=simulations)
                           for simulations in simulations_per_load])
    # We never have the result arrays stored in driver's memory.
    return ray.get(result)
MAX_RAM = 6 * 2**30  # 6 GiB

def get_expected_res(simulations, MAX_RAM=MAX_RAM):
    expected_result = np.zeros(shape=87_381, dtype=np.float64)
    bytes_per_res = len(expected_result) * (64 // 8)
    num_steps = simulations.shape[0] * bytes_per_res // MAX_RAM + 1
    step_size = simulations.shape[0] / num_steps
    print(f"Number of steps: {num_steps} ({step_size:,.0f} simulations each)")
    for n in range(num_steps):
        print(f"\r{n / num_steps:.0%}", end="")
        step_simulations = simulations[round(n * step_size): round((n+1) * step_size)]
        expected_result += get_res_parallel(simulations=step_simulations)
    print(f"\r100%")
    return expected_result / num_steps
I'm trying to create a zram device on my target device. My target cannot allocate memory if the zram disksize is above 100GB, but it's fine with a disksize of 50GB or less.
Is there any limit on the zram device disksize on Linux? My target device only has 2GB of RAM.
I guess you can give a number up to UINT64_MAX - 4095 = 18446744073709547520 on a 64-bit platform.
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.h#L101
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.c#L1506
https://github.com/torvalds/linux/blob/master/drivers/block/zram/zram_drv.c#L901
So what we have:
... disksize_store(...) {
    u64 disksize;
    ...
    // ok, we can give at least UINT64_MAX here
    // (memparse returns an unsigned long long).
    disksize = memparse(...);
    // PAGE_ALIGN, PAGE_SIZE = 1<<12
    disksize = PAGE_ALIGN(disksize)
             = (((disksize)+((PAGE_SIZE)-1))&(~((typeof(disksize))(PAGE_SIZE)-1)))
             = (disksize + ((1<<12)-1)) & (~((1<<12)-1))
             = (disksize + 4095) & 0xfffffffffffff000
    //          ^^^^^^^^^^^^^^^ this can overflow
    // so the max number is UINT64_MAX - 4095, so that it doesn't overflow;
    // otherwise this macro returns 0
    ...
    if (!zram_meta_alloc(..., disksize)) {
        ...
        return ...;
    }
    ...
    zram->disksize = disksize;
    ...
}
So let's look into zram_meta_alloc:
... zram_meta_alloc(..., disksize) {
    ...
    num_pages = disksize >> PAGE_SHIFT;
    // max num_pages = 0xfffffffffffff = UINT64_MAX >> PAGE_SHIFT
    ... = vzalloc(num_pages * sizeof(*zram->table));
    //            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this can overflow
    ...
}
vzalloc takes an unsigned long as its argument, and ULONG_MAX should be UINT64_MAX on a 64-bit platform. sizeof(*zram->table) is equal to sizeof(unsigned long) + sizeof(unsigned long) [optionally + sizeof(ktime_t)] + padding (see here). Without padding, and assuming a 64-bit platform where sizeof(unsigned long) = 8, that is 8+8[+8] = 16 or 24. In any case, the maximum num_pages is UINT64_MAX >> 12, so to overflow the 64-bit multiplication we would need sizeof(*zram->table) = 2^PAGE_SHIFT = 4096, which shouldn't happen (unless the compiler decides to add over 4000 bytes of padding to the zram->table struct). So we are left with UINT64_MAX - 4095.
So the maximum disksize is UINT64_MAX - 4095. If you give a disksize equal to UINT64_MAX - x, where 0 <= x < 4095, then because of the PAGE_ALIGN macro the disksize will effectively be set to 0. This should probably be brought up with the kernel developers so that the PAGE_ALIGN macro is modified to support such numbers.
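A quick way to check this arithmetic (a standalone sketch that just re-implements the two expressions in Python with explicit 64-bit wraparound):
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
UINT64_MAX = 2**64 - 1

def page_align_u64(disksize):
    # PAGE_ALIGN as the kernel computes it, with u64 wraparound on the addition
    return ((disksize + PAGE_SIZE - 1) % 2**64) & ~(PAGE_SIZE - 1)

print(page_align_u64(UINT64_MAX - 4095))  # 18446744073709547520, the largest usable value
print(page_align_u64(UINT64_MAX))         # 0, because the addition wrapped around

num_pages = (UINT64_MAX - 4095) >> PAGE_SHIFT
print(num_pages * 24 < 2**64)  # True: the vzalloc size does not overflow for a 16- or 24-byte table entry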
6 days ago, a call to array_size was added around the vzalloc argument to protect against this overflow, with this commit.
There is no limit but there is an overhead.
"Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful."
https://www.kernel.org/doc/Documentation/blockdev/zram.txt
Also, disksize is a virtual size that depends purely on the input and on the compression ratio achieved by the chosen algorithm; disksize sets the maximum uncompressed size and the general disk parameters.
The only 'actual' control is via mem_limit, which covers the compressed size plus disk & zram overheads.
The compression ratio depends entirely on the compression algorithm chosen from /proc/crypto; zlib & zstd are far more effective but also far slower. It also depends heavily on the input: with text, zlib & zstd can achieve over double what lzo & lz4 will.
If the input is already compressed, any algorithm may achieve little to no compression, and without a mem_limit zram could grab a lot of precious memory from the system.
mem_limit is the maximum you are prepared to let zram take from the system, and a disksize much larger than the expected compression ratio applied to mem_limit is likely a waste.
It will never get used, but it will still contribute to the ~0.1% empty-creation overhead.
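As an illustration of that sizing rule, here is a small sketch (shown in Python for brevity; the sysfs paths are the standard zram ones from zram.txt, and the 3:1 expected ratio is just an example assumption, not a recommendation):
GIB = 2**30
mem_limit = 1 * GIB                    # the real cap: compressed data zram may hold in RAM
expected_ratio = 3                     # assumed compression ratio for the workload (example only)
disksize = expected_ratio * mem_limit  # a much larger disksize would just be wasted overhead

# Standard zram sysfs attributes (requires root; zram0 must already exist).
with open("/sys/block/zram0/disksize", "w") as f:
    f.write(str(disksize))
with open("/sys/block/zram0/mem_limit", "w") as f:
    f.write(str(mem_limit))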
Maybe try https://github.com/StuartIanNaylor/zram-config
I am very new to OpenCL and am trying my first program. I implemented simple sinc filtering of waveforms. The code works; however, I have two questions:
1. Once I increase the size of the input matrix (numrows needs to go up to 100,000) I get clEnqueueReadBuffer failed: OUT_OF_RESOURCES, even though the matrix is relatively small (a few MB). This is to some extent related to the work group size, I think, but could someone elaborate on how I could fix this issue?
Could it be a driver issue?
UPDATE:
Leaving the group size as None crashes.
Adjusting the group size for the GPU (1,600) and the Intel HD (1,50) lets me go up to some 6400 rows. However, for larger sizes it crashes on the GPU, and the Intel HD just freezes and does nothing (0% on the resource monitor).
2. I have an Intel HD 4600 and an Nvidia K1100M GPU available; however, the Intel is ~2 times faster. I partially understand that this is because, with the integrated Intel GPU, I don't need to copy my arrays to separate device memory as I do with the discrete GPU. However, I expected only a marginal difference. Is this normal, or should my code be better optimized for the GPU? (resolved)
Thanks for your help !!
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import os
os.environ['PYOPENCL_COMPILER_OUTPUT'] = '1'
import matplotlib.pyplot as plt


def resample_opencl(y, key='GPU'):
    #
    # selecting to run on GPU or CPU
    #
    newlen = 1200
    my_platform = cl.get_platforms()[0]
    device = my_platform.get_devices()[0]
    for found_platform in cl.get_platforms():
        if (key == 'GPU') and (found_platform.name == 'NVIDIA CUDA'):
            my_platform = found_platform
            device = my_platform.get_devices()[0]
            print("using GPU")
    #
    # Create context for GPU/CPU
    #
    ctx = cl.Context([device])
    #
    # Create queue for each kernel execution
    #
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    # queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, """
    __kernel void resample(
        int M,
        __global const float *y_g,
        __global float *res_g)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
        int gs = get_global_size(1);
        __private float tmp, tmp2, x;
        __private float t;
        t = (float)(col)/2 + 1;
        tmp = 0;
        tmp2 = 0;
        for (int i = 0; i < M; i++)
        {
            x = (float)(i + 1);
            tmp2 = (t - x)*3.14159;
            if (t == x) {
                tmp += y_g[row*M + i];
            }
            else
                tmp += y_g[row*M + i] * sin(tmp2)/tmp2;
        }
        res_g[row*gs + col] = tmp;
    }
    """).build()

    mf = cl.mem_flags
    y_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
    res = np.zeros((np.shape(y)[0], newlen)).astype(np.float32)
    res_g = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)
    M = np.array(600).astype(np.int32)

    prg.resample(queue, res.shape, (1, 200), M, y_g, res_g)
    event = cl.enqueue_copy(queue, res, res_g)
    print("success")
    event.wait()
    return res, event


if __name__ == "__main__":
    #
    # this is the number I need to increase (up to some 100 000)
    numrows = 2000
    Gaussian = lambda t: 10 * np.exp(-(t - 50)**2 / (2. * 2**2))
    x = np.linspace(1, 101, 600, endpoint=False).astype(np.float32)
    t = np.linspace(1, 101, 1200, endpoint=False).astype(np.float32)
    y = np.zeros((numrows, np.size(x)))
    y[:] = Gaussian(x).astype(np.float32)
    y = y.astype(np.float32)

    res, event = resample_opencl(y, 'GPU')
    print("OpenCl GPU profiler", (event.profile.end - event.profile.start)*1e-9)
    #
    # test plot if it worked
    #
    plt.figure()
    plt.plot(x, y[1, :], '+')
    plt.plot(t, res[1, :])
Re 1.
Your newlen has to be divisible by 200 because that is what you set as the local dimensions (1, 200). I increased numrows to 9600 and that still worked fine.
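In general, each global work dimension has to be an exact multiple of the corresponding local dimension; a small sketch of that check (hypothetical helper, not part of the original code):
def check_work_sizes(global_size, local_size):
    # every global dimension must be an exact multiple of the local one
    for g, l in zip(global_size, local_size):
        assert g % l == 0, f"global size {g} not divisible by local size {l}"

check_work_sizes((2000, 1200), (1, 200))  # OK: 2000 % 1 == 0 and 1200 % 200 == 0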
Update
After your update I would suggest not specifying the local dimensions but letting the implementation decide:
prg.resample(queue, res.shape, None, M, y_g, res_g)
Also, it may improve performance if newlen and numrows were multiples of 16.
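For example, a small helper (name made up) for rounding a dimension up to the next multiple of 16; the padded rows/columns would then have to be ignored or guarded against in the kernel:
def round_up(n, multiple=16):
    # smallest multiple of `multiple` that is >= n
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(1234))  # 1248
print(round_up(1200))  # 1200, already a multiple of 16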
It is not a rule that an Nvidia GPU must perform better than an Intel GPU, especially as, according to Wikipedia, there is not a big difference in GFLOPS between them (549.89 vs 288–432). This GFLOPS comparison should be taken with a grain of salt, as one algorithm may be more suitable for one GPU than the other. In other words, looking at these numbers you may expect one GPU to typically be faster than the other, but that may vary from algorithm to algorithm.
The kernel for 100000 rows requires:
y_g: 100000 * 600 * 4 = 240000000 bytes ≈ 229 MB
res_g: 100000 * 1200 * 4 = 480000000 bytes ≈ 457.8 MB
A Quadro K1100M has 2GB of global memory, and that should be sufficient for processing 100000 rows. The Intel HD 4600, from what I found, is limited by the system memory, so I suspect that shouldn't be a problem either.
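A quick way to sanity-check that on the actual devices (a small pyopencl sketch using the buffer sizes computed above):
import pyopencl as cl

rows, M, newlen = 100000, 600, 1200
needed = rows * M * 4 + rows * newlen * 4   # y_g + res_g, in bytes

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(device.name,
              "| global mem:", device.global_mem_size // 2**20, "MiB",
              "| max single alloc:", device.max_mem_alloc_size // 2**20, "MiB",
              "| both buffers fit:", needed < device.global_mem_size)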
Re 2.
The time is not measured correctly. Instead of the kernel execution time, the time of copying the data back to the host is being measured, so it is no surprise that this number is lower for the CPU. To measure the kernel execution time do:
event = prg.resample(queue, res.shape, (1,200),M, y_g, res_g)
event.wait()
print ("OpenCl GPU profiler",(event.profile.end-event.profile.start)*1e-9)
I don't know how to measure the whole thing, including copying the data back to the host, using OpenCL profiling events in pyopencl, but plain Python timing gives similar results:
import time

start = time.time()
...  # code to be measured
end = time.time()
print(end - start)
I think I figured out the issue:
Intel HD: turning off profiling fixes everything. I can run the code without any issues.
The K1100M GPU still crashes, but I suspect this might be a timeout issue, as I am using the same video card for my display.
Situation: estimate whether you can compute a big matrix with your RAM and swap in MATLAB on Linux.
I need the sum of Mem and Swap, i.e. the corresponding values reported by free -m under the heading total in Linux:
total used free shared buff/cache available
Mem: 7925 3114 3646 308 1164 4220
Swap: 28610 32 28578
Free RAM memory in MATLAB:
% http://stackoverflow.com/a/12350678/54964
[r,w] = unix('free | grep Mem');
stats = str2double(regexp(w, '[0-9]*', 'match'));
memsize = stats(1)/1e6;
freeRamMem = (stats(3)+stats(end))/1e6;
Free swap memory in MATLAB: ...
Relation between memory requirement and matrix size in MATLAB: ...
Testing Suever's 2nd iteration
Suever's command gives me 29.2 GB, which corresponds to free's output, so it is correct.
$ free
total used free shared buff/cache available
Mem: 8115460 4445520 1956672 350692 1713268 3024604
Swap: 29297656 33028 29264628
System: Linux Ubuntu 16.04 64 bit
Linux kernel: 4.6
Linux kernel options: wl, zswap
Matlab: 2016a
Hardware: Macbook Air 2013-mid
Ram: 8 GB
Swap: 28 GB on SSD (set up as in the thread How to Allocate More Space to Swap and Increase its Size Greater than Ram?)
SSD: 128 GB
You can just make a slight modification to the code that you've posted to get the swap amount.
function freeMem = freeMemory(type)
    [r, w] = unix(['free | grep ', type]);
    stats = str2double(regexp(w, '[0-9]*', 'match'));
    memsize = stats(1)/1e6;
    if numel(stats) > 3
        freeMem = (stats(3)+stats(end))/1e6;
    else
        freeMem = stats(3)/1e6;
    end
end
totalFree = freeMemory('Mem') + freeMemory('Swap')
To figure out how much memory a matrix takes up, use the size of its datatype and multiply by the number of elements as a first approximation; for example, a 20000-by-20000 double matrix needs about 20000 * 20000 * 8 bytes ≈ 3.2 GB.