Feature extraction in loop seems to cause memory leak in PyTorch

I have spent considerable time trying to debug some PyTorch code, and I have created a minimal example of it to help pinpoint what the issue might be.
I have removed all portions of the code that are unrelated to the issue, so the remaining snippet won't make much sense from a functional standpoint, but it still reproduces the error I'm facing.
The overall task runs in a loop: every pass computes the embedding of the image and adds it into a variable that accumulates it. It is effectively aggregating (not concatenating), so the size stays the same. I don't expect the number of iterations to overflow the datatype, and I don't see that happening here or in my full code.
I have added several metrics to track the size of the tensors I'm working with, to make sure their memory footprint is not growing.
I'm also checking the overall GPU memory usage to confirm the growth that leads to the final RuntimeError: CUDA out of memory.
My environment is as follows:
- Python 3.6.2
- PyTorch 1.4.0
- CUDA toolkit 10.0
- Driver version 410.78
- GPU: Nvidia GeForce GT 1030 (2GB VRAM)
(though I've replicated this experiment with the same result on a Titan RTX with 24 GB VRAM, the same PyTorch version, CUDA toolkit, and driver; it only runs out of memory further along in the loop).
Complete code below. I have marked 2 lines as culprits, as deleting them removes the issue, though obviously I need to find a way to execute them without having memory issues. Any help would be much appreciated! You may try with any image named "source_image.bmp" to replicate the issue.
import torch
from PIL import Image
import torchvision
from torchvision import transforms
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo, nvmlInit
import sys
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0' # this is necessary on my system to allow the environment to recognize my nvidia GPU for some reason
os.environ['CUDA_LAUNCH_BLOCKING'] = '1' # to debug by having all CUDA functions executed in place
torch.set_default_tensor_type('torch.cuda.FloatTensor')
# Preprocess image
tfms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),])
img = tfms(Image.open('source_image.bmp')).unsqueeze(0).cuda()
model = torchvision.models.resnet50(pretrained=True).cuda()
model.eval() # we put the model in evaluation mode, to prevent storage of gradient which might accumulate
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
print(f'Total available memory : {info.total / 1000000000}')
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])
orig_embedding = feature_extractor(img)
embedding_depth = 2048
mem0 = 0
embedding = torch.zeros(2048, img.shape[2], img.shape[3]) #, dtype=torch.float)
patch_size=[4,4]
patch_stride=[2,2]
patch_value=0.0
# Here, we iterate over the patch placement, defined at the top left location
for row in range(img.shape[2]-1):
    for col in range(img.shape[3]-1):
        print("######################################################")
        ######################################################
        # Isolated line, culprit 1 of the GPU memory leak
        ######################################################
        patched_embedding = feature_extractor(img)
        delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
        ######################################################
        # Isolated line, culprit 2 of the GPU memory leak
        ######################################################
        embedding[:,row:row+1,col:col+1] = torch.add(embedding[:,row:row+1,col:col+1], delta_embedding)
        print("img size:\t\t", img.element_size() * img.nelement())
        print("patched_embedding size:\t", patched_embedding.element_size() * patched_embedding.nelement())
        print("delta_embedding size:\t", delta_embedding.element_size() * delta_embedding.nelement())
        print("Embedding size:\t\t", embedding.element_size() * embedding.nelement())
        del patched_embedding, delta_embedding
        torch.cuda.empty_cache()
        info = nvmlDeviceGetMemoryInfo(h)
        print("\nMem usage increase:\t", info.used / 1000000000 - mem0)
        mem0 = info.used / 1000000000
        print(f'Free:\t\t\t {(info.total - info.used) / 1000000000}')
print("Done.")

Add this to your code as soon as you load the model:
for param in model.parameters():
    param.requires_grad = False
from https://pytorch.org/docs/stable/notes/autograd.html#excluding-subgraphs-from-backward
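Equivalently (a minimal sketch, assuming the same loop body as in the question), you can wrap the repeated feature extraction in a torch.no_grad() block so that no autograd graph is recorded for those forward passes:

import torch

# Sketch only: feature_extractor, img, orig_embedding, embedding, row and col
# are assumed to be defined exactly as in the question's code.
with torch.no_grad():  # nothing inside this block is tracked by autograd
    patched_embedding = feature_extractor(img)
    delta_embedding = (patched_embedding - orig_embedding).view(-1, 1, 1)
    # equivalent to the torch.add(...) assignment in the question
    embedding[:, row:row+1, col:col+1] += delta_embedding

Without one of these two changes, every call to feature_extractor(img) builds a new autograd graph, and because the result is written into embedding, which stays alive across iterations, those graphs accumulate in GPU memory pass after pass.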

Related

How to Fully Utilize CPU cores for skopt.forest_minimize

So I have the following code for running skopt.forest_minimize(), but the biggest challenge I am facing right now is that it is taking upwards of days to finish running even just 2 iterations.
SPACE = [skopt.space.Integer(4, max_neighbour, name='n_neighbors', prior='log-uniform'),
         skopt.space.Integer(6, 10, name='nr_cubes', prior='uniform'),
         skopt.space.Categorical(overlap_cat, name='overlap_perc')]

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    score, scomp = tune_clustering(X_cont=X_cont, df=df, pl_brewer=pl_brewer, **params)
    if score == 0:
        print('saving new scomp')
        with open(scomp_file, 'w') as filehandle:
            json.dump(scomp, filehandle, default=json_default)
    return score

results = skopt.forest_minimize(objective, SPACE, n_calls=1, n_initial_points=1, callback=[scoring])
Is it possible to optimize this code so that it computes faster? I noticed that it was barely making use of my CPU; the highest CPU utilization is about 30% (it's a 9th-gen i7 with 8 cores).
Also, a question while I'm at it: is it possible to utilize a GPU for these computational tasks? I have a 3050 that I can use.
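If the surrogate forest fit (rather than tune_clustering itself) turns out to be the bottleneck, one thing to try, as a minimal sketch assuming your scikit-optimize version exposes the n_jobs argument on forest_minimize, is letting the forest use all cores; note that this does not parallelize calls to the objective function:

import skopt

# Sketch: SPACE, objective and scoring are assumed to be defined as above.
results = skopt.forest_minimize(
    objective,
    SPACE,
    n_calls=10,           # illustrative value, more than the n_calls=1 above
    n_initial_points=5,
    n_jobs=-1,            # assumption: use all CPU cores when fitting the surrogate forest
    callback=[scoring],
)

If most of the wall-clock time is spent inside tune_clustering, the parallelism has to come from inside that function instead.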

RuntimeError on running ALBERT for obtaining encoding vectors from text

I'm trying to get feature vectors from the encoder model using pre-trained ALBERT v2 weights. I have an Nvidia 1650 Ti GPU (4 GB) and sufficient RAM (8 GB), but for some reason I'm getting a RuntimeError saying:
RuntimeError: [enforce fail at …\c10\core\CPUAllocator.cpp:75] data.
DefaultCPUAllocator: not enough memory: you tried to allocate
491520000 bytes. Buy new RAM!
I’m really new to pytorch and deep learning in general. Can anyone please tell me what is wrong?
My entire code -
encoded_test_data = tokenized_test_values['input_ids']
encoded_test_masks = tokenized_test_values['attention_mask']
encoded_train_data = torch.from_numpy(encoded_train_data).to(device)
encoded_masks = torch.from_numpy(encoded_masks).to(device)
encoded_test_data = torch.from_numpy(encoded_test_data).to(device)
encoded_test_masks = torch.from_numpy(encoded_test_masks).to(device)
config = EncoderDecoderConfig.from_encoder_decoder_configs(BertConfig(), BertConfig())
EnD_model = EncoderDecoderModel.from_pretrained('albert-base-v2', config=config)
feature_extractor = EnD_model.get_encoder()
feature_vector = feature_extractor.forward(input_ids=encoded_train_data, attention_mask=encoded_masks)
feature_test_vector = feature_extractor.forward(input_ids=encoded_test_data, attention_mask=encoded_test_masks)
Also, 491520000 bytes is about 490 MB, which should not be a problem.
I tried reducing the number of training examples and also the length of the maximum padded input. The OOM error still exists even though the required space is now 153 MB, which should easily be manageable.
I have also maxed out the heap size limit of the PyCharm IDE to 2048 MB. I really don't know what to do now...

Linux memory allocation on Rstudio/Rstudio Server

I am trying to do clustering with CLARA using RStudio on Linux, and I have a very large dataset.
However, it seems that the memory is not enough for the whole dataset.
## Estimating the number of clusters ----
fviz_nbclust(df, clara, method = "silhouette", k.max = 15)
It showed me this:
Error: cannot allocate vector of size 339.8 GB
So I tried all of the following and it still didn't work. memory.limit() is also Windows-specific (I still gave it a try, though).
# devtools::install_github("krlmlr/ulimit")
# gc()
# memory.limit(9999999999)
#
#
# install.packages("devtools", dependencies = TRUE)
# devtools::install_github("krlmlr/ulimit")
# ulimit::memory_limit(2000)
#
# devtools::install_github("jeroen/unix")
#
#
# if(.Platform$OS.type == "windows") withAutoprint({
# memory.size()
# memory.size(TRUE)
# memory.limit()
# })
# memory.limit(size=56000)
# memory.size(max = FALSE)
Can somebody help me with this?
Any help would be appreciated!
The error simply means that R cannot allocate a 339.8 GB vector in your RAM. Do you have 360 GB of RAM?
If not, you will just have to take a subset, e.g. with dplyr::sample_n() (or dplyr::slice_sample()), and run the function on that subset of your dataset.

Gradients vanishing despite using Kaiming initialization

I was implementing a conv block in PyTorch with an activation function (PReLU). I used Kaiming initialization to initialize all my weights and set all the biases to zero. However, when I tested these blocks (by stacking 100 such conv + activation blocks on top of each other), I noticed that the output values I am getting are of the order of 10^(-10). Is this normal, considering I am stacking up to 100 layers? Adding a small bias to each layer fixes the problem, but in Kaiming initialization the biases are supposed to be zero.
Here is the conv block code
from collections.abc import Iterable  # collections.Iterable is deprecated and removed in Python 3.10

import numpy as np
import torch
from torch import nn


def convBlock(
    input_channels, output_channels, kernel_size=3, padding=None, activation="prelu"
):
    """
    Initializes a conv block using Kaiming Initialization
    """
    padding_par = 0
    if padding == "same":
        padding_par = same_padding(kernel_size)  # same_padding: helper not shown in this snippet
    conv = nn.Conv2d(input_channels, output_channels, kernel_size, padding=padding_par)
    relu_negative_slope = 0.25
    act = None
    if activation == "prelu" or activation == "leaky_relu":
        nn.init.kaiming_normal_(conv.weight, a=relu_negative_slope, mode="fan_in")
        if activation == "prelu":
            act = nn.PReLU(init=relu_negative_slope)
        else:
            act = nn.LeakyReLU(negative_slope=relu_negative_slope)
    if activation == "relu":
        nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")
        act = nn.ReLU()
    nn.init.constant_(conv.bias.data, 0)
    block = nn.Sequential(conv, act)
    return block


def flatten(lis):
    # Recursively flatten nested iterables (but not strings) into single items
    for item in lis:
        if isinstance(item, Iterable) and not isinstance(item, str):
            for x in flatten(item):
                yield x
        else:
            yield item


def Sequential(args):
    flattened_args = list(flatten(args))
    return nn.Sequential(*flattened_args)
This is the test code:
ls = []
for i in range(100):
    ls.append(convBlock(3, 3, 3, "same"))
model = Sequential(ls)
test = np.ones((1, 3, 5, 5))
model(torch.Tensor(test))
And the output I am getting is
tensor([[[[-1.7771e-10, -3.5088e-10, 5.9369e-09, 4.2668e-09, 9.8803e-10],
[ 1.8657e-09, -4.0271e-10, 3.1189e-09, 1.5117e-09, 6.6546e-09],
[ 2.4237e-09, -6.2249e-10, -5.7327e-10, 4.2867e-09, 6.0034e-09],
[-1.8757e-10, 5.5446e-09, 1.7641e-09, 5.7018e-09, 6.4347e-09],
[ 1.2352e-09, -3.4732e-10, 4.1553e-10, -1.2996e-09, 3.8971e-09]],
[[ 2.6607e-09, 1.7756e-09, -1.0923e-09, -1.4272e-09, -1.1840e-09],
[ 2.0668e-10, -1.8130e-09, -2.3864e-09, -1.7061e-09, -1.7147e-10],
[-6.7161e-10, -1.3440e-09, -6.3196e-10, -8.7677e-10, -1.4851e-09],
[ 3.1475e-09, -1.6574e-09, -3.4180e-09, -3.5224e-09, -2.6642e-09],
[-1.9703e-09, -3.2277e-09, -2.4733e-09, -2.3707e-09, -8.7598e-10]],
[[ 3.5573e-09, 7.8113e-09, 6.8232e-09, 1.2285e-09, -9.3973e-10],
[ 6.6368e-09, 8.2877e-09, 9.2108e-10, 9.7531e-10, 7.0011e-10],
[ 6.6954e-09, 9.1019e-09, 1.5128e-08, 3.3151e-09, 2.1899e-10],
[ 1.2152e-08, 7.7002e-09, 1.6406e-08, 1.4948e-08, -6.0882e-10],
[ 6.9930e-09, 7.3222e-09, -7.4308e-10, 5.2505e-09, 3.4365e-09]]]],
grad_fn=<PreluBackward>)
Amazing question (and welcome to StackOverflow)! Research paper for quick reference.
TLDR
Try wider networks (64 channels)
Add Batch Normalization after activation (or even before, shouldn't make much difference)
Add residual connections (shouldn't improve much over batch norm, last resort)
Please check these out in this order and leave a comment on what (if anything) worked in your case, as I'm also curious.
Things you do differently
Your neural network is very deep, yet very narrow (only 81 parameters per layer!)
Because of that, one cannot reliably draw those weights from a normal distribution, as the sample is simply too small.
Try wider networks, 64 channels or more.
You are trying a much deeper network than they did.
Section: Comparison Experiments
We conducted comparisons on a deep but efficient model with 14 weight
layers (actually 22 was also tested in comparison with Xavier)
That was due to the release date of this paper (2015) and the hardware limitations "back in the days" (let's say).
Is this normal?
The approach itself is quite strange with layers of this depth, at least currently;
each conv block is usually followed by an activation like ReLU and Batch Normalization, which normalizes the signal and helps with exploding/vanishing signals (a minimal sketch of such a block is at the end of this answer);
networks of this depth (even half as deep as what you've got) usually also use residual connections, though that is not directly linked to vanishing/small signals; it is more about the degradation problem of very deep networks, like 1000 layers.
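A minimal sketch of the wider conv + BatchNorm block suggested above (not your convBlock; it only keeps your PReLU slope and Kaiming setup):

import torch
from torch import nn

def conv_bn_block(in_channels, out_channels, kernel_size=3, negative_slope=0.25):
    # Conv -> PReLU -> BatchNorm, Kaiming-initialized (sketch only)
    conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=kernel_size // 2)
    nn.init.kaiming_normal_(conv.weight, a=negative_slope, mode="fan_in")
    nn.init.constant_(conv.bias, 0)
    # BatchNorm re-normalizes the signal after every block, so activations
    # no longer shrink toward zero as depth grows.
    return nn.Sequential(conv, nn.PReLU(init=negative_slope), nn.BatchNorm2d(out_channels))

# 100 blocks, but 64 channels wide instead of 3
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    *[conv_bn_block(64, 64) for _ in range(100)],
)
print(model(torch.ones(1, 3, 5, 5)).abs().mean())  # should be around 1, not 1e-10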

How to measure ONLY the inference time in the GPU, using TensorRT and PyCUDA?

I want to measure ONLY the inference time on the Jetson TX2. How can I improve my function to do that? Right now I am measuring:
the transfer of the image from CPU to GPU
the transfer of the results from GPU to CPU
the inference
Or is that not possible because of the way GPUs work? I mean, how many times will I have to use stream.synchronize() if I divide/segment the function into 3 parts (see the sketch after my current code below):
transfer from CPU to GPU
inference
transfer from GPU to CPU
Thank you
CODE IN INFERENCE.PY
def do_inference(engine, pics_1, h_input, d_input, h_output, d_output, stream, batch_size):
    """
    This is the function to run the inference
    Args:
        engine : Path to the TensorRT engine.
        pics_1 : Input images to the model.
        h_input: Input in the host (CPU).
        d_input: Input in the device (GPU).
        h_output: Output in the host (CPU).
        d_output: Output in the device (GPU).
        stream: CUDA stream.
        batch_size : Batch size for execution time.
        height: Height of the output image.
        width: Width of the output image.
    Output:
        The list of output images.
    """
    # Context for executing inference using ICudaEngine
    with engine.create_execution_context() as context:
        # Transfer input data from CPU to GPU.
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference.
        #context.profiler = trt.Profiler() ## shows execution time (ms) of each layer
        context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
        # Transfer predictions back from the GPU to the CPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output
        return out
CODE IN TIMER.PY
for i in range(count):
    start = time.perf_counter()
    # Classification - calling TX2_classify.py
    out = eng.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1)
    inference_time = time.perf_counter() - start
    print("TIME")
    print(inference_time * 1000)
    print("\n")
    pred = postprocess_inception(out)
    print(pred)
    print("\n")
