PyTorch pre-allocation to avoid OOM does not work - Linux

So, I am trying to fine-tune FCoref using the trainer in https://github.com/shon-otmazgin/fastcoref
The trainer uses dynamic batching with variable input lengths, and this creates an issue on CUDA because once PyTorch allocates memory for the first batch, it does not grow that allocation for later, larger batches.
So, following this guide: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#pre-allocate-memory-in-case-of-variable-input-length
I added the following to my code and call it before the actual training starts (right after creating the model and moving it to CUDA):
batch = {
    "input_ids": torch.rand(9, 5, 512),
    "attention_mask": torch.rand(9, 5, 512),
    "gold_clusters": torch.rand(9, 58, 39, 2),
    "leftovers": {
        "input_ids": torch.rand(4),
        "attention_mask": torch.rand(4),
    }
}

batch['input_ids'] = torch.tensor(batch['input_ids'], device=self.device)
batch['attention_mask'] = torch.tensor(batch['attention_mask'], device=self.device)
batch['gold_clusters'] = torch.tensor(batch['gold_clusters'], device=self.device)
if 'leftovers' in batch:
    batch['leftovers']['input_ids'] = torch.tensor(batch['leftovers']['input_ids'], device=self.device)
    batch['leftovers']['attention_mask'] = torch.tensor(batch['leftovers']['attention_mask'], device=self.device)

self.model.zero_grad()
self.model.train()
with torch.cuda.amp.autocast():
    outputs = self.model(batch, gold_clusters=batch['gold_clusters'], return_all_outputs=False)
loss = outputs[0]  # model outputs are always a tuple in transformers (see docs)
loss.backward()
At first, I was getting OOM issues with this because it was too big (I basically created the biggest tensors in each key according to my dataset).
So, instead, I created a batch that looks like my biggest batch in the actual data (according to the sum of tensor sizes):
batch = {
    "input_ids": torch.rand(4, 1, 512),
    "attention_mask": torch.rand(4, 1, 512),
    "gold_clusters": torch.rand(4, 11, 24, 2),
    "leftovers": {
        "input_ids": torch.rand(4, 459),
        "attention_mask": torch.rand(4, 459),
    }
}
Now, this works but when the actual training starts, I run into the same issue even though the first batch is smaller than the pre-allocation batch:
OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 14.56 GiB total capacity; 13.31 GiB already allocated; 36.44 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
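To check whether this is really fragmentation (reserved >> allocated, as the message suggests) or simply too much live memory, the allocator statistics can be printed right after the warm-up pass and again just before the first real batch. A small helper for that, using standard torch.cuda calls (nothing specific to fastcoref):

def report_cuda_memory(tag):
    # live tensor memory vs. memory held by the caching allocator
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
    # detailed breakdown per block size, useful for spotting fragmentation
    print(torch.cuda.memory_summary(abbreviated=True))

Calling this after the warm-up pass and before the first real batch shows whether the 13.31 GiB in the error is live data (gradients, activations) or just cached blocks.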
Other things I tried:
export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:21'
Decreasing batch size, but due to the variability I keep running into the same issue.
My machine:
runs Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux
T4 GPU with 16 GB VRAM
Any idea?
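One thing I have not tried yet, in case it matters: explicitly releasing the warm-up tensors and gradients after the dummy forward/backward pass, since they otherwise stay alive and count towards the "already allocated" number rather than towards reusable cache. A rough, untested sketch (del and zero_grad(set_to_none=True) are standard PyTorch; empty_cache() is deliberately left out so the cached blocks survive for the real batches):

# right after the warm-up loss.backward() above
del outputs, loss                        # drop the warm-up graph and activations
del batch                                # drop the dummy tensors themselves
self.model.zero_grad(set_to_none=True)   # free the gradient buffers instead of just zeroing them
# no torch.cuda.empty_cache() here: the point of the warm-up is to keep the cache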

Related

How long does load_dataset take in huggingface?

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still stuck on load_dataset.
What is it doing? Is it normal for load_dataset to take this much time? Should I interrupt it and run it again?
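One thing I could try to take the 9.74 GiB download and 37 GiB generation step out of the picture is to stream the corpus instead of materializing it; recent versions of datasets support streaming=True, and the tokenizer only needs an iterator anyway. A sketch of what that could look like (untested with SentencePieceUnigramTokenizer):

import datasets

# Stream the split instead of downloading and generating it on disk first.
dataset = datasets.load_dataset(
    "oscar", name="unshuffled_deduplicated_fa", split="train", streaming=True
)

def batch_iterator(batch_length=100):
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_length:
            yield batch
            batch = []
    if batch:
        yield batch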

torch cannot allocate a small tensor (< 1 GB) on GPU but it can on CPU on the same node with 400+ GB memory on Databricks

My question is related to my previous one, pytorch allocate memory for small size tensor on cpu and gpu but got error, on a node with more than 400 GB of memory.
But it is different, so I created a new thread.
In this question, I have changed the size of the input tensor.
import torch
from torch import nn
import numpy as np
num_embedding, num_dim = 14000, 300
embedding = nn.Embedding(num_embedding, num_dim)
row, col = 8000, 302
t = [[x for x in range(col)] for _ in range(row)]
t1 = torch.tensor(t)
print(t1.shape) # torch.Size([8000, 302])
type(t1), t1.device, (t1.nelement() * t1.element_size())/(1024**3) # (torch.Tensor, device(type='cpu'), 0.01800060272216797)
tt = embedding(t1)
embedding.forward(t1)
t2 = t1.cuda()
t2.device, t2.shape, t2.grad, t2.nelement(), t2.element_size(), (t2.nelement() * t2.element_size())/(1024**3) # (device(type='cuda', index=0), torch.Size([8000, 302]), None, 2416000, 8, 0.01800060272216797)
embedding_cuda = embedding.cuda()
torch.cuda.empty_cache()
embedding_cuda(t2) # RuntimeError: CUDA out of memory. Tried to allocate 2.70 GiB (GPU 0; 11.17 GiB total capacity; 7.19 GiB already allocated; 2.01 GiB free; 8.88 GiB reserved in total by PyTorch)
Why can the small tensor (0.018 GB) be allocated on the CPU but not on the GPU of the same node (p2.8xlarge)? And why does the allocation require 2.7 GB, which is at least 100 times larger than the original tensor?
I have checked most of the posts at https://stackoverflow.com/search?q=RuntimeError%3A+CUDA+out+of+memory.+Tried+to+allocate+GiB
but none of them helped me with this.
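For reference, a quick size calculation shows where the 2.70 GiB could come from: the embedding lookup does not re-allocate the input indices, it allocates the float32 output of shape (8000, 302, 300):

rows, cols, num_dim = 8000, 302, 300
out_bytes = rows * cols * num_dim * 4   # 4 bytes per float32 element
print(out_bytes / 1024**3)              # ~2.70 GiB, matching the error message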

How to initialize an empty tensor with a certain dimension and append to it in a loop without CUDA out of memory?

I am trying to append tensors (t) generated in a for-loop to a list [T] that accumulates all these tensors. Next, the list [T] needs to be converted into a tensor and loaded onto the GPU.
b_output = []
for eachInputId, eachMask in zip(b_input_ids, b_input_mask):
    # unrolled into each individual document
    # print(eachInputId.size())  # individual document here
    outputs = model(eachInputId,
                    token_type_ids=None,
                    attention_mask=eachMask)
    # combine the [CLS] output layer to form the document
    doc_output = torch.mean(outputs[1], dim=0)  # size = [1, ncol]
    b_output.append(doc_output)

t_b_output = torch.tensor(b_output)
Another method that I tried was initializing a tensor {T} with fixed dimensions and appending the tensors (t) to it from the for-loop.
b_output = torch.zeros(batch_size, hidden_units)
b_output.to(device)  # cuda device
for index, (eachInputId, eachMask) in enumerate(zip(b_input_ids, b_input_mask)):
    # unrolled into each individual document
    # print(eachInputId.size())  # individual document here
    outputs = model(eachInputId,
                    token_type_ids=None,
                    attention_mask=eachMask)
    # combine the [CLS] output layer to form the document
    doc_output = torch.mean(outputs[1], dim=0)  # size = [1, ncol]
    b_output[index] = doc_output
Doing either of this produces this error:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.65 GiB already allocated; 2.81 MiB free; 10.86 GiB reserved in total by PyTorch)
I assume this is because I am appending the tensors (that are on the GPU) to a list (which is, of course, not on the GPU) and then trying to convert the list into a tensor (that's not on the GPU).
What could be done to append those tensors to another tensor and then load the tensor to GPU for further processing?
I will be grateful for any hint or information.
Try using torch.cat instead of torch.tensor. You are currently trying to allocate memory for your new tensor while all the other tensors are still stored, which might be the cause of the out-of-memory error. Change:
t_b_output = torch.tensor(b_output)
with:
t_b_output = torch.cat(b_output)
Hope this helps.
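A possible follow-up, assuming no gradients are needed for the pooled outputs: wrapping the loop in torch.no_grad() (or detaching each doc_output) keeps each document's autograd graph from piling up on the GPU, which is often the larger cost here; and since torch.mean(..., dim=0) returns a 1-D vector, torch.stack may fit better than torch.cat for building a [batch, ncol] tensor. A minimal sketch:

b_output = []
with torch.no_grad():  # do not keep per-document autograd graphs around
    for eachInputId, eachMask in zip(b_input_ids, b_input_mask):
        outputs = model(eachInputId, token_type_ids=None, attention_mask=eachMask)
        doc_output = torch.mean(outputs[1], dim=0)   # 1-D vector of size [ncol]
        b_output.append(doc_output)
t_b_output = torch.stack(b_output)  # shape [num_documents, ncol], already on the GPU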

Python 3.5 - How to delete (release GPU memory) a variable from inside a function

I defined a function in Python 3.5 called 'evaluate' and the code is shown below ('REC_Y', 'REC_U', 'REC_V' represent the 3 channels of a YCbCr image respectively):
import numpy as np

def evaluate(REC_Y, REC_U, REC_V):
    height = 832
    width = 480
    bufY = np.reshape(np.asarray(REC_Y), (height, width))
    bufU = np.reshape(np.asarray(REC_U), (int(height / 2), int(width / 2)))
    bufV = np.reshape(np.asarray(REC_V), (int(height / 2), int(width / 2)))
    return np.stack((bufY, bufU, bufV), axis=2)
In order to release some GPU memory (since I already had a GPU MemoryError), I'd like to remove 'REC_Y', 'REC_U' and 'REC_V' from memory after the last reshape (after 'bufV = np.reshape(np.asarray(REC_V), (int(height / 2), int(width / 2)))'). I have tried 'del REC_Y', but it showed 'REC_Y' referenced before assignment. I have also tried del globals()["REC_Y"], but it showed that "REC_Y" is not defined as a global variable.
Could you please help me with this issue? How to delete 3 parameters of 'evaluate' function to release GPU memory?
Many thanks!
NumPy does not work on the GPU; it works on the CPU.
Only if you had CuPy or CUDA operations could you try to free some memory on the GPU.
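On the del part specifically: deleting a parameter inside the function is legal in Python and only drops the function's local reference; the array is released once no other reference (in the caller, for instance) remains, and since these are NumPy arrays this frees host RAM, not GPU memory. A minimal sketch of that pattern (names are illustrative):

import numpy as np

def evaluate(rec_y):
    buf = np.reshape(np.asarray(rec_y, dtype=np.float32), (832, 480))
    # legal: removes only this function's reference to the argument;
    # the object is freed when the caller's reference is gone as well
    del rec_y
    return buf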

Re-trained keras model evaluation leaks memory when called in a loop

In my application, I'm re-using the existing MobileNet trained on ImageNet and re-training the output layers on the flowers dataset, which has only 5 classes. The re-trained model is saved to disk. Afterwards, the model is loaded and evaluation is run over several iterations, which eventually exhausts memory and crashes the whole application. After some diagnostics, I realized that the leak comes from the model.evaluate() Keras method. The issue can be reproduced with a standalone sample code:
import os
import resource

import keras
import numpy as np

if __name__ == '__main__':
    init_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    for it in range(4):
        x_valid = np.random.uniform(0, 1, (64, 224, 224, 3)).astype(np.float32)
        y_valid = keras.utils.to_categorical(np.random.uniform(0, 5, (64, )).astype(np.int32), 5)

        start_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        model = keras.models.load_model(
            os.path.abspath(os.path.join('.', 'mobilenet_flowers.h5')),
            custom_objects={'relu6': keras.applications.mobilenet.relu6,
                            'DepthwiseConv2D': keras.applications.mobilenet.DepthwiseConv2D})
        loss, _ = model.evaluate(x_valid, y_valid, batch_size=64, verbose=False)
        keras.backend.clear_session()
        del model
        end_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

        print('Iteration %d:' % it)
        print('    Memory alloc before evaluate() is %7d kilobytes' % start_alloc)
        print('    Memory alloc after evaluate() is %7d kilobytes' % end_alloc)
        print('    Memory alloc loss for evaluate is %7d kilobytes\n' % (end_alloc - start_alloc))

    exit_alloc = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('Memory alloc before loop is %7d kilobytes' % init_alloc)
    print('Memory alloc after loop is %7d kilobytes' % exit_alloc)
    print('Memory alloc difference is %7d kilobytes' % (exit_alloc - init_alloc))
When I execute the script, the following is printed out:
Iteration 0:
Memory alloc before evaluate() is 251864 kilobytes
Memory alloc after evaluate() is 901696 kilobytes
Memory alloc loss for evaluate is 649832 kilobytes
Iteration 1:
Memory alloc before evaluate() is 901696 kilobytes
Memory alloc after evaluate() is 1036780 kilobytes
Memory alloc loss for evaluate is 135084 kilobytes
Iteration 2:
Memory alloc before evaluate() is 1036780 kilobytes
Memory alloc after evaluate() is 1148692 kilobytes
Memory alloc loss for evaluate is 111912 kilobytes
Iteration 3:
Memory alloc before evaluate() is 1148692 kilobytes
Memory alloc after evaluate() is 1190804 kilobytes
Memory alloc loss for evaluate is 42112 kilobytes
Memory alloc before loop is 138792 kilobytes
Memory alloc after loop is 1190804 kilobytes
Memory alloc difference is 1052012 kilobytes
Any suggestions as to what might be wrong here? After going through the forums, I tried adding K.clear_session(), but, as you can see in the code, that didn't help. The model is stored temporarily at https://ufile.io/rgaxs.
Some additional info about my environment:
== cat /etc/issue ===============================================
Linux 4.10.0-38-generic #42~16.04.1-Ubuntu SMP Tue Oct 10 16:32:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
No
== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== check pips ===================================================
numpy (1.12.1)
numpydoc (0.7.0)
protobuf (3.5.0)
tensorflow (1.4.0)
tensorflow-tensorboard (0.4.0rc3)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.4.0
tf.GIT_VERSION = v1.4.0-rc1-11-g130a514
tf.COMPILER_VERSION = v1.4.0-rc1-11-g130a514
keras.VERSION = 2.0.9
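Since keras.backend.clear_session() and del model do not stop the growth in this setup (TF 1.4, Keras 2.0.9), one workaround I am considering is to run each load/evaluate cycle in a short-lived subprocess so the OS reclaims everything when the worker exits. This is a sketch of that workaround, not a fix for the underlying leak, and it assumes the same model file and custom_objects as in the script above:

import multiprocessing as mp

def evaluate_once(model_path, result_queue):
    # import inside the worker so TF/Keras state lives and dies with the process
    import keras
    import numpy as np

    x_valid = np.random.uniform(0, 1, (64, 224, 224, 3)).astype(np.float32)
    y_valid = keras.utils.to_categorical(np.random.uniform(0, 5, (64,)).astype(np.int32), 5)
    model = keras.models.load_model(
        model_path,
        custom_objects={'relu6': keras.applications.mobilenet.relu6,
                        'DepthwiseConv2D': keras.applications.mobilenet.DepthwiseConv2D})
    loss, _ = model.evaluate(x_valid, y_valid, batch_size=64, verbose=False)
    result_queue.put(loss)

if __name__ == '__main__':
    queue = mp.Queue()
    for it in range(4):
        worker = mp.Process(target=evaluate_once, args=('mobilenet_flowers.h5', queue))
        worker.start()
        loss = queue.get()   # blocks until the worker has finished evaluating
        worker.join()
        print('Iteration %d: loss = %.4f' % (it, loss))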
