I have 2 GPUs on different computers. One (an NVIDIA A100) is in a server, the other (an NVIDIA Quadro RTX 3000) is in my laptop. I watched the performance on both machines via nvidia-smi and noticed that the two GPUs use different amounts of memory when running the exact same process (same code, same data, same CUDA version, same PyTorch version, same drivers). I created a dummy script to verify this.
import torch
device = torch.device("cuda:0")
# dtype=float maps to torch.float64, so this tensor needs 10000 * 10000 * 8 bytes
a = torch.ones((10000, 10000), dtype=float).to(device)
In nvidia-smi I can see how much memory is used for this specific python script:
A100: 1205 MiB
RTX 3000: 1651 MiB
However, when I query torch about memory usage I get the same values for both GPUs:
reserved = torch.cuda.memory_reserved(0)
allocated = torch.cuda.memory_allocated(0)
Both systems report the same usage:
reserved = 801112064 bytes (764 MiB)
allocated = 800000000 bytes (763 MiB)
I note that the allocated amount is much less than what nvidia-smi shows as used, although 763 MiB is exactly the size of 100E6 float64 values (10000 × 10000 values × 8 bytes).
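As a quick sanity check on that arithmetic, here is a minimal sketch (assuming the tensor from the dummy script above is still on the device):
import torch
# 10000 x 10000 float64 values should occupy 10000 * 10000 * 8 bytes
expected_bytes = 10000 * 10000 * 8
print(expected_bytes)                    # 800000000, i.e. ~763 MiB
print(torch.cuda.memory_allocated(0))    # also 800000000 on both machines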
Why does nvidia-smi report different memory usage on these 2 systems?
Related
I'm trying to run a reinforcement learning algorithm using PyTorch, but it keeps telling me that CUDA is out of memory. However, it seems that PyTorch is only accessing a tiny amount of my GPU's memory.
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 4.00 GiB total capacity; 3.78 MiB already allocated; 0 bytes free; 4.00 MiB reserved in total by PyTorch)
It's not that PyTorch is only accessing a tiny amount of GPU memory; rather, your PyTorch program has accumulated tensors in GPU memory, and that 2 MiB allocation is simply the one that exceeds the limit. Try using a lower batch size or running the model in half precision to save GPU memory.
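For instance, a minimal sketch of the half-precision suggestion using torch.cuda.amp.autocast (the model and batch here are placeholders, not from the original question):
import torch

model = torch.nn.Linear(1024, 1024).cuda()       # placeholder model
batch = torch.randn(32, 1024, device="cuda")     # try a smaller batch size
# autocast runs eligible ops in float16, roughly halving the memory
# needed for activations compared to full float32.
with torch.cuda.amp.autocast():
    out = model(batch)
print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")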
This snippet controls which GPU PyTorch is allowed to access:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
There is a memory leak when using pipe with the en_core_web_trf model. I run the model on a GPU with 16 GB of RAM; here is a sample of the code.
!python -m spacy download en_core_web_trf
import en_core_web_trf

nlp = en_core_web_trf.load()
# data is just an array of 100K sentences.
data = dataload()
for index, review in enumerate(nlp.pipe(data, batch_size=100)):
    # doing some processing here
    if index % 1000 == 0:
        print(index)
This code crashes when it reaches around 31K sentences and raises an OOM error.
CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 11.17 GiB total capacity; 10.44 GiB already allocated; 832.00 KiB free; 10.72 GiB reserved in total by PyTorch)
I only use the pipeline for prediction, not for training or anything else, and I tried different batch sizes, but nothing changed; it still crashes.
Your Environment
spaCy version: 3.0.5
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.7.10
Pipelines: en_core_web_trf (3.0.0)
Lucky you with a GPU - I am still trying to get through the (torch GPU) DLL hell on Windows :-). But it looks like spaCy 3 uses more GPU memory than spaCy 2 did - my 6 GB GPU may have become useless.
That said, have you tried running your case without the GPU (and watching memory usage)?
The spaCy 2 'leak' on large datasets is (mainly) due to the growing vocabulary - each data row may add a couple more words - and the suggested 'solution' is reloading the model and/or just the vocabulary every nnn rows. The GPU usage may have the same issue...
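A rough sketch of that reload-every-nnn-rows workaround applied to the code above (the chunk size is arbitrary, and whether it actually frees GPU memory for the transformer pipeline is untested):
import en_core_web_trf

chunk_size = 10_000
data = dataload()                  # the same 100K-sentence list as above
nlp = en_core_web_trf.load()
for start in range(0, len(data), chunk_size):
    for doc in nlp.pipe(data[start:start + chunk_size], batch_size=100):
        pass                       # doing some processing here
    # Drop and recreate the pipeline so any accumulated state is released.
    del nlp
    nlp = en_core_web_trf.load()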
I made a simple test using PyTorch which involves measuring the current GPU memory use, creating a tensor of a certain size, moving it to the GPU, and measuring the GPU memory again. According to my calculation, around 6.5 KiB of GPU memory was needed to store each element of the tensor! Here is the breakdown:
GPU memory use before creating the tensor as shown by nvidia-smi: 384 MiB.
Create a tensor with 100,000 random elements:
a = torch.rand(100000)
Transfer the tensor to the GPU:
device = torch.device('cuda')
b = a.to(device)
GPU memory use after the transfer: 1020 MiB
Calculate the memory change per element of the tensor:
(1020-384)*1024*1024/len(b)
# Answer is 6668.94336
This is weird, to say the least. Why would 6.5 KiB of GPU memory be required to store a single float32 element?
UPDATE: Following Robert Crovella's suggestion in the comments, I created another tensor c and then moved it to the CUDA device as d. The GPU memory usage didn't increase. So it seems that PyTorch or CUDA needs some 636 MiB just for bootstrapping. Why is that? What is this memory used for? It seems like a lot to me!
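One way to separate that one-time bootstrapping cost from the per-element cost is to pay the context cost with a tiny allocation first; a minimal sketch:
import torch

device = torch.device("cuda")
_ = torch.zeros(1, device=device)      # forces CUDA context creation and kernel loading up front
before = torch.cuda.memory_allocated()
b = torch.rand(100000).to(device)      # 100,000 float32 values
after = torch.cuda.memory_allocated()
print(after - before)                  # ~400,000 bytes, i.e. ~4 bytes per element plus allocator rounding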
I am trying to limit the GPU memory usage to exactly 10% of the GPU's memory, but according to nvidia-smi the program below uses about 13% of it. Is this expected behavior? If so, where does the other roughly 3-4% come from?
import tensorflow as tf
from time import sleep

i = tf.constant(0)
x = tf.constant(10)
r = tf.add(i, x)
# Use at most 10% of GPU memory; I expect this to set a hard limit.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.1)
# sleep is used to see what nvidia-smi reports for GPU memory usage.
# I expect at most 10% of GPU memory (which is 1616.0 MiB for my GPU),
# but instead I see the process using up to 2120 MiB.
with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
    sess.run(r)
    sleep(10)
See this github issue for more details about my environment and gpu: https://github.com/tensorflow/tensorflow/issues/22158
From my experimentation, it looks like cuDNN and cuBLAS context initialization takes around 228 MB of memory. Also, the CUDA context can take from 50 to 118 MB.
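If a hard cap rather than a fraction is acceptable, TF 2.x exposes a per-GPU memory_limit; a sketch assuming TF 2.4 or newer (the CUDA/cuDNN/cuBLAS contexts still live outside this cap, so nvidia-smi will report more than the limit):
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Create a logical GPU capped at 1024 MiB of device memory.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)],
    )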
Is there a way to force a maximum value for the amount of GPU memory available to a particular PyTorch instance? For example, my GPU may have 12 GB available, but I'd like to assign at most 4 GB to a particular process.
Update (04-MAR-2021): it is now available in the stable 1.8.0 release of PyTorch; see also the docs.
Original answer follows.
This feature request has been merged into the PyTorch master branch, but it has not yet been included in a stable release.
Introduced as set_per_process_memory_fraction
Set memory fraction for a process.
The fraction is used to limit the caching allocator to a portion of the memory on a CUDA device. The allowed value equals the total visible memory multiplied by the fraction. Trying to allocate more than the allowed value in a process will raise an out-of-memory error in the allocator.
You can check the tests as usage examples.
Update PyTorch to 1.8.0:
(pip install --upgrade torch==1.8.0)
function: torch.cuda.set_per_process_memory_fraction(fraction, device=None)
params:
fraction (float) – Range: 0~1. Allowed memory equals total_memory * fraction.
device (torch.device or int, optional) – selected device. If it is None the default CUDA device is used.
Example:
import torch
torch.cuda.set_per_process_memory_fraction(0.5, 0)
torch.cuda.empty_cache()
total_memory = torch.cuda.get_device_properties(0).total_memory
# less than 0.5 will be ok:
tmp_tensor = torch.empty(int(total_memory * 0.499), dtype=torch.int8, device='cuda')
del tmp_tensor
torch.cuda.empty_cache()
# this allocation will raise an OOM error:
torch.empty(total_memory // 2, dtype=torch.int8, device='cuda')
It raises an error as follows:
RuntimeError: CUDA out of memory. Tried to allocate 5.59 GiB (GPU 0; 11.17 GiB total capacity; 0 bytes already allocated; 10.91 GiB free; 5.59 GiB allowed; 0 bytes reserved in total by PyTorch)
In contrast to TensorFlow, which by default reserves all of the GPU's memory, PyTorch only uses as much as it needs. However, you could:
Reduce the batch size
Use CUDA_VISIBLE_DEVICES=<GPU index> (can be a comma-separated list) to limit which GPUs can be accessed.
To set this from within the program, try:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"