Creating PyTorch variables directly on the GPU still seems to create them on the CPU first, judging by CPU RAM usage - pytorch

I need to create variables directly on the GPU because I am very limited in my CPU RAM.
I found the method to do this here
https://discuss.pytorch.org/t/how-to-create-a-tensor-on-gpu-as-default/2128/3
which mentions using
torch.set_default_tensor_type('torch.cuda.FloatTensor')
However, when I tried
torch.set_default_tensor_type('torch.cuda.FloatTensor')
pytorchGPUDirectCreate = torch.FloatTensor(20000000, 128).uniform_(-1, 1).cuda()
It still seemed to take up mostly CPU RAM before being transferred to GPU RAM.
I am using Google Colab. To view RAM usage during the variable creation process, after running the cell, go to Runtime -> Manage Sessions
With and without using torch.set_default_tensor_type('torch.cuda.FloatTensor'), CPU RAM bumps up to 11.34 GB while GPU RAM stays low, and then GPU RAM goes to 9.85 GB and CPU RAM goes back down.
It seems that torch.set_default_tensor_type('torch.cuda.FloatTensor') didn't make a difference.
For convenience, here's a direct link to a notebook anyone can run:
https://colab.research.google.com/drive/1LxPMHl8yFAATH0i0PBYRURo5tqj0_An7
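For reference, here is a minimal sketch of allocating with an explicit device argument, which as far as I understand should skip the intermediate CPU tensor (I have not verified this in the notebook above):
import torch
# Allocate on the GPU from the start; same shape as in the example above.
pytorchGPUDirectCreate = torch.empty(20000000, 128, device="cuda").uniform_(-1, 1)
print(pytorchGPUDirectCreate.device)  # expected: cuda:0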

Related

Is there a way to prevent VRAM swapping from stalling the whole system?

I made a game (using Unity) powered by two open-source machine learning models, VQGAN-CLIP and GPT-NEO (converted to an exe via PyInstaller), running simultaneously.
VQGAN-CLIP generates images in the background while GPT-NEO needs to generate text periodically. When GPT-NEO gets called while VQGAN-CLIP is generating something, MSI Afterburner shows that VRAM usage exceeds the maximum (or was already at the maximum), and I get a whole-system freeze for about one second while the VRAM swaps (I confirmed it only happens when both are running simultaneously, not when only one of them is running). I am using an RTX 2060S with 8 GB of VRAM.
I also have 64 GB of RAM and usage is nowhere near that, so it's only swapping to RAM rather than to the SSD.
It is not feasible to block GPT-NEO until VQGAN-CLIP finishes generating the image, because it can take 30-60 seconds to generate one image, which is too long to wait between turns.
My question is: Is this absolutely inevitable or is there a workaround to swap the VRAM without freezing the whole system, or better yet the game itself?
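For what it's worth, one untested direction (assuming both models run through PyTorch) would be to cap each process's share of VRAM so the card is never oversubscribed in the first place; a rough sketch, where the 0.45 split is only a guess:
import torch
# Limit this process's PyTorch caching allocator to 45% of the 8 GB card.
# Allocations past the cap fail with an out-of-memory error (which the code
# can catch and retry) instead of spilling into driver-managed VRAM swapping.
torch.cuda.set_per_process_memory_fraction(0.45, device=0)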

In PyTorch, can I load a tensor from file directly to the GPU, without using CPU memory?

I'm working on feature generation before I train a model in PyTorch. I wish to save my features as PyTorch tensors on disk for later use in training.
One of my features ("Feature A") is calculated on the CPU, while another feature ("Feature B") must be calculated from Feature A on a GPU (some linear algebra). I have an unusual limitation: on my university cluster, jobs which don't use GPUs have a CPU memory limit of 1 TB each, while jobs which do use GPUs have a CPU memory limit of 4 GB and a GPU memory limit of 48 GB. Feature A and Feature B are each approximately 10 GB.
Naturally, I want to first calculate Feature A using CPUs only, then save Feature A to disk. In another job (this one with GPU access and thus the 4 GB CPU memory limit), I want to load Feature A directly to the GPU, compute Feature B, then save Feature B to disk.
With Feature A computed and saved to disk, I've tried:
feaB = torch.load(feaAfile, map_location=torch.device('cuda'))
And yet I max out my CPU memory. I've confirmed CUDA is available.
In the PyTorch documentation I see that in loading tensors they "are first deserialized on the CPU..."
I wonder if there is any way to avoid using CPU memory when I want to load only onto the GPU. If the tensor must first be copied to the CPU, could I use some sort of 4 GB buffer? Thanks so much in advance.
EDIT: per discussion in the comments, I no longer need to do this. But the question itself, of loading a tensor to the GPU without using CPU memory, remains unanswered so I'm leaving this question up.
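To make the buffer idea above concrete, here is a rough sketch; the path, dtype, and shape are hypothetical, and it assumes Feature A is saved as a raw float32 array rather than via torch.save. It streams the data to the GPU in row chunks so the resident CPU working set stays small:
import numpy as np
import torch
path = "feature_a.f32"     # hypothetical: raw little-endian float32 data on disk
shape = (2_500_000, 1024)  # hypothetical: ~10 GB of float32
# Memory-map the file so the OS pages it in lazily instead of reading all 10 GB into CPU RAM.
cpu_view = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
# Pre-allocate the destination on the GPU, then copy row chunks through a small CPU buffer.
gpu = torch.empty(shape, dtype=torch.float32, device="cuda")
rows_per_chunk = 100_000   # tune so one chunk stays well under the 4 GB CPU limit
for start in range(0, shape[0], rows_per_chunk):
    stop = min(start + rows_per_chunk, shape[0])
    chunk = torch.from_numpy(cpu_view[start:stop].copy())  # small CPU-side copy
    gpu[start:stop].copy_(chunk)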

Mono 4.2.2 garbage collection really slow/leaking on Linux with multiple threads?

I have an app that processes 3+ GB of data into 300 MB of data. If I run each independent dataset sequentially on the main thread, memory usage tops out at about 3.5 GB and it works fine.
If I run each dataset concurrently on 10 threads, I see the following:
Virtual memory usage climbs steadily until allocations fail and it crashes (I can see the GC trying to run in the stack trace).
CPU utilization is 1000% for periods, then drops to 100% for minutes, and then cycles back up. The app is easily 10x slower when run with multiple threads, even though the datasets are completely independent.
This is the Mono 4.2.2 build for Linux with large heap support, running on a machine with 128 GB of RAM and 40 logical processors. I am running mono-sgen and have tried all the custom GC settings I could think of (concurrent mark-sweep, max heap size, etc.).
These problems do not happen on Windows. If I rewrite the code to do significant object pooling, I get farther into the dataset before running out of memory, but the fate is the same. I have verified that I have no memory leaks using multiple tools and good old printf debugging.
My best theory is that lots of allocations across lots of threads are a weak case for the GC, and that most of the wall-clock time is spent with my worker threads suspended.
Does anyone have any experience with this? Is there a way I can help the GC get out of that 100% rut it gets stuck in, and to not run out of memory?

Saving GPU memory by bypassing GUI

I have a MacBook Pro with a 2 GB NVIDIA GPU. I am trying to utilize all of my GPU memory for computations (Python code). How much memory might I save if I bypassed the GUI and only accessed my machine through the command line? I want to know whether doing so would save a good amount of GPU memory.
The difference probably won't be huge.
A GPU that is only hosting a console display will typically have only ~5-25 megabytes of memory reserved out of the total memory size. On the other hand, a GPU that is hosting a GUI display (using the NVIDIA GPU driver) might typically have ~50 megabytes or more reserved for display use (this may vary somewhat based on the size of the attached display).
So you can probably get a good estimate of the savings by running nvidia-smi and looking at the difference between total and available memory for your GPU with the GUI running. If that is, for example, 62 MB, then you can probably recover around 40-50 MB by shutting off the GUI, for example by switching to runlevel 3 on Linux.
I just ran this experiment on a Linux laptop with a Quadro 3000M that happens to have 2 GB of memory. With the X display up and the NVIDIA GPU driver loaded, the "used" memory was 62 MB out of 2047 MB (reported by nvidia-smi).
When I switched to runlevel 3 (X not started), the memory usage dropped to about 4 MB. That probably means ~50 MB of additional memory is available to CUDA.
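If it helps, the same numbers can also be read programmatically; a small sketch, assuming nvidia-smi is on the PATH:
import subprocess
# Query total and used memory (in MiB) for each GPU, without the table layout.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total,memory.used", "--format=csv,noheader,nounits"],
    text=True)
for line in out.strip().splitlines():
    total_mib, used_mib = (int(x) for x in line.split(","))
    print(f"used {used_mib} MiB of {total_mib} MiB")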
A side benefit of shutting off the GUI might be the elimination of the display watchdog.

CUDA Process Memory Usage [duplicate]

When I run my CUDA program, which allocates only a small amount of global memory (below 20 MB), I get an "out of memory" error. (From other people's posts, I think the problem is related to memory fragmentation.) While trying to understand this problem, I realized I have a couple of questions related to CUDA memory management.
Is there a virtual memory concept in CUDA?
If only one kernel is allowed to run on CUDA at a time, then after it terminates, will all of the memory it used or allocated be released? If not, when does this memory get released?
If more than one kernel is allowed to run on CUDA, how can they make sure the memory they use does not overlap?
Can anyone help me answer these questions? Thanks.
Edit 1: operating system: x86_64 GNU/Linux
CUDA version: 4.0
Device: GeForce 200. It is one of the GPUs attached to the machine, and I don't think it is a display device.
Edit 2: The following is what I got after doing some research. Feel free to correct me.
CUDA will create one context for each host thread. This context keeps information such as which portion of memory (pre-allocated or dynamically allocated) has been reserved for this application, so that other applications cannot write to it. When this application terminates (not the kernel), this portion of memory will be released.
CUDA memory is maintained in a linked list. When an application needs to allocate memory, it goes through this linked list to see if there is a contiguous memory chunk available for allocation. If it fails to find such a chunk, an "out of memory" error is reported to the user even though the total available memory is greater than the requested amount. That is the problem related to memory fragmentation.
cuMemGetInfo will tell you how much memory is free, but not necessarily how much you can allocate in a single maximum allocation, due to memory fragmentation.
On the Vista platform (WDDM), GPU memory virtualization is possible. That is, multiple applications can allocate almost the whole GPU memory, and WDDM will manage swapping data back to main memory.
New questions:
1. If the memory reserved by the context is fully released after the application terminates, memory fragmentation should not exist. There must be some kind of data left in memory.
2. Is there any way to restructure the GPU memory?
The device memory available to your code at runtime is basically calculated as
Free memory = total memory
- display driver reservations
- CUDA driver reservations
- CUDA context static allocations (local memory, constant memory, device code)
- CUDA context runtime heap (in-kernel allocations, recursive call stack, printf buffer; only on Fermi and newer GPUs)
- CUDA context user allocations (global memory, textures)
If you are getting an out-of-memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory on the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in the context will consume, for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocessor on the device. This can run into hundreds of MB of memory if a local-memory-heavy kernel is loaded on a device with a lot of multiprocessors.
The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run your problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call; that will give you the amount of memory your context is using. That might let you get a handle on where the memory is going. It is very unlikely that fragmentation is the problem if you are getting a failure on the first cudaMalloc call.
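As a sketch of that probe, here is a version using PyCUDA (an assumption on my part; the equivalent CUDA C host program is just a context-establishing call followed by cudaMemGetInfo):
import pycuda.autoinit       # creates a CUDA context on device 0, with no kernels loaded
import pycuda.driver as cuda
# mem_get_info wraps cuMemGetInfo: free and total bytes as seen by this context,
# i.e. after driver and context reservations are taken out.
free_bytes, total_bytes = cuda.mem_get_info()
print(f"free: {free_bytes / 2**20:.1f} MiB of {total_bytes / 2**20:.1f} MiB")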
GPU off-chip memory is separated into global, local and constant memory. These three memory types are a virtual memory concept. Global memory is free for all threads, local memory is for one thread only (mostly used for register spilling), and constant memory is cached global memory (writable only from host code). Have a look at section 5.3.2 of the CUDA C Programming Guide.
EDIT: removed
Memory allocated via cudaMalloc never overlaps. For the memory a kernel allocates at runtime, enough memory should be available. If you are out of memory and try to start a kernel (only a guess on my part), you should get the "unknown error" message. The driver was then unable to start and/or execute the kernel.
