I have a MacBook Pro with a 2 GB NVIDIA GPU. I am trying to utilize all of my GPU memory for computations (Python code). How much might I save if I bypassed the GUI and accessed my machine only through the command line? I want to know whether doing so would free up a meaningful amount of GPU memory.
The difference probably won't be huge.
A GPU that is only hosting a console display will typically have only ~5-25 megabytes of memory reserved out of the total. On the other hand, a GPU that is hosting a GUI display (using the NVIDIA GPU driver) might typically have ~50 megabytes or more reserved for display use (this may vary somewhat with the size of the attached display).
So you can probably get a good estimate of the savings by running nvidia-smi and looking at the difference between total and available memory for your GPU with the GUI running. If that is, for example, 62MB, then you can probably recover around 40-50MB by shutting off the GUI, for example by switching to runlevel 3 on Linux.
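If you'd rather take that measurement from Python, torch.cuda.mem_get_info() reports the free and total memory for a device. A minimal sketch, assuming a reasonably recent PyTorch build with CUDA support:

import torch

# Free and total memory (in bytes) on device 0, as reported by the driver.
# Note: this call initializes a CUDA context, whose own overhead is included,
# but that overhead is the same in both runs, so the difference still holds.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print("reserved/in use: %.0f MB" % ((total_bytes - free_bytes) / 1024**2))

Run it once with the GUI up and once after dropping to the console (e.g. sudo init 3) and compare the two numbers.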
I just ran this experiment on a Linux laptop with a Quadro 3000M that happens to have 2GB of memory. With the X display up and the NVIDIA GPU driver loaded, the "used" memory was 62MB out of 2047MB (as reported by nvidia-smi).
When I switched to runlevel 3 (X not started), memory usage dropped to about 4MB. That probably means ~50MB of additional memory is available to CUDA.
A side benefit of shutting off the GUI might be the elimination of the display watchdog (the timeout mechanism that kills long-running CUDA kernels on a GPU that is driving a display).
I made a game (using Unity) that is powered by two open-source machine learning models, VQGAN-CLIP and GPT-NEO (converted to an exe via pyinstaller), running simultaneously.
VQGAN-CLIP generates images in the background while GPT-NEO needs to generate text periodically. When GPT-NEO gets called while VQGAN-CLIP is generating something, MSI Afterburner shows that VRAM usage exceeds (or was already at) the maximum, and I get a whole-system freeze of about 1 second while the VRAM swaps (I confirmed that this only happens when both are running simultaneously, not when only one of them is running). I am using an RTX 2060S with 8GB of VRAM.
Also, I have 64 GB of RAM and usage is nowhere near that, so it is only swapping to RAM rather than to the SSD.
It is not feasible to block GPT-NEO until VQGAN-CLIP finishes generating its image, because that can take 30-60 seconds per image, which is too long to wait between turns.
My question is: is this absolutely inevitable, or is there a workaround to swap the VRAM without freezing the whole system, or better yet, without freezing the game itself?
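One mitigation worth trying, if both models run under PyTorch, is to cap each process's share of VRAM so that the two together stay under the card's 8GB and the driver never has to page VRAM mid-frame. A minimal sketch, assuming a roughly even split (the 0.5 fraction is an assumption; tune it to each model's actual footprint):

import torch

# Cap this process's PyTorch caching allocator at ~50% of device 0's VRAM.
# Allocations beyond the cap raise an out-of-memory error immediately
# instead of triggering driver-level VRAM<->RAM paging.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)

This only helps if each model actually fits within its budget; if not, the usual alternative is to shrink the models (half precision, smaller batch sizes) rather than rely on driver paging.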
I need to create variables directly on the GPU because I am very limited in my CPU RAM.
I found a method to do this here:
https://discuss.pytorch.org/t/how-to-create-a-tensor-on-gpu-as-default/2128/3
which mentions using
torch.set_default_tensor_type('torch.cuda.FloatTensor')
However, when I tried
torch.set_default_tensor_type('torch.cuda.FloatTensor')
pytorchGPUDirectCreate = torch.FloatTensor(20000000, 128).uniform_(-1, 1).cuda()
It still seemed to take up mostly CPU RAM before being transferred to GPU RAM.
I am using Google Colab. To view RAM usage during the variable creation process: after running the cell, go to Runtime -> Manage Sessions.
Both with and without torch.set_default_tensor_type('torch.cuda.FloatTensor'), CPU RAM jumps to 11.34 GB while GPU RAM stays low; then GPU RAM rises to 9.85 GB and CPU RAM drops back down.
It seems that torch.set_default_tensor_type('torch.cuda.FloatTensor') didn't make a difference.
For convenience, here's a direct link to a notebook anyone can run:
https://colab.research.google.com/drive/1LxPMHl8yFAATH0i0PBYRURo5tqj0_An7
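For what it's worth, a plausible explanation: torch.FloatTensor explicitly names the CPU tensor class, so the default-tensor-type setting never applies to it, and the ~10 GB buffer is first built in host RAM before .cuda() copies it across. Passing device='cuda' to a factory function allocates directly on the GPU instead; a minimal sketch:

import torch

# Allocate uninitialized storage directly on the GPU and fill it in place;
# no CPU-side staging buffer is created.
pytorchGPUDirectCreate = torch.empty(20000000, 128, device='cuda').uniform_(-1, 1)

(20000000 x 128 float32 values is roughly 10 GB, which lines up with the 9.85 GB the Colab session reports.)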
Background:
I was trying to set up an Ubuntu machine on my desktop computer. The whole process took an entire day, including installing the OS and software. I didn't think much of it, though.
Then I tried doing my work on the new machine, and it was significantly slower than my laptop, which was very strange.
I ran iotop and found that disk traffic while decompressing a package was around 1-2MB/s, which is definitely abnormal.
Then, after hours of research, I found this article, which describes exactly the same problem and provides an ugly solution:
We recently had a major performance issue on some systems, where disk write speed is extremely slow (~1 MB/s — where normal performance is 150+MB/s).
...
EDIT: to solve this, either remove enough RAM, or add "mem=8G" as a kernel boot parameter (e.g. in /etc/default/grub on Ubuntu — don't forget to run update-grub!)
I also looked at this post
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
and did
cat /proc/vmstat | egrep "dirty|writeback"
output is:
nr_dirty 10
nr_writeback 0
nr_writeback_temp 0
nr_dirty_threshold 0 // and here
nr_dirty_background_threshold 0 // here
Those values were 8223 and 4111 when mem=8G was set.
So it's basically showing that when system memory is greater than 8GB (32GB in my case), regardless of the vm.dirty_background_ratio and vm.dirty_ratio settings (5% and 10% in my case), the actual dirty thresholds go to 0 and the write buffer is disabled?
Why is this happening?
Is this a bug in the kernel or somewhere else?
Is there a solution other than unplugging RAM or using "mem=8G"?
UPDATE: I'm running the 3.13.0-53-generic kernel with Ubuntu 12.04 32-bit, so it's possible that this only happens on 32-bit systems.
If you use a 32-bit kernel with more than 2G of RAM, you are running in a sub-optimal configuration where significant tradeoffs must be made. This is because in these configurations the kernel can no longer map all of physical memory at once.
As the amount of physical memory increases beyond this point, the tradeoffs become worse and worse, because the struct page array that is used to manage all physical memory must be kept mapped at all times, and that array grows with physical memory.
The physical memory that isn't directly mapped by the kernel is called "highmem", and by default the writeback code treats highmem as undirtyable. This is what results in your zero values for the dirty thresholds.
You can change this by setting /proc/sys/vm/highmem_is_dirtyable to 1, but with that much memory you will be far better off if you install a 64-bit kernel instead.
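For reference, the knob can be flipped at runtime (root required); a minimal sketch in Python, equivalent to sysctl -w vm.highmem_is_dirtyable=1:

# Allow highmem pages to be counted as dirtyable, restoring nonzero
# dirty thresholds on 32-bit kernels with large amounts of RAM (run as root).
with open('/proc/sys/vm/highmem_is_dirtyable', 'w') as f:
    f.write('1')

Adding vm.highmem_is_dirtyable = 1 to /etc/sysctl.conf makes the setting persist across reboots.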
Is this a bug in the kernel
According to the article you quoted, this is a bug that did not exist in earlier kernels and is fixed in more recent ones.
Note that this issue seems to be fixed in later releases (3.5.0+) and is a regression (doesn't happen on e.g. 2.6.32)
I am trying to hunt down a possible memory leak in my SharpDX / DirectX application.
I am getting the following information from Process Explorer, which I do not know how to interpret.
What is Dedicated GPU Memory?
What is System GPU Memory?
What is Committed GPU Memory?
Dedicated GPU memory is basically the VRAM on board the GPU.
System GPU memory is memory that the graphics card driver is using via the GART (Graphics Address Remapping Table) to store resources in system memory. AGP and PCI Express both provide regions of memory set aside for this purpose (sometimes referred to as aperture segments).
Committed GPU memory refers to the amount of memory mapped into a display device's address space by the display driver. It is a difficult concept to explain, and this number typically does not mean much to anyone but driver developers.
I suggest you look at the following documentation on MSDN as well as this overview of GPU address space segmentation; while they are somewhat technical, they give a general overview of what is going on.
Today, I received a few alerts about swapping activity of 3000KB/sec. This Linux box has very few processes running and 32GB of RAM in total. When I logged in and ran free, I did not see anything suspicious: the ratio of free memory to memory used in the buffers/cache row was high enough (25GB free to 5GB used).
So I am wondering:
What are the main causes of paging on a Linux system?
How does swappiness impact paging? (A quick way to check it is sketched below.)
How long does a page stay in physical RAM before it is swapped out? What controls this behavior on Linux?
Is it possible that, even with adequate free physical RAM, a process's memory access pattern spreads data over so many pages that it causes paging?
For example, consider a 5GB array that the program accesses in a loop, slowly enough that pages which have not been touched recently get swapped out. Again, keep in mind that even if the buffer is 5GB, there could be 20GB of physical RAM available.
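As a starting point for the first two questions, the kernel's own counters show whether pages are actually moving to and from swap, and vm.swappiness shows how strongly the kernel prefers swapping over dropping page cache. A minimal diagnostic sketch in Python, reading /proc (field names as found on 2.6/3.x-era kernels):

# Print cumulative swap-in/swap-out page counts and the swappiness setting.
with open('/proc/vmstat') as f:
    counters = dict(line.split() for line in f)
print('pswpin (pages swapped in): ', counters.get('pswpin'))
print('pswpout (pages swapped out):', counters.get('pswpout'))
with open('/proc/sys/vm/swappiness') as f:
    print('vm.swappiness =', f.read().strip())

Sampling pswpin/pswpout twice and taking the difference gives the swap rate; a high rate alongside plenty of free memory can point at the swappiness setting or NUMA zone pressure rather than a true memory shortage.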
UPDATE:
The Linux distribution is RHEL 6.3, kernel version 2.6.32-279.el6.x86_64.