When I ran nvidia-smi, I found that nearly 20 GB of GPU memory was missing somewhere (the listed processes total 17745 MiB, while Memory-Usage shows 37739 MiB):
Then I ran nvitop, and you can see that a "No Such Process" entry has actually taken my GPU resources. However, I cannot kill this PID:
>>> sudo kill -9 118238
kill: (118238): No such process
How can I get rid of this ghost process without interrupting the others?
I have found the solution in this answer: https://stackoverflow.com/a/59431785/6563277.
First, I ran sudo fuser -v /dev/nvidia* to see all the processes using my GPU memory that nvidia-smi had failed to show.
There I saw several "ghost" Python processes, and after killing them, the GPU RAM was freed up.
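For reference, the whole sequence looks like this (118238 is the ghost PID from the question above; use whichever PIDs fuser actually reports):
sudo fuser -v /dev/nvidia*    # list every process holding an NVIDIA device node
sudo kill -9 118238           # replace with the ghost PID reported by fuser
nvidia-smi                    # confirm the memory has been released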
Related
I am frequently facing an issue with kswapd0 running hot on one of my Linux machines. What could be the reason for that? Looking further into the issue, I understood it is caused by low memory. I tried the options below to avoid it:
echo 1 > /proc/sys/vm/drop_caches
cat /proc/sys/vm/drop_caches
sudo cat /proc/sys/vm/swappiness
sudo sysctl vm.swappiness=60
but they did not yield fruitful results. What would be the best method to avoid this, or does some action need to be taken on the machine's RAM? Any suggestions on this?
Every time we observe the issue, all the running apps are killed automatically and kswapd0 occupies the CPU and memory completely.
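A couple of notes on the commands above: echo 1 > drop_caches only drops the page cache, and vm.swappiness=60 is already the kernel default, so neither changes much. A minimal sketch for confirming memory pressure before tuning anything (standard tools, nothing kswapd0-specific):
sudo sync                                      # flush dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches     # 3 = page cache + dentries + inodes
free -h                                        # check available memory and swap usage
vmstat 1 5                                     # si/so columns reveal swap thrashing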
This is a follow-up question to an earlier question.
From the discussion, the mmc code (https://github.com/fangq/mmc) appears to be fine, and the memory was properly released when running on an Intel CPU and an AMD GPU. However, on an NVIDIA GPU, valgrind reported a significant memory leak, and so did my test: after every cycle of creating and releasing a GPU context, the memory kept increasing.
You can see this result in the memory profiling report below (blue line).
Here are the test and the commands to reproduce the issue (this needs to run on an NVIDIA GPU):
git clone https://github.com/fangq/mmc.git
cd mmc/src
sed -i -e 's/mmc_init_from_cmd/for(int i=0;i<5;i++){\nmmc_init_from_cmd/g' mmc.c
sed -i -e 's/return/getchar();}\nreturn/g' mmc.c
make clean
make all
cd ../examples/validation
../../src/bin/mmc -f cube2.inp -G 1 -s cube2 -n 1e4 -b 0 -D TP -M G -F bin
Run ../../src/bin/mmc -L to list GPUs; use -G # to specify which GPU to use.
As you will see, the simulation repeats 5 times, separated by Enter key presses. You can start a memory monitor, such as the top command on Linux, and watch the memory allocation grow after each repetition.
I googled and found multiple previous reports of OpenCL memory leaks, but no solution. I would like to know whether there is any trick to force the NVIDIA OpenCL driver to clean up memory after each run. I am asking because mmc has a MATLAB/Octave mex function that can be called multiple times, and this issue could lead to large memory usage after repeated calls.
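To put numbers on the growth, here is a hedged sketch for sampling the test's resident set size from a second terminal while the run above waits on getchar() (the pgrep pattern assumes the binary name used above):
pid=$(pgrep -n mmc)                  # newest process named mmc
while kill -0 "$pid" 2>/dev/null; do
    grep VmRSS /proc/"$pid"/status   # resident set size in kB
    sleep 1
done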
I'm running my executable through the mpirun command. I want to get the PIDs of the processes being created so that I can kill them later if required. I'm using MPICH. In Open MPI there is an option -report-pid which gives PIDs. Is there anything similar in MPICH?
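In case there is no MPICH equivalent, a generic process-tree workaround works (this is not an MPICH feature; my_app and the rank count are placeholders):
mpiexec -n 4 ./my_app &    # launch in the background
sleep 1                    # give the launcher time to spawn the ranks
pgrep -x my_app            # exact-name match: lists only the rank processes
# later, if required:
# kill $(pgrep -x my_app)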
nvidia-smi screenshot
The process with PID 14420 is a zombie process and its parent PID is 1 (init). I want to free the 4436 MiB of memory occupied by this zombie process without rebooting.
How should I proceed?
Use this command to list all the processes using the GPU (on Linux):
sudo fuser -v /dev/nvidia*
then find the PID among the listed processes and simply use
kill <PID>
This will kill the process with the specified PID.
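If you prefer a single step, fuser can send the signal itself; be careful, as this kills every process using the NVIDIA devices, not just one:
sudo fuser -k /dev/nvidia*    # sends SIGKILL to every process holding these device nodes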
I have a CentOS image in VirtualBox. When I do curl [url] | tee -a [file], where [url] is the URL of a large file, the system starts to kill all new processes and I get a Killed answer in the console for any command except kill and cd. How can I disable the OOM killer?
The OOM Killer is your friend; why would you want to disable it? When the system runs out of memory, the kernel must start killing processes in order to stay operational. So let's be honest: you need the OOM Killer.
Instead, you might consider tuning the OOM Killer with a configuration that suits your needs, though your current problems may persist.
In light of this, it may be better to implement a more memory-efficient way of doing the tasks you are performing.
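As one example of such tuning, you can bias the killer away from a process you care about through its oom_score_adj (the PID is a placeholder; -1000 exempts a process entirely, positive values up to 1000 make it a preferred victim):
echo -1000 | sudo tee /proc/<PID>/oom_score_adj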
If you don't want "your friend", the OOM killer, to kill innocent processes, the short answer is:
sysctl -w vm.overcommit_memory=2
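With vm.overcommit_memory=2 the kernel stops overcommitting: an allocation that would exceed swap plus vm.overcommit_ratio percent of RAM (50 by default) fails immediately with ENOMEM instead of triggering the OOM killer later. A sketch for making this survive a reboot (tune the ratio to your workload):
sudo sysctl -w vm.overcommit_ratio=80                          # commit limit = swap + 80% of RAM
echo 'vm.overcommit_memory=2' | sudo tee -a /etc/sysctl.conf
echo 'vm.overcommit_ratio=80' | sudo tee -a /etc/sysctl.conf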
More verbose answers and recommended reading:
Effects of configuring vm.overcommit_memory
How to disable the oom killer in linux?
Turn off the Linux OOM killer by default?