I am loading NLP models onto the GPU to do inference.
But once the inference is over, the GPU does not deallocate its memory:
But then the command ps -a | grep python gave me
How do I solve this issue?
I'm having a similar problem: a PyTorch process on the GPU became a zombie and left GPU memory allocated. Furthermore, in my case the process showed 100% usage on the GPU (GPU-Util in the nvidia-smi output). The only solution I have found so far is rebooting the system.
In case you want to try other solutions, here is what I tried before rebooting (without success):
Killing the parent of the zombie process: see this answer. After this, the child zombie process became a child of init (pid=1). init should reap zombie processes automatically, but this did not happen in my case (the process could still be found with ps, and the GPU memory was not freed).
Sending SIGCHLD to init (command: kill -17 1) to force reaping, but init still did not reap the process, and the GPU memory remained in use.
As suggested by this answer, I checked for other child processes that might be related and using the GPU: fuser -v /dev/nvidia*, but no other Python processes were found in my case (other than the original zombie process).
As suggested in this issue, killing processes that are accessing /dev/nvidia0, by running fuser -k /dev/nvidia0. This did not affect the zombie process.
Clearing the GPU with nvidia-smi: nvidia-smi --gpu-reset -i <device>, but this threw device is currently being used by one or more other processes... Please first kill all processes using this device...
In the end, the only solution was rebooting the system.
I'm not sure what caused the error in the first place. I had a PyTorch script training on a single GPU, and I have used the same script many times without issue. I used a DataLoader with num_workers=5, which I suspect may have been the culprit, but I cannot be sure. The process suddenly just hung, without throwing an exception or anything, and left the GPU unusable.
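If the DataLoader workers are the suspect, one way to rule them out is to load batches in the main process. This is only a hedged debugging sketch with a stand-in dataset (not the original training script):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset, illustrative sizes only.
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))

    # num_workers=0 loads every batch in the main process, so a hang caused by
    # worker processes should disappear (at the cost of slower data loading).
    loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
    for inputs, targets in loader:
        pass  # training step would go here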
I'm using versions: PyTorch 1.7.1+cu110, nvidia-driver 455.45.01, running on Ubuntu 18.04.
I killed all Python processes (pkill python), and the zombies are no longer on the GPU. I was using torch.
I had a task running in pgAdmin for a long time; using its PID, I raised its scheduler priority to speed it up:
`sudo renice 20 -p pid`
It began to consume a few more resources (+3%), but this did not give any noticeable speed-up. I found the PID like this:
ps alx | grep pgadmin
and picked the one with the longest execution time.
Then I raised the priority of all processes associated with pgadmin, but none of these actions had a noticeable effect.
I have Ubuntu 22.04 LTS if anyone cares.
How else can I speed up the process of a third-party application?
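For what it's worth, note that on Linux a lower nice value means a higher scheduling priority (-20 is the highest, 19 the lowest), so renice 20 actually gives a process the lowest priority. A hedged sketch (hypothetical values, needs root for negative niceness) of raising the priority of every pgadmin-related process at once:

    import os
    import subprocess

    # Find all pgadmin-related PIDs and give them a higher CPU priority.
    # Lower nice value = higher priority; negative values require root.
    result = subprocess.run(["pgrep", "-f", "pgadmin"],
                            capture_output=True, text=True)
    for pid in result.stdout.split():
        os.setpriority(os.PRIO_PROCESS, int(pid), -5)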
I use a Raspberry Pi [and run Ubuntu Server 20.04 LTS] quite often, so it is advantageous to use memory as responsibly as possible. That being said, I run a number of processes that seem to run fairly efficiently with the 4GB of available memory, at around 2GB. Eventually, though, the memory usage seems to grow closer and closer to the 4GB level. While investigating memory usage with htop, I noticed something with the Python scripts I'm running (I've provided an image of what I'm describing); the processes seem to stack up.
Could this be because I'm using CTRL + Z rather than CTRL + C to restart my Python script?
Please let me know if I can be more specific.
Yes, it's because you use Ctrl+Z. Use Ctrl+C to interrupt your processes; it sends them SIGINT.
Ctrl+Z does not terminate your process; it sends SIGTSTP, which suspends it. The stopped process stays in memory until you resume it with fg or bg, or kill it.
Try this when running some terminal program on your rPi. (It works with vi and many other programs.)
Press ctrl-z
Then do some shell commands. ls or whatever
Then type fg to resume your suspended process.
Believe it or not, this stuff works exactly the same on my rPi running GNU/Linux as it did on Bell Labs UNIX Seventh Edition on a PDP 11/70 back in 1976. But that computer had quite a bit less RAM.
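To make the difference concrete, here is a minimal sketch (a hypothetical script, not the asker's code) of why Ctrl+C frees memory while Ctrl+Z does not:

    import signal
    import sys
    import time

    # Ctrl+C sends SIGINT: by default Python raises KeyboardInterrupt and the
    # process exits, releasing its memory. A handler like this just makes the
    # shutdown explicit.
    def handle_sigint(signum, frame):
        print("SIGINT received, exiting cleanly")
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_sigint)

    # Ctrl+Z sends SIGTSTP instead: the process is merely suspended and stays
    # resident in memory until it is resumed (fg/bg) or killed, which is why
    # repeatedly "restarting" with Ctrl+Z stacks up processes in htop.
    while True:
        time.sleep(1)  # placeholder for the real work loop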
I have a script that uses a lot of memory, and it either just gets "Killed" or cannot be allocated any more memory. When I run the same script on Windows it consumes all my memory and freezes my PC. I want the same to happen on my Linux server instead of the process being killed.
I have tried changing vm.overcommit_memory to 0, 1 and 2, but none of them work. I tried some other things like disabling the OOM killer in Linux, but I can't find the value vm.oom-killer. Please help.
Update:
Another acceptable solution would be to limit the memory it uses, for example to 10 GB, but if it exceeds that, don't let it consume more and don't kill the process; just let it finish.
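If the script happens to be Python, one way to approximate "cap it at roughly 10 GB but let it finish" is to lower the process's own address-space limit, so oversized allocations fail inside the process (as a catchable MemoryError) instead of waking the kernel's OOM killer. A hedged sketch, Linux only:

    import resource

    # Cap this process's virtual address space at ~10 GB.
    limit_bytes = 10 * 1024 ** 3
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    try:
        data = bytearray(20 * 1024 ** 3)  # deliberately larger than the cap
    except MemoryError:
        print("allocation refused; continuing with what fits under the limit")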
I train a network on the GPU with PyTorch. However, after at most 3 epochs, the code stops with the message:
Killed
No other error message is given.
I monitored the memory and GPU usage; there was still space during the run. I reviewed /var/sys/dmesg to find a detailed message regarding this, however no message containing "kill" was logged. What might be the problem?
CUDA version: 9.0
PyTorch version: 1.1.0
If you have root access, you can check whether this is a memory issue or not with the dmesg command.
In my case, the process was killed by the kernel because it ran out of memory.
I found the cause to be saving tensors that require grad to a list; each of those keeps an entire computation graph alive, which consumes a significant amount of memory.
I fixed the issue by saving the .detach()-ed tensor to the list instead of the tensor returned by the loss function.
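A minimal sketch of the fix described above (toy model and data, not the original training code): appending the raw loss tensor keeps its whole computation graph alive, while appending a detached copy (or the plain number via .item()) lets the graph be freed after each step.

    import torch

    model = torch.nn.Linear(10, 1)
    criterion = torch.nn.MSELoss()
    losses = []

    for _ in range(100):
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = criterion(model(x), y)
        loss.backward()
        # losses.append(loss)         # keeps the graph -> memory grows every step
        losses.append(loss.detach())  # or loss.item(); the graph can be freed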
You can type "dmesg" in your terminal and scroll down to the bottom. It will show you the message explaining why the process was killed.
Since you mentioned PyTorch, the chances are that your process was killed because it ran out of memory. To resolve this, reduce your batch size until you no longer see the error.
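As a concrete illustration (stand-in dataset and sizes, just a sketch): the batch size directly scales per-step memory use, so halving it is the first knob to turn.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

    # Try e.g. 64 -> 32 -> 16 until the run no longer gets killed.
    loader = DataLoader(dataset, batch_size=32, shuffle=True)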
Hope this helps! :)
To give an idea to people who encounter this:
Apparently, Slurm was installed on the machine, so I needed to submit the tasks through Slurm.
How do you prevent a long-running memory-intensive tar-based backup script from getting killed?
I have a cron job that runs daily a command like:
tar --create --verbose --preserve-permissions --gzip --file "{backup_fn}" {excludes} / 2> /var/log/backup.log
It writes to an external USB drive. Normally the generated file is 100GB, but after I upgraded to Ubuntu 16, the log file shows the process gets killed about 25% of the way through, presumably because it's consuming a lot of memory and/or putting the system under too much load.
How do I tell the kernel not to kill this process, or tweak it so it doesn't consume so many resources that it needs to be killed?
If you are certain that the process gets killed because it consumes too much memory, you can try increasing the swappiness value in /proc/sys/vm/swappiness. By increasing swappiness you might be able to get away from this scenario. You can also try tuning oom_kill_allocating_task; the default is 0, which makes the kernel try to find the rogue memory-hogging task and kill that one. If you change it to 1, the OOM killer will kill the task that triggered the out-of-memory condition.
If none of the above works, you can try oom_score_adj under /proc/$pid/oom_score_adj. oom_score_adj accepts values in the range -1000 to 1000; the lower the value, the less likely the process is to be killed by the OOM killer. If you set it to -1000, OOM killing is disabled for that process. But you should know exactly what you are doing.
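A hedged sketch of applying that last knob from a script (hypothetical helper and PID; requires root). Remember that -1000 exempts the process from the OOM killer entirely, so use it with care:

    def set_oom_score_adj(pid: int, value: int) -> None:
        # Write the adjustment to /proc/<pid>/oom_score_adj (range -1000..1000;
        # lower means less likely to be chosen by the OOM killer).
        with open(f"/proc/{pid}/oom_score_adj", "w") as f:
            f.write(str(value))

    # Example: make the backup process much less likely to be killed.
    # set_oom_score_adj(12345, -800)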
Hope this will give you some idea.