What is the best way to avoid Slurm killing a process with an OOM error, without running the process multiple times to test out different memory constraints? Is there a soft memory limit I can set so that Slurm dynamically allocates more memory?
The best thing I came up with is to use a large memory limit and let the process share resources, but I was wondering if there are better ways to prevent a process from being killed by an OOM error.
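To illustrate, the over-allocation workaround above amounts to something like this in the batch script (the script contents and the 64G figure are only placeholders):

    #!/bin/bash
    # Sketch of the "request a generous hard limit" workaround.
    # --mem is a hard per-node request: if memory accounting is enforced,
    # exceeding it still gets the job killed, so the only safety margin
    # is whatever head-room is requested up front.
    #SBATCH --job-name=oom-prone-job
    #SBATCH --mem=64G              # deliberately generous; placeholder value
    #SBATCH --time=04:00:00
    srun ./my_process              # placeholder for the actual workload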
When you set a CPU limit for a process and that process creates several child processes, do the CPU shares of the parent process increase as the child processes increase their CPU shares?
The same question goes for memory. Memory is different, though, as far as I am concerned, since I've learned that each process has its own heap. Would it then be correct to say that the memory limit of a process isn't influenced by the amount of memory its child processes use?
This question can be viewed as related to my other question.
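For concreteness, assuming the limits in question are set with cgroups, here is a minimal cgroup v2 sketch (paths and values are illustrative, and it must be run as root). A fork()ed child starts in the same cgroup as its parent, so the limits below are accounted for across the parent and all of its children together:

    # Create a cgroup and give it CPU and memory limits (cgroup v2).
    mkdir /sys/fs/cgroup/demo
    echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max       # 50% of one CPU for the whole group
    echo "1G"           > /sys/fs/cgroup/demo/memory.max    # hard memory cap for the whole group
    echo $$             > /sys/fs/cgroup/demo/cgroup.procs  # move this shell (and its future children) in
    ./parent_process                                         # placeholder workload that forks children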
I tried running multiple machine learning processes in parallel (with bash). These are written using PyTorch. After a certain number of concurrent programs (10 in my case), I get the following error:
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
As mentioned in this answer,
...it could occur because the VRAM memory limit was hit (which is rather non-intuitive from the error message).
For my case with PyTorch model training, decreasing batch size helped. You could try this or maybe decrease your model size to consume less VRAM.
I tried the solution mentioned here to enforce a per-process GPU memory usage limit, but the issue persists.
This problem does not occur with a single process, or with fewer processes. Since only one context runs at a single time instant, why does this cause a memory issue?
The issue occurs both with and without MPS. I thought it would occur with MPS but not otherwise, since MPS may run multiple processes in parallel.
Since only one context runs at a single time instant, why does this cause a memory issue?
Context-switching doesn't dump the contents of GPU "device" memory (i.e. DRAM) to some other location. If you run out of this device memory, context switching doesn't alleviate that.
If you run multiple processes, the memory used by each process will add up (just like it does in the CPU space) and GPU context switching (or MPS or time-slicing) does not alleviate that in any way.
It's completely expected that if you run enough processes using the GPU, eventually you will run out of resources. Neither GPU context switching nor MPS nor time-slicing in any way affects the memory utilization per process.
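A practical consequence: if each training process needs a roughly fixed slice of VRAM, the launcher has to cap how many run at once. A rough bash sketch (the script name, config glob, and the limit of 4 are placeholders), plus a way to watch how per-process VRAM adds up:

    # Hypothetical launcher: run many training jobs, at most 4 concurrently.
    MAX_CONCURRENT=4
    for cfg in configs/*.yaml; do
        python train.py "$cfg" &                         # placeholder training command
        while (( $(jobs -rp | wc -l) >= MAX_CONCURRENT )); do
            wait -n                                      # wait for any one job to finish (bash >= 4.3)
        done
    done
    wait

    # Inspect per-process GPU memory usage while the jobs run:
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv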
I just learned about the mlock() family of functions. I know that they allow you to lock program memory into RAM (allowing the physical address to change, but not allowing the memory to be evicted). I've read that newer Linux kernel versions have an mlock limit (ulimit -l), but that it only applies to unprivileged processes. If this is a per-process limit, could an unprivileged process spawn a ton of processes by fork()-ing, have each one call mlock(), and keep going until all memory is locked up and the OS slows to a crawl because of constant swapping or OOM-killer invocations?
It is possible that an attacker could cause problems with this, but not materially more problems than they could cause otherwise.
The default limit for this on my system is about 2 MB. That means a typical process won't be able to lock more than 2 MB of data into memory. Note that this is just normal memory that won't be swapped out; it's not an independent, special resource.
It is possible that a malicious process could spawn many other processes to use more locked memory, but because a process usually requires more than 2 MB of memory to run anyway, they're not really exhausting memory more efficiently by locking it; in fact, starting a new process is itself going to be more effective at using memory than locking it. It is true that a process could simply fork, lock memory, and sleep, in which case its other pages would likely be shared because of copy-on-write, but it could also just allocate a decent chunk of memory and cause many more problems, and in fact it will generally have permission to do so, since many processes require non-trivial amounts of memory.
So, yes, it's possible that an attacker could use this technique to cause problems, but because there are many easier and more effective ways to exhaust memory or cause other problems, this seems like a silly way to go about doing it. I, for one, am not worried about this as a practical security problem.
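To see the numbers being discussed here on a given machine (the defaults vary by distribution), the per-process locked-memory limit and where it is configured can be checked from a shell:

    ulimit -l                                   # soft locked-memory limit for this shell, in KiB
    ulimit -H -l                                # hard limit the soft limit may be raised to
    grep 'Max locked memory' /proc/self/limits  # the same limits as the kernel reports them
    grep -r memlock /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null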
What are some strategies to work around or debug this?
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.17 GB -- Worker memory limit: 32.66 GB
Basically, I am just running lots of parallel jobs on a single machine, but via a dask-scheduler, and have tried various numbers of workers. Any time I launch a large number of jobs, the memory gradually creeps up over time and only goes down when I bounce the cluster.
I am trying to use fire_and_forget. Will releasing the futures with .release() help? I am typically launching these tasks via client.submit from the REPL and then terminating the REPL.
I would be happy to occasionally bounce workers and add some retry patterns if that is the correct way to use Dask with leaky libraries.
UPDATE:
I have tried limiting worker memory to 2 GB, but am still getting this error. When the error happens, it seems to go into some sort of unrecoverable loop, continually printing the error, and no compute happens.
Dask isn't leaking the memory in this case. Something else is. Dask is just telling you about it. Something about the code that you are running with Dask seems to be leaking something.
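If the leak really is in the libraries being called, one pattern that matches the "occasionally bounce workers" idea is to give each worker a hard memory limit plus a finite lifetime, so the nanny restarts it periodically. A sketch using the CLI (addresses and values are placeholders, and flag names can differ slightly between distributed versions):

    # Scheduler on one node / terminal:
    dask-scheduler

    # Workers with a hard memory limit and staggered periodic restarts:
    dask-worker tcp://scheduler-host:8786 \
        --nworkers 4 --nthreads 1 \
        --memory-limit 2GB \
        --lifetime 1h --lifetime-stagger 5m --lifetime-restart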
We're using SLURM to manage job scheduling on our computing cluster, and we are experiencing a problem with memory management. Specifically, we can't figure out how to allocate memory for a specific task.
Consider the following setup:
Each node has 32GB memory
We have a SLURM job that sets --mem=24GB
Now, assume we want to run that SLURM job twice, concurrently. What I expect (or want) to happen is that when I queue it by calling sbatch runscript.sh twice, one of the two jobs will run on one node and the other will run on another node. However, as it currently stands, SLURM schedules both tasks on the same node.
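For reference, a minimal version of the job script in question might look like this (the payload line is a placeholder; only the memory request matters here):

    #!/bin/bash
    #SBATCH --job-name=memtest
    #SBATCH --mem=24G              # the 24 GB per-node request from the setup above
    srun ./my_program              # placeholder for the actual workload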
One of the possible causes we've identified is that SLURM appears to check only whether 24GB of memory is available on the node (i.e., not actively in use), instead of checking whether it has already been requested/allocated by another job.
The question here is: is it possible to allocate/reserve memory per task in SLURM?
Thanks for your help!
In order to be able to manage memory, Slurm needs the SelectTypeParameters option in slurm.conf to include memory as a consumable resource. Changing that parameter to CR_Core_Memory should be enough for Slurm to start managing memory.
If that is not set, --mem will not reserve memory; it will only ensure that the node has enough memory configured.
More information here
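For reference, checking and setting this looks roughly like the following (a sketch; the exact select plugin depends on the site configuration):

    # See what the running cluster is currently configured with:
    scontrol show config | grep -E 'SelectType|SelectTypeParameters'

    # In slurm.conf, memory must be listed as a consumable resource, e.g.:
    #   SelectType=select/cons_tres
    #   SelectTypeParameters=CR_Core_Memory
    # After editing slurm.conf, apply the change:
    scontrol reconfigure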
@CarlesFenoy's answer is good, but to answer
The question here is: is it possible to allocate/reserve memory per task in SLURM?
the parameter you are looking for is --mem-per-cpu, used in combination with --cpus-per-task.
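As a sketch, the job script from the question could reserve memory per task along these lines (the numbers are only an example):

    #!/bin/bash
    # Example only: reserve memory in proportion to the CPUs of each task.
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem-per-cpu=3G       # 8 x 3G = 24G reserved for this job
    srun ./my_program              # placeholder for the actual workload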