I noticed that when I interrupt a program (written in PyTorch with CUDA enabled) multiple times, the GPU memory does not get flushed. Is there any way to free it in such instances, when the script needs to be interrupted several times without finishing its execution?
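In case it helps to have it in one place: a minimal sketch of the usual in-process cleanup, assuming the interrupted run lives in the same Python session (the leftover tensor below is just a stand-in for whatever the interrupted script left referenced):

```python
import gc
import torch

# Stand-in for the model/tensors an interrupted run leaves behind.
leftover = torch.empty(1024, 1024, device="cuda")

# 1) Drop the Python references so the caching allocator can reuse the blocks.
del leftover
gc.collect()

# 2) Return the now-unreferenced cached blocks to the driver so tools like
#    nvidia-smi report the memory as free again.
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```

If the interrupted script ran as a separate process, its memory is only released once that process has actually exited, so it is worth checking for stale PIDs (e.g. in nvidia-smi) before anything else.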
Related
This question can be viewed as related to my other question.
I tried running multiple machine learning processes in parallel (with bash). These are written using PyTorch. After a certain number of concurrent programs (10 in my case), I get the following error:
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
As mentioned in this answer,
...it could occur because the VRAM memory limit was hit (which is rather non-intuitive from the error message).
For my case with PyTorch model training, decreasing batch size helped. You could try this or maybe decrease your model size to consume less VRAM.
I tried the solution mentioned here to enforce a per-process GPU memory usage limit, but the issue persists.
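For reference, the per-process cap I mean looks roughly like this (a sketch; the 0.2 fraction and the device index are example values, not what the linked answer prescribes):

```python
import torch

# Cap this process's caching allocator at ~20% of device 0's total memory.
# Allocations beyond the cap raise an out-of-memory error instead of letting
# the process grow to fill the card.
torch.cuda.set_per_process_memory_fraction(0.2, device=0)

x = torch.empty(1024, 1024, 1024, device="cuda:0")  # ~4 GiB; fails if it exceeds the cap
```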
This problem does not occur with a single process, or with a smaller number of processes. Since only one context runs at any single time instant, why does this cause a memory issue?
The issue occurs both with and without MPS. I expected it might occur with MPS but not otherwise, since MPS can run multiple processes in parallel.
Since only one context runs at any single time instant, why does this cause a memory issue?
Context-switching doesn't dump the contents of GPU "device" memory (i.e. DRAM) to some other location. If you run out of this device memory, context switching doesn't alleviate that.
If you run multiple processes, the memory used by each process adds up (just as it does on the CPU side), and GPU context switching (or MPS, or time-slicing) does not alleviate that in any way.
It is completely expected that if you run enough processes using the GPU, you will eventually run out of resources; none of these mechanisms affects the memory utilization per process.
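You can see the per-device budget that all processes share with a quick query (a sketch; the 2 GiB per-process figure is purely illustrative):

```python
import torch

# Free/total memory on device 0, as seen across all processes on the card.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free {free_bytes / 2**30:.1f} GiB of {total_bytes / 2**30:.1f} GiB")

# If each training process needs roughly this much VRAM (weights, activations,
# cuDNN workspace), the device bounds how many can coexist, regardless of
# context switching, MPS, or time-slicing.
per_process = 2 * 2**30
print("processes that fit ~", free_bytes // per_process)
```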
I want to run a test to see how the synchronization works. I assume that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass where gradients are synchronized. I used a 4-GPU machine and set the environment variable CUDA_VISIBLE_DEVICES so that only 2 GPU processes could be started. With only 2 GPU processes started, I assumed that at the end of the first batch the synchronization on the existing 2 GPUs would wait on the other two and time out, since the other two never started. What I observed is that training continued with only 2 GPU processes, even though the world size is 4. How can this be explained? Is my understanding incorrect?
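For reference, the setup I am describing boils down to something like this (a rough sketch, not my actual training code; the timeout is where I expected the missing ranks to surface):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

def init_worker(rank: int, world_size: int = 4):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # With world_size=4 I expected the rendezvous and the gradient
    # all-reduce at the end of the first batch to block until all
    # four ranks are present, and to time out if two never start.
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(minutes=30),
    )
    torch.cuda.set_device(rank % torch.cuda.device_count())
```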
I am working on Ubuntu Linux 16.04, using Python 3.5.2.
In my code, I create more than 7 threads using the Thread class of Python's threading module. The scenario is that RAM usage increases continuously while the code runs. I obtained the process ID of my code with os.getpid(), and using the psutil library in Python and the ps command in Linux I got the memory usage of my process (including all threads), but I could not determine the memory used by a particular thread.
When I run my Python program, one process is created with several lightweight processes in Linux. The lightweight process IDs and their number are known from ps, but I still have not found any method or tool, in Linux or in Python, that can give me the memory used by a particular lightweight process (apart from the main process).
Is it possible to know the memory used by an individual thread, apart from the main process memory, in Linux or Python?
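For context, this is roughly how I read the numbers at the moment (a sketch; the figures psutil gives are for the whole process, which is exactly the limitation I am asking about):

```python
import os
import threading
import psutil

proc = psutil.Process(os.getpid())

def report(tag):
    print(f"{tag}: {proc.memory_info().rss / 2**20:.1f} MiB RSS, "
          f"{proc.num_threads()} threads")

def work():
    buf = bytearray(50 * 2**20)  # each thread allocates ~50 MiB

report("before")
threads = [threading.Thread(target=work) for _ in range(7)]
for t in threads:
    t.start()
for t in threads:
    t.join()
report("after")  # RSS is reported per process; the threads share one address space
```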
Thank you in advance.
I have a MATLAB processing script located in the middle of a long processing pipeline running on Linux.
The MATLAB script applies the same operation to a number N of datasets D_i (i = 1, 2, ..., N) in parallel on 8 cores via parfor.
Usually, processing the whole dataset takes about 2 hours (on 8 cores).
Unfortunately, from time to time, one of the MATLAB subprocesses appears to crash randomly. This makes the job impossible to complete (and the pipeline can't finish).
I am sure this does not depend on the data: if I reprocess specifically the D_i on which the process crashed, it executes without problems. Moreover, I have already processed thousands of these datasets so far.
How I deal with the problem now (manually):
After I start the MATLAB job, I periodically check the process list on the machine (via a simple top); whenever a MATLAB process is still alive after two hours of work, I know for sure that it has crashed. Then I simply kill it and reprocess the part of the dataset that has not been analyzed.
Question:
I am looking for suggestions on how to time out ALL the running MATLAB processes and kill them whenever they have been alive for more than, e.g., 2 hours of CPU time.
You should be able to do this by restructuring your code to use PARFEVAL instead of PARFOR. There's a simple example in this entry on Loren's blog: http://blogs.mathworks.com/loren/2013/12/09/getting-data-from-a-web-api-in-parallel/ which shows how you can stop waiting for work after a given amount of time.
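Alternatively, if restructuring the MATLAB code is not an option, the manual kill described in the question can be automated with a small watchdog outside MATLAB (a sketch in Python using psutil; the 2-hour CPU threshold and the "matlab" process-name match are assumptions to adapt):

```python
import time
import psutil

CPU_LIMIT_SECONDS = 2 * 60 * 60  # kill after ~2 hours of CPU time

def kill_stuck_matlab():
    for p in psutil.process_iter(["name"]):
        try:
            if "matlab" not in (p.info["name"] or "").lower():
                continue
            cpu = p.cpu_times()
            if cpu.user + cpu.system > CPU_LIMIT_SECONDS:
                print(f"killing PID {p.pid} after {cpu.user + cpu.system:.0f} s CPU")
                p.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

while True:
    kill_stuck_matlab()
    time.sleep(300)  # re-check every 5 minutes
```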
We wrote a very simple C++ program to isolate a bug. The app takes a number as an argument, creates that number of threads, and sends all of those threads into an event loop. If we run the app with more than 3 threads (including the main thread), top shows it taking 100+ MB of virtual memory. However, if we run it with 3 or fewer threads, it runs with about 36 MB of virtual memory. We straced the app and found that in the first scenario there is an anonymous mmap of about 65 MB that never gets unmapped. The problem is that memory usage goes up as the number of threads goes up, and we have a large number of binaries with a large number of threads, so there seems to be a lot of wasted space. Why does this happen? This is on SLES11, 64-bit.
Each thread gets a stack of around 8 MB by default. You can set the size when you create a thread with pthread_attr_setstacksize. Also make sure you always either pthread_join() threads that have ended, or create them as detached threads; otherwise you'll leak memory when a thread ends.
Having a big virtual memory footprint is usually not a problem, though; unless you really are using all that space, it's just virtual memory, and you'll hardly run out of that on a 64-bit machine.