I am confused about the usage of .to(device) in PyTorch. I know that it moves variables to the GPU, but after that, let's say we multiply two GPU tensors together: will our computer know to use the GPU for that, or will it cast them back to the CPU and perform the computation there? I guess I am just confused about when and how the computer knows to use the GPU beyond us telling it to hold some variable in its memory.
All operations on tensors that reside in GPU memory are performed on the GPU. Furthermore, if multiple tensors are involved in an operation, all of them need to be on the same .device.
The reason this works is that no matter what operation we perform on tensors, we always end up calling methods of the torch.Tensor class. Adding two tensors with a + b actually calls torch.Tensor.add, so whenever an operation is executed, it is done with full knowledge of which device the operands live on.
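To illustrate (a minimal sketch; the shapes are arbitrary and it assumes a CUDA-capable GPU is available), the result of an operation on GPU tensors is itself allocated on the GPU, with no round trip through the CPU:
import torch

device = torch.device('cuda')
a = torch.randn(3, 4, device=device)   # created directly in GPU memory
b = torch.randn(3, 4, device=device)

c = a * b                              # the elementwise multiply runs on the GPU
print(c.device)                        # cuda:0 - the result stays on the GPU

d = torch.randn(3, 4)                  # a CPU tensor
# a * d                                # raises a RuntimeError: the tensors are on different devices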
I designed a neural network in PyTorch which demands a lot of GPU memory, or else it only runs with a very small batch size.
The GPU runtime error is caused by three lines of code, which create two new tensors and perform some operations on them.
I don't want to run my code with a small batch size. So I want to execute those three lines of code (and hence store those new tensors) on the CPU, and keep all the remaining code on the GPU as usual.
Is it possible to do?
It is possible.
You can use .to(device=torch.device('cpu')) to move the relevant tensors from the GPU to the CPU, and back to the GPU afterwards:
orig_device = a.device # store the device from which the tensor originated
# move tensors a and b to CPU
a = a.to(device=torch.device('cpu'))
b = b.to(device=torch.device('cpu'))
# do some operation on a and b - it will be executed on CPU
res = torch.bmm(a, b)
# put the result back to GPU
res = res.to(device=orig_device)
A few notes:
Moving tensors between devices, or between the GPU and the CPU, is not an unusual event. Splitting a computation across devices like this is often described as "model parallelism" - you can google the term for more details and examples.
Note that .to() is not an "in place" operation: it returns a new tensor, so you need to assign the result back (as done above).
Moving tensors back and forth between the GPU and the CPU takes time, so this kind of "model parallelism" might not be worthwhile here. If you are struggling with GPU memory, you might consider gradient accumulation instead (sketched below).
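For reference, here is a minimal sketch of gradient accumulation (the model, optimizer, loss_fn and loader names are placeholders, not taken from the question): several small batches contribute to the same gradient, and the optimizer steps only once per group, which simulates a larger effective batch size without the extra memory.
# model, optimizer, loss_fn and loader are assumed to already exist
accumulation_steps = 4                       # effective batch size = 4 * loader batch size
optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    inputs, targets = inputs.cuda(), targets.cuda()
    loss = loss_fn(model(inputs), targets)
    (loss / accumulation_steps).backward()   # scale so the accumulated gradient is an average
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                     # update weights only every accumulation_steps batches
        optimizer.zero_grad()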
policy_data, value_data, action_mask = policy_data.cuda(non_blocking=True), value_data.cuda(non_blocking=True), action_mask.cuda(non_blocking=True)
rewards, regret_probs = rewards.cuda(non_blocking=True), regret_probs.cuda(non_blocking=True)
return action_probs.cpu(), sample_probs.cpu(), sample_indices.cpu(), update
I am doing some RL work and am wondering whether it would be possible to speed up fragments like the above by launching the data transfers to the GPU on different streams and then waiting on them together. Does PyTorch have any functions that would make this easier? I'd rather ask here before I dive into the minutiae of optimizing data transfers.
It seems like one potential solution would be to pack all of the data into a single tensor (though of course you'd likely pay a small cost due to unused elements within this compacted representation). An alternative would be to store this compact tensor as a sparse tensor (no additional data, but slightly more memory consumption per value). You'd have to benchmark both to determine which is more efficient for your use case.
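As a rough sketch of the packing idea (the tensor names follow the snippet above, but the common dtype and the pin_memory/non_blocking details are assumptions on my part): concatenate the flattened tensors on the CPU, issue a single asynchronous copy, and slice the pieces back out on the GPU.
import torch

# assumed: policy_data, value_data and action_mask are CPU tensors of the same dtype
parts = (policy_data, value_data, action_mask)
sizes = [t.numel() for t in parts]
shapes = [t.shape for t in parts]

packed = torch.cat([t.flatten() for t in parts])
packed = packed.pin_memory().cuda(non_blocking=True)   # one transfer instead of three

# recover the individual tensors on the GPU
policy_gpu, value_gpu, mask_gpu = (
    chunk.view(shape) for chunk, shape in zip(packed.split(sizes), shapes)
)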
I have a single NVIDIA GPU with 16 GB of memory. I have to run two different (and independent, meaning two different problems: one is a vision task, the other an NLP task) Python programs. The code is written using PyTorch, and both programs can use the GPU.
I have tested that program 1 takes roughly 5 GB of GPU memory, and the rest is free. If I run the two programs together, will it hamper the models' performance or cause any process conflicts?
(There is a linked question, but it is not specifically about PyTorch code.)
I do not know the details of how this works, but I can tell from experience that both programs will run well (as long as they do not need more than 16 GB of GPU memory combined), and execution times should stay roughly the same.
However, computer vision usually requires a lot of IO (mostly reading images); if the other task also needs to read files, this part may become slower than when running each program individually.
It should work fine.
In one of my projects, I faced a lack of GPU memory while working with multiple models. After loading them, the models took up most of the GPU memory, and during inference very little memory remained for the data. As we know, if your models are loaded on the GPU, then you also need to load your data onto the GPU. So when you do batch inference (e.g. giving 16 images at a time to the model), the complete batch is loaded onto the GPU, which takes even more GPU memory. Your program crashes if it cannot get enough GPU memory.
If you think GPU memory is not the issue in your case, then everything should work fine. You also do not need to worry about conflicts, because both processes will allocate their own GPU memory and work independently. There should be no performance issues.
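If you want to check the headroom before launching the second program, you can query free GPU memory from PyTorch itself (a minimal sketch; torch.cuda.mem_get_info is only available in fairly recent PyTorch versions):
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()   # free/total memory on the current device
print(f"free: {free_bytes / 1024**3:.1f} GiB / total: {total_bytes / 1024**3:.1f} GiB")

# memory held by this process' PyTorch caching allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.1f} GiB")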
I've seen some discussion about MCTS and GPUs. It's said that there is no advantage to using the GPU, as the algorithm doesn't involve many matrix multiplications. But there is a drawback to using the CPU, as transferring data between devices really takes time.
What I mean is that the nodes and the tree should live on the GPU, so they can process the data on the GPU without copying it from the CPU. If I just create node and tree classes, their methods will run on the CPU.
So I wonder whether I can move the search part to the GPU. Is there any example?
I can load the model and a data sample into GPU memory, but when I call forward on the model with the sample, I get a CUDA out of memory error.
I'm sure the model and data have been loaded, as my code is structured as follows (pseudocode):
model = Model().cuda()           # the model's parameters now live in GPU memory
sample = load_sample().cuda()    # so does the input sample
sleep(5)                         # pause to check memory usage with nvidia-smi
print('before forward')
model(sample)                    # the CUDA out of memory error is raised here
print('after forward')
"before forward" gets printed, but "after forward" does not.
I assumed all the memory necessary for a forward pass gets allocated during construction of the model, but I don't know how else this error could happen. I also cannot find anything about it on Google.
Python: 3.6.9
PyTorch: 1.2.0
It is not possible to determine the amount of space required to store the activations before runtime, and hence GPU memory usage grows during the forward pass. PyTorch maintains a dynamic computation graph, so the order of computations is not known before runtime. When you declare/initialize the model, only __init__ is called and the model parameters are initialized. To figure out the graph, one would need to look at the forward call, and possibly also at the loss function (if it is not inside the forward call).
Even if we could inspect the forward call before running the model, the batch size is still unknown, so memory for the activations cannot be pre-allocated.
Even if the batch size were known, there could be other unknowns, such as the sequence length (for RNNs) or the episode length (in RL), that make it hard to pre-allocate memory for the activations. And even if we accounted for all of this at declaration time, PyTorch naturally allows for-loops in the forward pass, which makes it practically impossible to pre-allocate space for the activations; hence GPU memory can grow at runtime depending on the use case.
As Umang Gupta pointed out in the comments, GPU memory increases during a forward() call on a PyTorch model because (possibly amongst other things) the batch size is not known before runtime. Therefore the required memory cannot be reserved beforehand, and the GPU memory can still grow after the model and the data have already been loaded.
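You can observe this yourself (a minimal sketch; the layer sizes and batch size are arbitrary) by comparing torch.cuda.memory_allocated() before and after the forward pass - the activations only appear once forward() actually runs:
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(256, 4096, device='cuda')

before = torch.cuda.memory_allocated()
out = model(x)                       # activation memory is allocated only now
after = torch.cuda.memory_allocated()
print(f"extra memory used by the forward pass: {(after - before) / 1024**2:.1f} MiB")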