I believed that the inference time per batch was independent of the batch size when using a GPU, but this minimal example tells me that this doesn't seem true:
import torch
from torch import nn
from tqdm import tqdm

BATCH_SIZE = 32
N_ITER = 10000

class NN(nn.Module):
    def __init__(self):
        super(NN, self).__init__()
        self.layer = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=3, bias=False)

    def forward(self, input):
        out = self.layer(input)
        return out

cnn = NN().cuda()
cnn.eval()

tensor = torch.rand(BATCH_SIZE, 3, 999, 999).cuda()

with torch.no_grad():
    for _ in tqdm(range(N_ITER), mininterval=0.1):
        out = cnn(tensor)
When increasing BATCH_SIZE, the time per iteration shown by tqdm increases proportionally (i.e. the "it/s" drops):
[Plot of inference time vs batch size]
It was my belief that the GPU can process the entire tensor simultaneously, as long as it doesn't exhaust the memory. Maybe I'm missing something about how GPUs process data in parallel, so I would appreciate some insight here.
I am using an NVIDIA GeForce RTX 2080 Ti, PyTorch 1.6.0 and CUDA 10.2.
You are wrong. GPUs have many cores, but that does not mean they can process all of the data at once. For instance, an RTX 2080 Ti has only 4352 CUDA cores, so once a batch requires more work than those cores can execute concurrently, the extra work is serialized and the time per batch grows with the batch size.
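To see this directly, here is a minimal timing sketch (assuming a CUDA device is available; the layer and input sizes mirror the snippet above). torch.cuda.synchronize() is needed because CUDA kernels are launched asynchronously; once the convolution saturates the cores, the measured time per batch grows roughly linearly with the batch size.
import time
import torch
from torch import nn

conv = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=3, bias=False).cuda().eval()

with torch.no_grad():
    for batch_size in (1, 2, 4, 8, 16, 32):
        x = torch.rand(batch_size, 3, 999, 999).cuda()
        conv(x)                       # warm-up so the timing excludes one-off setup costs
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(20):
            conv(x)
        torch.cuda.synchronize()      # wait for all queued kernels to finish
        print(f"batch {batch_size:2d}: {(time.time() - start) / 20 * 1000:.1f} ms per batch")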
Related
I am using PyTorch. In my computation, I move some data and an operator A to the GPU. In an intermediate step, I move the data and an operator B to the CPU and continue the forward pass.
My question is:
My operator B is so memory-consuming that it cannot run on the GPU. Will computing some parts on the GPU and the others on the CPU affect backpropagation?
PyTorch keeps track of the location of tensors. As long as you move them with PyTorch's native commands, .cpu() or .to('cpu'), you should be okay.
See, e.g., this model parallel tutorial - the computation is split between two different GPU devices.
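Below is a minimal sketch of that situation (the module names and sizes are illustrative, and a CUDA device is assumed): a light operator runs on the GPU, a heavier one stays on the CPU, and because autograd records the .cuda()/.cpu() moves, backward() propagates gradients across both devices.
import torch
from torch import nn

class MixedDeviceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.op_a = nn.Linear(16, 16).cuda()   # the GPU-friendly part
        self.op_b = nn.Linear(16, 4)           # the memory-hungry part stays on the CPU

    def forward(self, x):
        x = self.op_a(x.cuda())                # forward on the GPU
        return self.op_b(x.cpu())              # continue the forward on the CPU

model = MixedDeviceModel()
out = model(torch.randn(8, 16))
out.sum().backward()                           # gradients flow back through the device move
print(model.op_a.weight.grad.shape)            # the GPU parameter still receives a gradient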
If your model fits into a single GPU's memory, you might let PyTorch distribute the work for you with the DataParallel (one process, multiple threads) or DistributedDataParallel (multiple processes, multiple threads, single or multiple nodes) frameworks.
The code below checks whether you have more than one GPU (torch.cuda.device_count() > 1) and, if so, enables DataParallel mode with model = nn.DataParallel(model):
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)
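For completeness, a short usage sketch continuing the snippet above (input_size and the batch of 30 samples are illustrative): DataParallel scatters the batch along dim 0, runs the replicas in parallel, and gathers the outputs back on device.
inputs = torch.randn(30, input_size).to(device)   # one batch of 30 samples
outputs = model(inputs)                           # each GPU receives a slice of the batch
print("output size", outputs.size(), "gathered on", outputs.device)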
DataParallel replicates the same model on all GPUs, and each GPU consumes a different partition of the input data. This can significantly accelerate training, but it does not help when the model is too large to fit on a single GPU.
To solve that problem, you can resort to a model parallel approach, which splits a single model across different GPUs rather than replicating the entire model on each GPU.
(e.g. a model m contains 10 layers: when using DataParallel, each GPU
will have a replica of each of these 10 layers, whereas when using
model parallel on two GPUs, each GPU could host 5 layers)
Here is an example where .to('cuda:0') indicates the device on which each layer should be placed:
import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))
backward() then automatically takes location into consideration.
model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
This snippet suggests that the gradient is preserved when computation goes through different devices.
def change_device():
    import torch
    import torch.nn as nn

    a = torch.rand((4, 32))
    m1 = nn.Linear(32, 32)
    cpu = m1(a)

    gpu = cpu.to(0)
    m2 = nn.Linear(32, 32).to(0)
    out = m2(gpu)

    loss = out.sum()
    loss.backward()
    print(m1.weight.grad)
    # works like magic
    """
    tensor([[ 0.7746,  1.0342,  0.8706,  ...,  1.0993,  0.7975,  0.3915],
            [-0.5369, -0.7169, -0.6034,  ..., -0.7619, -0.5527, -0.2713],
            [ 0.3607,  0.4815,  0.4053,  ...,  0.5118,  0.3713,  0.1823],
            ...,
            [ 1.1200,  1.4955,  1.2588,  ...,  1.5895,  1.1531,  0.5660],
            [-0.1582, -0.2112, -0.1778,  ..., -0.2245, -0.1629, -0.0799],
            [-0.4531, -0.6050, -0.5092,  ..., -0.6430, -0.4665, -0.2290]])
    """
Modifying this snippet shows that the gradient is preserved when a tensor moves from GPU to CPU as well.
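For reference, a sketch of that modification (assuming a CUDA device): the first layer runs on the GPU, the activation is moved to the CPU for the second layer, and the GPU layer still receives a gradient.
import torch
import torch.nn as nn

a = torch.rand((4, 32), device=0)
m1 = nn.Linear(32, 32).to(0)        # GPU layer
m2 = nn.Linear(32, 32)              # CPU layer
out = m2(m1(a).cpu())               # move the activation from GPU to CPU mid-forward
out.sum().backward()
print(m1.weight.grad is not None)   # True: the gradient crossed the device boundary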
I want to know whether the instructions in the forward definition of a deep model class are executed sequentially. For example:
class Net(nn.Module):
    ...
    def forward(self, x):
        #### Group 1
        y = self.conv1(x)
        y = self.conv2(y)
        y = self.conv3(y)

        ### Group 2
        z = self.conv4(x)
        z = self.conv5(z)
        z = self.conv6(z)

        out = torch.cat((y, z), dim=1)
        return out
In this case the Group 1 and Group 2 instructions could be parallelized, since they are independent. Will the forward definition recognize this automatically, or will they be executed sequentially? If not, how can I run them in parallel?
I am running PyTorch 1.3.1.
Thank you very much.
They are executed sequentially; only the calculations within each operation are parallelized. As far as I'm aware, there is no direct way to make PyTorch run them in parallel.
I'm assuming that you are expecting a performance improvement from running them in parallel, but that would be minimal at best and much slower at worst, because operations like convolutions are already heavily parallelized, and unless the input is extremely small, all cores will be occupied permanently. Running multiple convolutions in parallel would cause a lot of context switches, unless you distributed the available cores evenly, but that wouldn't really be faster than running them sequentially with all cores.
You can observe the same behaviour if you run two PyTorch programs at the same time, for example by running the following, which has 3 relatively common convolutions and uses 224x224 images (as in ImageNet); that is small compared to what other models (e.g. object detection) use:
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)

    def forward(self, input):
        out = self.conv1(input)
        out = self.conv2(out)
        out = self.conv3(out)
        return out

input = torch.randn((10, 3, 224, 224))
model = Model().eval()

# Running it 100 times just to create a microbenchmark
for i in range(100):
    out = model(input)
To obtain information about context switches, /usr/bin/time can be used (not the shell built-in time).
/usr/bin/time -v python bench.py
Single run:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.68
Involuntary context switches: 857
Running two instances at the same time:
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:43.69
Involuntary context switches: 456753
To clarify, each of the instances took about 43 seconds, that's not the accumulated time.
I finished training a semantic segmentation model (PyTorch 0.4.1, GPU, CUDA 9.0) and ran inference successfully with PyTorch 0.4.1. However, when I switched my PyTorch version to 1.1.0, I got slightly different results. What is the problem?
I narrowed the difference down to a single Conv2d layer. In PyTorch 0.4.1, the output of nn.Conv2d and the output computed by hand with the formula are always the same, but in PyTorch 1.1 they sometimes differ. I'm confused!
import torch
import torch.nn as nn

torch.set_printoptions(precision=64)

input_t = torch.randn((3, 3))
input_t = input_t.unsqueeze(0).unsqueeze(0).float()

class minimodel(nn.Module):
    def __init__(self):
        super(minimodel, self).__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3)

    def forward(self, x):
        x = self.conv(x)
        return x

demo_model = minimodel()
weight = torch.load("model.ckpt")
demo_model.load_state_dict(torch.load("model.ckpt"))
demo_model.eval()
output_t = demo_model(input_t)
# torch.save(demo_model.state_dict(), "model.ckpt")
print("#####output of nn.Conv2d#####")
print(output_t)

Kernel_W = weight['conv.weight']
print("####output of formula######")
print(torch.add(weight['conv.bias'], torch.sum(torch.mul(Kernel_W, input_t))))
I am trying to set up an image classifier using PyTorch. My sample images have 4 channels and are 28x28 pixels in size. I am trying to use the built-in torchvision.models.inception_v3() as my model. Whenever I try to run my code, I get this error:
RuntimeError: Calculated padded input size per channel: (1 x 1).
Kernel size: (3 x 3). Kernel size can't greater than actual input size
at
/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THNN/generic/SpatialConvolutionMM.c:48
I can't find out how to change the padded input size per channel, and I can't quite figure out what the error means. I assume I must modify the padded input size per channel, since I can't edit the kernel size in the pre-made model.
I have tried padding, but it didn't help.
Here is a shortened part of my code that throws the error when I call train():
import torch
import torchvision as tv
import torch.optim as optim
from torch import nn
from torch.utils.data import DataLoader

model = tv.models.inception_v3()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001, weight_decay=0)
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.9)

trn_dataset = tv.datasets.ImageFolder(
    "D:/tests/classification_test_data/trn",
    transform=tv.transforms.Compose([tv.transforms.RandomRotation((0, 275)),
                                     tv.transforms.RandomHorizontalFlip(),
                                     tv.transforms.ToTensor()]))
trn_dataloader = DataLoader(trn_dataset, batch_size=32, num_workers=4, shuffle=True)

for epoch in range(0, 10):
    train(trn_dataloader, model, criterion, optimizer, lr_scheduler, 6, 32)
print("End of training")

def train(train_loader, model, criterion, optimizer, scheduler, num_classes, batch_size):
    model.train()
    scheduler.step()
    for index, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        outputs_flatten = flatten_outputs(outputs, num_classes)
        loss = criterion(outputs_flatten, labels)
        loss.backward()
        optimizer.step()

def flatten_outputs(predictions, number_of_classes):
    logits_permuted = predictions.permute(0, 2, 3, 1)
    logits_permuted_cont = logits_permuted.contiguous()
    outputs_flatten = logits_permuted_cont.view(-1, number_of_classes)
    return outputs_flatten
It could be due to the following. The PyTorch documentation for the inception_v3 model notes that the model expects input of shape N x 3 x 299 x 299, because the architecture contains a fully connected layer with a fixed shape.
Important: In contrast to the other models the inception_v3 expects tensors with a size of N x 3 x 299 x 299, so ensure your images are sized accordingly.
https://pytorch.org/docs/stable/torchvision/models.html#inception-v3
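One way around it (a sketch; the transform values are illustrative) is to resize the images to the 299x299 input Inception v3 expects and drop the extra channel, since the stock model also assumes 3 input channels:
import torchvision as tv

transform = tv.transforms.Compose([
    tv.transforms.Resize((299, 299)),        # upscale the 28x28 images to the expected size
    tv.transforms.ToTensor(),
    tv.transforms.Lambda(lambda t: t[:3]),   # keep only the first 3 of the 4 channels
])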
Maybe this is a late post, but I sorted this out with a simple technique.
When I got this kind of error, I was using a custom conv2d module and had somehow missed passing the padding to my nn.Conv2d.
I found the error by printing the shape of the output variable inside my conv2d implementation, which revealed the exact bug in my code.
model = VGG_BNN_ReLU('VGG11',10)
import torch
x = torch.randn(1,3,32,32)
model.forward(x)
Hope this helps. Happy learning!
I am trying to train a VGG-style CNN on TensorFlow, and my input size is batch size 2 x height 1080 x width 1920 x 5 channels. My network structure is:
Conv 3*3*64
Conv 3*3*128
Maxpooling 3*3, stride 3
Conv 3*3*256
and a Conv with 1*1*256*2 outputs.
Everything runs OK with the CPU version, but I get an error when using the GPU:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[2,128,1080,1920]
[[Node: Feature_Map/Conv2 = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](Feature_Map/Act1, Feature_Map/Variable_2/read)]]
It seems to get stuck on the second layer. My GPU is a GTX 1060 (6 GB), running in Spyder on Windows. I have added the following command to allow GPU memory growth:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
Can someone help me with this? Thanks very much