Suppose I have two tensors S and T defined as:
S = torch.rand((3,2,1))
T = torch.ones((3,2,1))
We can think of these as containing batches of tensors with shapes (2, 1). In this case, the batch size is 3.
I want to concatenate all possible pairings between batches. A single concatenation of batches produces a tensor of shape (4, 1). And there are 3*3 combinations so ultimately, the resulting tensor C must have a shape of (3, 3, 4, 1).
One solution is to do the following:
C = torch.empty(S.shape[0], T.shape[0], 4, 1)
for i in range(S.shape[0]):
    for j in range(T.shape[0]):
        C[i,j,:,:] = torch.cat((S[i,:,:],T[j,:,:]))
But the for loop doesn't scale well to large batch sizes. Is there a PyTorch command to do this?
I don't know of any out-of-the-box command that does such an operation. However, you can pull it off in a straightforward way using a single matrix multiplication.
The trick is to start from the stacked S and T tensors, then multiply that stack by a suitably chosen mask tensor that selects every pair of batch elements. In this method, keeping track of shapes and dimension sizes is essential.
The stack is given by (notice the reshape, we essentially flatten the batch elements from S and T into a single batch axis on ST):
>>> ST = torch.stack((S, T)).reshape(6, 2)
>>> ST
tensor([[0.7792, 0.0095],
[0.1893, 0.8159],
[0.0680, 0.7194],
[1.0000, 1.0000],
[1.0000, 1.0000],
[1.0000, 1.0000]])
# ST.shape = (6, 2)
You can retrieve all (S[i], T[j]) pairs using range and itertools.product:
>>> from itertools import product
>>> indices = torch.tensor(list(product(range(0, 3), range(3, 6))))
>>> indices
tensor([[0, 3],
[0, 4],
[0, 5],
[1, 3],
[1, 4],
[1, 5],
[2, 3],
[2, 4],
[2, 5]])
# indices.shape = (9, 2)
From there, we construct one-hot-encodings of the indices using torch.nn.functional.one_hot:
>>> from torch.nn.functional import one_hot
>>> mask = one_hot(indices).float()
>>> mask
tensor([[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]]])
# mask.shape = (9, 2, 6)
Finally, we compute the matrix multiplication and reshape it to the final form:
>>> (mask @ ST).reshape(3, 3, 4, 1)
tensor([[[[0.7792],
[0.0095],
[1.0000],
[1.0000]],
[[0.7792],
[0.0095],
[1.0000],
[1.0000]],
[[0.7792],
[0.0095],
[1.0000],
[1.0000]]],
[[[0.1893],
[0.8159],
[1.0000],
[1.0000]],
[[0.1893],
[0.8159],
[1.0000],
[1.0000]],
[[0.1893],
[0.8159],
[1.0000],
[1.0000]]],
[[[0.0680],
[0.7194],
[1.0000],
[1.0000]],
[[0.0680],
[0.7194],
[1.0000],
[1.0000]],
[[0.0680],
[0.7194],
[1.0000],
[1.0000]]]])
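The steps above can be collected into one runnable sketch and checked against the loop from the question. The variable names B, n and f are introduced here for illustration only, and the sketch assumes S and T have the same shape.

```python
import torch
from itertools import product
from torch.nn.functional import one_hot

S = torch.rand((3, 2, 1))
T = torch.ones((3, 2, 1))
B, n, f = S.shape  # batch size, rows per element, columns per element

# Flatten both batches into one axis of 2*B rows
ST = torch.stack((S, T)).reshape(2 * B, n * f)

# Every (i, j) pair, where j is offset by B to address the T rows of ST
indices = torch.tensor(list(product(range(B), range(B, 2 * B))))
mask = one_hot(indices, num_classes=2 * B).to(ST.dtype)  # (B*B, 2, 2*B)

# Each one-hot row selects one row of ST
C = (mask @ ST).reshape(B, B, 2 * n, f)

# Reference result from the nested loop
expected = torch.empty(B, B, 2 * n, f)
for i in range(B):
    for j in range(B):
        expected[i, j] = torch.cat((S[i], T[j]))

print(C.shape, torch.allclose(C, expected))
```

Note that the mask has B*B*2*2*B entries, so memory grows cubically with the batch size.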
I initially went with torch.einsum: torch.einsum('bf,pib->pif', ST, mask). But I later realized that bf,pib->pif reduces nicely to a simple torch.Tensor.matmul operation if we switch the two operands, i.e. pib,bf->pif (subscript b is reduced in the middle).
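As a quick sanity check of that equivalence, the sketch below compares the two einsum spellings with the plain matmul on random tensors of the same shapes as ST and mask. The random inputs are placeholders, not the tensors from the example.

```python
import torch

torch.manual_seed(0)
ST = torch.rand(6, 2)        # (b, f)
mask = torch.rand(9, 2, 6)   # (p, i, b)

via_einsum = torch.einsum('bf,pib->pif', ST, mask)
via_einsum_swapped = torch.einsum('pib,bf->pif', mask, ST)
via_matmul = mask @ ST       # contracts the last axis of mask with the first of ST

print(torch.allclose(via_einsum, via_matmul))
print(torch.allclose(via_einsum_swapped, via_matmul))
```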
In NumPy, np.meshgrid is used for this.
https://stackoverflow.com/a/35608701/3259896
So in PyTorch, it would be
torch.stack(
    torch.meshgrid(x, y)
).T.reshape(-1,2)
where x and y are your two lists (as 1-D tensors). You can use any number of them: x, y, z, etc.
You then reshape to the number of lists you use.
So if you used three lists, use .reshape(-1,3), for four use .reshape(-1,4), etc.
So for 5 tensors, use
torch.stack(
    torch.meshgrid(a, b, c, d, e)
).T.reshape(-1,5)
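One way to connect this to the batched tensors from the question is to build the grid over batch indices rather than over values, then index S and T with the resulting pairs. This is a variant of the snippet above, not the same code: it assumes a PyTorch version where torch.meshgrid accepts the indexing argument, and it stacks along the last axis instead of using .T.

```python
import torch

S = torch.rand((3, 2, 1))
T = torch.ones((3, 2, 1))

# Grid of batch indices; "ij" keeps i as the slow axis and j as the fast one
i, j = torch.meshgrid(
    torch.arange(S.shape[0]), torch.arange(T.shape[0]), indexing="ij"
)
pairs = torch.stack((i, j), dim=-1).reshape(-1, 2)  # (9, 2)

# Gather the paired elements and concatenate them along the row axis
C = torch.cat((S[pairs[:, 0]], T[pairs[:, 1]]), dim=1)
C = C.reshape(S.shape[0], T.shape[0], -1, S.shape[2])

print(C.shape)
```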
I'm training a CNN architecture to solve a regression problem using PyTorch where my output is a tensor of 25 values. The input/target tensor could be either all zeros or a gaussian distribution with a sigma value of 2. Here is an example of a 4-sample batch:
[[0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534, 0.043937, 0.011109, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534, 0.043937, 0.011109, 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534 ],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
My question is how to design a loss function so that the model effectively learns the regression output with 25 values.
I have tried 2 types of loss, torch.nn.MSELoss() and torch.nn.MSELoss()-torch.nn.CosineSimilarity(). They sort of work. However, sometimes the network has difficulty converging, especially when there are a lot of samples with all "zeros", which leads the network to output a vector with all 25 small values.
My question is, is there any other loss which we could try?
Your values do not seem widely different in scale so an MSELoss seems like it would work fine. Your model could be collapsing because of the many zeros in your target.
You can always try torch.nn.L1Loss() (but I do not expect it to be much better than torch.nn.MSELoss())
I suggest that you instead try to predict the gaussian mean/mu, and later try to re-create the gaussian for each sample if you really need it.
So you have two alternatives if you choose to try this method.
Alt 1
A good alternative is to encode your target to look like a classification target. Each 25-element vector becomes a single value: the index where the original target == 1 (possible classes will be 0, 1, 2, ..., 24). We can then assign a sample that contains "only zeroes" to our last class, "25". So your target:
[[0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534, 0.043937, 0.011109, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534, 0.043937, 0.011109, 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534 ],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
becomes
[4,
10,
20,
25]
If you do this, then you can try the common torch.nn.CrossEntropyLoss().
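A minimal sketch of what that looks like, assuming the network's last layer emits 26 logits per sample (25 peak positions plus the "only zeroes" class). The random logits stand in for a real model output.

```python
import torch

n_values = 25
n_classes = n_values + 1          # extra class 25 means "all zeros"

torch.manual_seed(0)
logits = torch.randn(4, n_classes, requires_grad=True)  # placeholder model output
targets = torch.tensor([4, 10, 20, 25])                 # encoded batch from above

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(logits, targets)
loss.backward()

predicted_classes = logits.argmax(dim=1)  # values in 0..25, ready for decoding
print(loss.item(), predicted_classes)
```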
I do not know what your dataloader looks like but given a single sample in your original format, you can convert it to my proposed format with:
def encode(tensor):
    if tensor.sum() == 0:
        return len(tensor)
    return torch.argmax(tensor)
and back to a gaussian with:
def decode(value):
    n_values = 25
    zero = torch.zeros(n_values)
    if value == n_values:
        return zero
    # Create gaussian around value
    std = 2
    n = torch.arange(n_values) - value
    sig2 = 2*std**2
    gauss = torch.exp(-n**2 / sig2)
    # Only keep a window around the peak (6 values before it, 6 after)
    start_ix = max(value-6, 0)
    end_ix = min(value+7,n_values)
    zero[start_ix:end_ix] = gauss[start_ix:end_ix]
    return zero
(Note I have not tried them with batches, only samples)
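For batches, the encoding step can be vectorized along these lines. This is a sketch: encode_batch and make_row are hypothetical helper names, and make_row builds an untruncated gaussian purely to have test data.

```python
import torch

def encode_batch(batch):
    """Map a [Batch, 25] target to class indices; all-zero rows get class 25."""
    peak = batch.argmax(dim=1)
    is_empty = batch.sum(dim=1) == 0
    empty_class = torch.full_like(peak, batch.shape[1])
    return torch.where(is_empty, empty_class, peak)

def make_row(mu, n_values=25, std=2):
    # Illustrative data only: gaussian centred on mu, without the window cut
    n = torch.arange(n_values)
    return torch.exp(-(n - mu) ** 2 / (2 * std ** 2))

batch = torch.stack((make_row(4), make_row(10), make_row(20), torch.zeros(25)))
print(encode_batch(batch))  # tensor([ 4, 10, 20, 25])
```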
Alt 2
The second option is to change your regression targets (still only the argmax positions (mu)) to a nicer regression value in the range 0-1 and have a separate neuron that outputs a "mask value" (also 0-1). Then your batch of:
[[0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534, 0.043937, 0.011109, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534, 0.043937, 0.011109, 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.13534, 0.32465, 0.60653, 0.8825, 1.0000, 0.88250,0.60653, 0.32465, 0.13534 ],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]
becomes
# [Mask, mu]
[
[1, 0.1666], # True, 4/24
[1, 0.4166], # True, 10/24
[1, 0.8333], # True, 20/24
[0, 0] # False, undefined
]
If you are using this setup, then you should be able to use an MSELoss with modification:
def custom_loss(input, target):
    # Assume target and input are of shape [Batch, 2], ordered as [mask, mu]
    mask = target[...,0]
    mask_loss = torch.nn.functional.mse_loss(input[...,0], target[...,0])
    mu_loss = torch.nn.functional.mse_loss(mask*input[...,1], mask*target[...,1])
    return (mask_loss + mu_loss) / 2
This loss only looks at the 2nd value (mu) if the mask of the target is 1. Otherwise it only tries to optimize for the correct mask.
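The masking behaviour can be checked with a small sketch. It re-declares the loss under the hypothetical name masked_loss with the [mask, mu] column order, so that it runs on its own, and the tensors are made-up values.

```python
import torch
import torch.nn.functional as F

def masked_loss(pred, target):
    # Columns are [mask, mu]; the target mask gates the mu term
    mask = target[..., 0]
    mask_loss = F.mse_loss(pred[..., 0], target[..., 0])
    mu_loss = F.mse_loss(mask * pred[..., 1], mask * target[..., 1])
    return (mask_loss + mu_loss) / 2

target = torch.tensor([[1.0, 4 / 24], [0.0, 0.0]])

# Same mask prediction, different mu on the masked-out (second) sample
pred_a = torch.tensor([[1.0, 4 / 24], [0.0, 0.9]])
pred_b = torch.tensor([[1.0, 4 / 24], [0.0, 0.1]])
# Wrong mu on the sample whose mask is 1
pred_c = torch.tensor([[1.0, 20 / 24], [0.0, 0.0]])

print(masked_loss(pred_a, target).item())  # 0.0
print(masked_loss(pred_b, target).item())  # 0.0
print(masked_loss(pred_c, target).item() > 0)  # True
```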
To encode to this format you would use:
def encode(tensor):
    n_values = 25
    if tensor.sum() == 0:
        return torch.tensor([0., 0.])
    # Return [mask, mu], with mu scaled to the range 0-1
    mu = torch.argmax(tensor) / (n_values-1)
    return torch.stack((torch.tensor(1.), mu))
and to decode:
def decode(tensor):
    n_values = 25
    # Parse values
    mask, value = tensor
    mask = torch.round(mask)
    # Convert to a plain int so it can be used as a slice index
    value = int(torch.round((n_values-1)*value))
    zero = torch.zeros(n_values)
    if mask == 0:
        return zero
    # Create gaussian around value
    std = 2
    n = torch.arange(n_values) - value
    sig2 = 2*std**2
    gauss = torch.exp(-n**2 / sig2)
    # Only keep a window around the peak (6 values before it, 6 after)
    start_ix = max(value-6, 0)
    end_ix = min(value+7,n_values)
    zero[start_ix:end_ix] = gauss[start_ix:end_ix]
    return zero
I have been trying to implement a custom batch normalization function such that it can be extended to the multi-GPU version, in particular the DataParallel module in PyTorch. The custom batchnorm works fine on 1 GPU, but when extended to 2 or more, the running mean and variance are updated inside the forward function, yet once control returns from the network they are back at their initial values of 0 and 1.
The torch.nn.DataParallel documentation mentions in its warning section that "In each forward, module is replicated on each device, so any updates to the running module in forward will be lost. For example, if module has a counter attribute that is incremented in each forward, it will always stay at the initial value because the update is done on the replicas which are destroyed after forward." But I am not really sure how to retain the mean and variance from the default device.
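The difference between rebinding an attribute and updating a tensor in place, which is what the warning hinges on, can be illustrated on CPU without DataParallel. In this sketch the Counter module is hypothetical, and the variable holding the original buffer stands in for the tensor that the wrapped module keeps while a replica runs.

```python
import torch
import torch.nn as nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("count", torch.zeros(1))

    def step_rebind(self):
        # Creates a new tensor and points the attribute at it
        self.count = self.count + 1

    def step_inplace(self):
        # Writes into the existing tensor's storage
        self.count.copy_(self.count + 1)

m1 = Counter()
held_1 = m1.count          # reference to the original buffer tensor
m1.step_rebind()
rebind_seen = held_1.item()

m2 = Counter()
held_2 = m2.count
m2.step_inplace()
inplace_seen = held_2.item()

print(rebind_seen)   # 0.0: the original tensor never saw the update
print(inplace_seen)  # 1.0: the update landed in the original storage
```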
I have provided code with the result obtained during multi GPU training. This code utilizes the Batchnorm provided here.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
from torch.nn.parameter import Parameter
class ptrblck_BatchNorm2d(nn.BatchNorm2d):
    def __init__(self, num_features, eps=1e-5, momentum=0.1,
                 affine=True, track_running_stats=True):
        super(ptrblck_BatchNorm2d, self).__init__(
            num_features, eps, momentum, affine, track_running_stats)
    def forward(self, input):
        self._check_input_dim(input)
        exponential_average_factor = 0.0
        if self.training and self.track_running_stats:
            if self.num_batches_tracked is not None:
                self.num_batches_tracked += 1
                if self.momentum is None:  # use cumulative moving average
                    exponential_average_factor = 1.0 / float(self.num_batches_tracked)
                else:  # use exponential moving average
                    exponential_average_factor = self.momentum
        # calculate running estimates
        if self.training:
            mean = input.mean([0, 2, 3])
            # use biased var in train
            var = input.var([0, 2, 3], unbiased=False)
            n = input.numel() / input.size(1)
            with torch.no_grad():
                self.running_mean = exponential_average_factor * mean\
                    + (1 - exponential_average_factor) * self.running_mean
                # update running_var with unbiased var
                self.running_var = exponential_average_factor * var * n / (n - 1)\
                    + (1 - exponential_average_factor) * self.running_var
        else:
            mean = self.running_mean
            var = self.running_var
        input = (input - mean[None, :, None, None]) / (torch.sqrt(var[None, :, None, None] + self.eps))
        if self.affine:
            input = input * self.weight[None, :, None, None] + self.bias[None, :, None, None]
        return input
class net(nn.Module):
    def __init__(self):
        super(net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.bn1 = ptrblck_BatchNorm2d(64)
        print("==> printing bn1 mean when init")
        print(self.bn1.running_mean)
        print("==> printing bn1 when init")
        print(self.bn1.running_mean)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.classifier = nn.Linear(64, 10)
    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.pool(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        print("======================================================")
        print("==> printing bn1 running mean from NET during forward")
        print(net.module.bn1.running_mean)
        print("==> printing bn1 running mean from SELF. during forward")
        print(self.bn1.running_mean)
        print("==> printing bn1 running var from NET during forward")
        print(net.module.bn1.running_var)
        print("==> printing bn1 running mean from SELF. during forward")
        print(self.bn1.running_var)
        return x
# Data
print('==> Preparing data..')
transform_train = transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False, num_workers=2)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
# Model
print('==> Building model..')
net = net()
net = torch.nn.DataParallel(net).cuda()
print('Number of GPU {}'.format(torch.cuda.device_count()))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Training
def train(epoch):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.cuda(), targets.cuda()
        outputs = net(inputs)
        loss = criterion(outputs, targets)
        print("====================================================")
        print("==> printing bn1 running mean FROM net after forward")
        print(net.module.bn1.running_mean)
        print("==> printing bn1 running var FROM net after forward")
        print(net.module.bn1.running_var)
        break
        # optimizer.zero_grad()
        # loss.backward()
        # optimizer.step()
        # train_loss += loss.item()
        # _, predicted = outputs.max(1)
        # total += targets.size(0)
        # correct += predicted.eq(targets).sum().item()
        # break
for epoch in range(0, 1):
    train(epoch)
Result:
==> Preparing data..
Files already downloaded and verified
Files already downloaded and verified
==> Building model..
==> printing bn1 mean when init
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
==> printing bn1 when init
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Number of GPU 2
Epoch: 0
======================================================
==> printing bn1 running mean from NET during forward
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cuda:0')
==> printing bn1 running mean from SELF. during forward
tensor([ 0.0053, 0.0010, -0.0077, -0.0290, 0.0241, 0.0258, -0.0048, 0.0151,
-0.0133, 0.0080, 0.0197, -0.0042, -0.0188, 0.0233, 0.0310, -0.0230,
-0.0133, 0.0222, 0.0119, -0.0042, -0.0220, -0.0169, -0.0342, -0.0025,
0.0338, -0.0070, 0.0202, 0.0050, 0.0108, 0.0008, 0.0363, 0.0347,
-0.0106, 0.0082, 0.0128, 0.0074, 0.0111, -0.0030, -0.0089, 0.0070,
-0.0262, -0.0029, 0.0053, -0.0136, -0.0183, 0.0045, -0.0014, -0.0221,
0.0132, 0.0064, 0.0388, -0.0220, -0.0008, 0.0400, -0.0187, 0.0397,
-0.0131, -0.0176, 0.0035, 0.0055, -0.0270, 0.0066, -0.0149, 0.0135],
device='cuda:0')
==> printing bn1 running var from NET during forward
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
==> printing bn1 running mean from SELF. during forward
tensor([0.9665, 0.9073, 0.9220, 1.0947, 1.0687, 0.9624, 0.9252, 0.9131, 0.9066,
0.9536, 0.9258, 0.9203, 1.0359, 0.9690, 1.1066, 1.0636, 0.9135, 0.9644,
0.9373, 0.9846, 0.9696, 0.9454, 1.0459, 0.9245, 0.9778, 0.9709, 0.9352,
0.9995, 0.9657, 0.9510, 1.0943, 1.0171, 0.9298, 1.0747, 0.9341, 0.9635,
0.9978, 0.9303, 0.9261, 0.9137, 0.9569, 1.0066, 1.0463, 0.9955, 0.9621,
0.9172, 0.9836, 0.9817, 0.9086, 0.9576, 1.0905, 0.9861, 0.9661, 1.1773,
0.9345, 1.0904, 0.9133, 1.0660, 0.9164, 0.9058, 0.9446, 0.9225, 1.0914,
0.9292], device='cuda:0')
======================================================
==> printing bn1 running mean from NET during forward
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cuda:0')
==> printing bn1 running mean from SELF. during forward
tensor([-0.0020, 0.0002, -0.0103, -0.0426, 0.0386, 0.0311, -0.0059, 0.0151,
-0.0140, 0.0145, 0.0218, -0.0029, -0.0281, 0.0284, 0.0449, -0.0329,
-0.0107, 0.0278, 0.0135, -0.0123, -0.0260, -0.0214, -0.0423, -0.0035,
0.0410, -0.0097, 0.0276, 0.0102, 0.0197, -0.0001, 0.0483, 0.0451,
-0.0078, 0.0190, 0.0135, -0.0004, 0.0196, -0.0028, -0.0140, 0.0070,
-0.0332, -0.0110, 0.0151, -0.0210, -0.0226, 0.0074, -0.0088, -0.0314,
0.0125, -0.0003, 0.0505, -0.0312, 0.0086, 0.0544, -0.0245, 0.0528,
-0.0086, -0.0290, 0.0063, 0.0042, -0.0339, 0.0061, -0.0277, 0.0092],
device='cuda:1')
==> printing bn1 running var from NET during forward
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
==> printing bn1 running mean from SELF. during forward
tensor([0.9665, 0.9072, 0.9211, 1.0999, 1.0714, 0.9610, 0.9209, 0.9125, 0.9063,
0.9553, 0.9260, 0.9189, 1.0386, 0.9706, 1.1139, 1.0610, 0.9121, 0.9660,
0.9366, 0.9886, 0.9683, 0.9454, 1.0511, 0.9227, 0.9792, 0.9704, 0.9330,
0.9989, 0.9657, 0.9476, 1.1008, 1.0191, 0.9294, 1.0814, 0.9320, 0.9642,
1.0006, 0.9287, 0.9254, 0.9128, 0.9559, 1.0100, 1.0521, 0.9972, 0.9621,
0.9168, 0.9849, 0.9803, 0.9083, 0.9556, 1.0946, 0.9865, 0.9651, 1.1880,
0.9330, 1.0959, 0.9116, 1.0706, 0.9149, 0.9057, 0.9450, 0.9215, 1.0972,
0.9261], device='cuda:1')
====================================================
==> printing bn1 running mean FROM net after forward
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cuda:0')
==> printing bn1 running var FROM net after forward
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
How can I make sure that the running estimates of the default device are used? Currently, I am not working towards synchronized Batchnorm.
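Replacing
self.running_mean = (...)
with
self.running_mean.copy_(...)
did the job (and likewise for self.running_var). Plain assignment rebinds the attribute on the replica, which is destroyed after forward, whereas copy_ updates the buffer in place, and DataParallel guarantees that the replica on device[0] shares its parameter and buffer storage with the wrapped module.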
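A trimmed-down sketch of the layer with that change applied is shown below. The class name InPlaceStatsBatchNorm2d is made up for illustration, the sketch assumes momentum is not None, and it omits the num_batches_tracked bookkeeping. It runs on CPU and compares its running estimates against nn.BatchNorm2d rather than exercising DataParallel itself.

```python
import torch
import torch.nn as nn

class InPlaceStatsBatchNorm2d(nn.BatchNorm2d):
    def forward(self, x):
        if self.training:
            mean = x.mean([0, 2, 3])
            var = x.var([0, 2, 3], unbiased=False)  # biased var for normalizing
            n = x.numel() / x.size(1)
            with torch.no_grad():
                m = self.momentum  # assumed to be a float, not None
                # In-place writes keep the buffers' storage unchanged
                self.running_mean.copy_(m * mean + (1 - m) * self.running_mean)
                self.running_var.copy_(
                    m * var * n / (n - 1) + (1 - m) * self.running_var
                )
        else:
            mean, var = self.running_mean, self.running_var
        x = (x - mean[None, :, None, None]) / torch.sqrt(
            var[None, :, None, None] + self.eps
        )
        if self.affine:
            x = x * self.weight[None, :, None, None] + self.bias[None, :, None, None]
        return x

torch.manual_seed(0)
x = torch.randn(8, 4, 5, 5)
custom, builtin = InPlaceStatsBatchNorm2d(4), nn.BatchNorm2d(4)
ptr_before = custom.running_mean.data_ptr()
out_custom, out_builtin = custom(x), builtin(x)

print(torch.allclose(custom.running_mean, builtin.running_mean, atol=1e-6))
print(custom.running_mean.data_ptr() == ptr_before)
```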