If I call model.cuda() in PyTorch, where model is a subclass of nn.Module, and say I have four GPUs, how will it utilize the four GPUs, and how do I know which GPUs are being used?
If you have a custom module derived from nn.Module, then after model.cuda() all the model parameters (the model.parameters() iterator can show you these) will end up on your CUDA device.
To check where your parameters are, just print them (cuda:0 in my case):
import torch
import torch.nn as nn

class M(nn.Module):
    'custom module'
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)

m = M()
m.cuda()

for p in m.parameters():
    print(p)
# Parameter containing:
# tensor([[-0.0201, 0.0282, -0.0258, ..., 0.0056, 0.0146, 0.0220],
# [ 0.0098, -0.0264, 0.0283, ..., 0.0286, -0.0052, 0.0007],
# [-0.0036, -0.0045, -0.0227, ..., -0.0048, -0.0003, -0.0330],
# ...,
# [ 0.0217, -0.0008, 0.0029, ..., -0.0213, 0.0005, 0.0050],
# [-0.0050, 0.0320, 0.0013, ..., -0.0057, -0.0213, 0.0045],
# [-0.0302, 0.0315, 0.0356, ..., 0.0259, 0.0166, -0.0114]],
# device='cuda:0', requires_grad=True)
# Parameter containing:
# tensor([-0.0027, -0.0353, -0.0349, -0.0236, -0.0230, 0.0176, -0.0156, 0.0037,
# 0.0222, -0.0332], device='cuda:0', requires_grad=True)
You can also specify the device like this:
m.cuda('cuda:0')
With torch.cuda.device_count() you can check how many devices you have.
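For example, a quick sketch (assuming at least one GPU is visible) that lists the available devices:

import torch

n = torch.cuda.device_count()                  # number of CUDA devices PyTorch can see
for i in range(n):
    print(i, torch.cuda.get_device_name(i))    # index and name of each GPU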
To expand on prosti's answer: to split your computations among multiple GPUs you should use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel.
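As a minimal sketch (assuming model is an nn.Module like M above and several GPUs are visible), nn.DataParallel replicates the module on each device and splits every input batch along the batch dimension:

import torch
import torch.nn as nn

model = M()                              # custom module from the answer above
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)       # uses all visible GPUs by default
model.cuda()

x = torch.randn(64, 784).cuda()          # inputs go to the default device
out = model(x)                           # DataParallel scatters x, gathers out on cuda:0

For multi-node training, and generally for better performance, DistributedDataParallel is recommended over DataParallel, but it needs a process-group setup that is beyond this sketch.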
I have something like this:
model = BertModel.from_pretrained('bert-base-uncased', return_dict=True)
What exactly is this return_dict used for? What happens when it is True and what when it is False?
When instantiating a BertModel, the default output of the model when evaluating or predicting will be a tuple consisting of loss, logits, hidden_states and attentions.
predictions = model(ids_tensor)
print(predictions)
# MaskedLMOutput(loss=None, logits=tensor([[
# [ -0.2506, -5.6671, -5.1753, ..., -5.3228, -7.9154, -4.5786],
# [ -4.1528, -8.2391, -8.5691, ..., -8.4557, -8.2903, -10.1395],
# [-15.5995, -17.0001, -16.9896, ..., -14.1423, -15.6004, -15.8228],
# ...,
# [ 3.0180, -2.9339, -3.3522, ..., -4.1684, -4.9487, -1.7176],
# [-12.7654, -12.9510, -12.9151, ..., -10.5786, -11.1695, -9.6117],
# [ -4.0356, -9.7091, -9.5329, ..., -9.3969, -10.5371, -9.2839]]]),
# hidden_states=None, attentions=None)
If the argument return_dict is set to True, the output changes into a ModelOutput, as it is called in the HuggingFace documentation. This output consists of the elements last_hidden_state, hidden_states, pooler_output, past_key_values, attentions and cross_attentions. Hope I could be of help.
model = BertModel.from_pretrained('bert-base-uncased',return_dict=True)
predictions = model(ids_tensor)
print(predictions)
# BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[
# [ 0.0769, -0.0024, 0.0389, ..., -0.0489, 0.0484, 0.4760],
# [-0.1383, -0.3266, 0.2738, ..., -0.0745, 0.0224, 0.8426],
# [-0.4573, -0.0621, 0.4206, ..., 0.0188, 0.1578, 0.4477],
# ...,
# [ 0.7070, -0.1623, 0.4451, ..., -0.1530, 0.0902, 0.8289],
# [ 0.7154, 0.0767, -0.2292, ..., 0.2946, -0.5152, -0.2444],
# [ 0.3558, 0.1660, 0.0459, ..., 0.5960, -0.7525, -0.0851]]]),
# pooler_output=tensor([[-7.4716e-01, -1.4339e-01, …, 7.6550e-01]]),
# hidden_states=None,
# past_key_values=None,
# attentions=None,
# cross_attentions=None)
Source: Bert Documentation in Transformers - HuggingFace
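As a hedged illustration (using the predictions object from the snippet above), a ModelOutput can be accessed by attribute, by key, or by integer index, so code written for the tuple output keeps working:

last_hidden = predictions.last_hidden_state      # attribute access
same_tensor = predictions["last_hidden_state"]   # key access
also_same   = predictions[0]                     # index access, as with the tuple output
pooled      = predictions.pooler_output          # pooled representation of the [CLS] token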
I declare my RNN as
self.rnn = torch.nn.RNN(input_size=encoding_dim, hidden_size=1, num_layers=1, nonlinearity='relu')
Later
self.rnn.all_weights
# [[Parameter containing:
# tensor([[-0.8099, -0.9543, 0.1117, 0.6221, 0.5034, -0.6766, -0.3360, -0.1700,
#          -0.9361, -0.3428]], requires_grad=True), Parameter containing:
# tensor([[-0.1929]], requires_grad=True), Parameter containing:
# tensor([0.7881], requires_grad=True), Parameter containing:
# tensor([0.4320], requires_grad=True)]]

self.rnn.all_weights[0][0][0].values
# {RuntimeError} Could not run 'aten::values' with arguments from the 'CPU' backend. 'aten::values' is only available for these backends: [SparseCPU, Autograd, Profiler, Tracer].
Clearly I can see the values of the weights, but I cannot access them. The documentation says I need to specify requires_grad=True, but that does not work.
Is there a more elegant and usable way than self.rnn.all_weights[0][0][0]?
Use torch.nn.Module.named_parameters or torch.nn.Module.parameters.
>>> import torch.nn as nn
>>> model = nn.RNN(input_size=encoding_dim, hidden_size=1, num_layers=1, nonlinearity='relu')
>>> weights = []
>>> for name, parameter in model.named_parameters():
...     weights.append({name: parameter[0]})
...
>>> just_weights = []
>>> for parameter in model.parameters():
...     just_weights.append(parameter[0])
...
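Alternatively, since nn.RNN registers its parameters as named attributes, you can read them directly; a small sketch (assuming the single-layer RNN from the question), with detach() giving a plain tensor outside the autograd graph:

w_ih = model.weight_ih_l0     # input-to-hidden weights, shape (hidden_size, input_size)
w_hh = model.weight_hh_l0     # hidden-to-hidden weights
b_ih = model.bias_ih_l0       # input-to-hidden bias
b_hh = model.bias_hh_l0       # hidden-to-hidden bias

w_ih_values = w_ih.detach()   # plain tensor, no grad_fn, safe to copy or convert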
I am training a dynamic neural network, meaning that each epoch I tweak the architecture and get a different computational graph.
I want to plot the graph for each epoch using tensorboard, but when I use SummaryWriter.add_graph() at the end of each epoch it simply overwrites the previous one.
Any ideas how to plot several graphs using pytorch + tensorboard? It seems achievable as each graph has a “tag” but I found no option to change this tag to plot several of them.
Thanks,
Elad
If you still want to use SummaryWriter, there is the option of using the method add_scalars:
Example:
summary.add_scalars(f'loss/check_info', {
    'score': score[iteration],
    'score_nf': score_nf[iteration],
}, iteration)
Instead of using the "tag" feature, you can use the "run" feature.
To do so, you have to open tensorboard from a directory within which you stored your summaries in distinct sub-directories.
In your example, you could save the summary of the first epoch at the directory "tensorboard_log_dir/epoch_1", and then save the summary of the second epoch at the directory "tensorboard_log_dir/epoch_2", etc.
That way, when using tensorboard --logdir=tensorboard_log_dir, you'll be able to switch from one computational graph to another via the "run" widget.
Here's a reproducible example:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter
dummy_input = (torch.zeros(1, 3),)
# Two different architectures (PyTorch)
class oneLinear(nn.Module):
    def __init__(self):
        super(oneLinear, self).__init__()
        self.l1 = nn.Linear(3, 5)

    def forward(self, x):
        x = self.l1(x)
        return x

class twoLinear(nn.Module):
    def __init__(self):
        super(twoLinear, self).__init__()
        self.l1 = nn.Linear(3, 5)
        self.l2 = nn.Linear(5, 5)

    def forward(self, x):
        x = self.l1(x)
        x = F.relu(self.l2(x))
        return x

# add graph into 2 distinct subdirectories
with SummaryWriter('./tensorboard_log_dir/oneLinear') as w:
    w.add_graph(oneLinear(), dummy_input)

with SummaryWriter('./tensorboard_log_dir/twoLinear') as w:
    w.add_graph(twoLinear(), dummy_input)
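Continuing the example above, the same idea applied to the per-epoch case from the question would write each epoch's graph into its own run directory (oneLinear and twoLinear stand in for whatever architecture you build in a given epoch):

architectures = [oneLinear, twoLinear]      # stand-ins for the per-epoch models
for epoch, arch in enumerate(architectures, start=1):
    with SummaryWriter(f'./tensorboard_log_dir/epoch_{epoch}') as w:
        w.add_graph(arch(), dummy_input)
# tensorboard --logdir=tensorboard_log_dir then shows epoch_1, epoch_2, ... as separate runs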
I want to use the output variables of a neural network as inputs to another function, but I get an error like this: 'Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment'. The output variables require gradients.
I tried converting the output variables to numpy values, but in that case backpropagation does not work, because it sees numpy values as variables that do not need gradients.
output = model(SOC[13])

# Three output values of the NN
Rs = output[0]
R1 = output[1]
C1 = output[2]

# Using these variables in another function
num = [Rs*R1*C1, R1+Rs]
den = [C1*R1, 1]
G = control.tf(num, den)
It should work, but it gives this error:
14 num=[Rs*R1*C1,R1+Rs]
15 den=[C1*R1,1]
---> 16 G = control.tf(num,den)
~\Anaconda3\lib\site-packages\control\xferfcn.py in __init__(self, *args)
106
107 """
--> 108 args = deepcopy(args)
109 if len(args) == 2:
110 # The user provided a numerator and a denominator.
~\Anaconda3\lib\site-packages\torch\tensor.py in __deepcopy__(self, memo)
16 def __deepcopy__(self, memo):
17 if not self.is_leaf:
---> 18 raise RuntimeError("Only Tensors created explicitly by the user "
19 "(graph leaves) support the deepcopy protocol at the moment")
20 if id(self) in memo:
In PyTorch, you can use the Tensor.detach() method:
new_tensor = _tensor_.detach()
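Applied to the question, a minimal sketch (assuming Rs, R1 and C1 are the model outputs shown above); note that detach() removes the values from the autograd graph, so gradients will not flow back through control.tf:

Rs_v = Rs.detach().item()     # plain Python float, safe to deepcopy
R1_v = R1.detach().item()
C1_v = C1.detach().item()

num = [Rs_v * R1_v * C1_v, R1_v + Rs_v]
den = [C1_v * R1_v, 1]
G = control.tf(num, den)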
I met a similar problem once. To be brief, the error is caused by deepcopy, which is not suitable for non-leaf nodes. You can print Rs, R1 and C1 to check whether they are leaf nodes.
If they are leaf nodes, the printout shows "requires_grad=True" and not "grad_fn=<SliceBackward>" or "grad_fn=<CopySlices>". Non-leaf nodes have a grad_fn, which is what propagates gradients backwards.
#---------------------------------------------------------------------------------
>>> import torch
>>> q = torch.nn.Parameter(torch.Tensor(3, 3))
>>> q
Parameter containing:
tensor([[8.7551e-37, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00]], requires_grad=True)
# q is a leaf node
>>> p = q[0, :]
>>> p
tensor([8.7551e-37, 0.0000e+00, 0.0000e+00], grad_fn=<SliceBackward>)
# p is a non-leaf node
>>> q[0, 0] = 0
>>> q
Parameter containing:
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], grad_fn=<CopySlices>)
# if a slice operation is made on q, q becomes a non-leaf node and deepcopy is no longer suitable for it
#---------------------------------------------------------------------------------
In my case, I was using .cuda() with Parameter.
I changed
self.x = torch.nn.Parameter(x, requires_grad=True).cuda()
to
self.x = torch.nn.Parameter(x, requires_grad=True)
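The reason is that .cuda() called on a Parameter returns a new, non-leaf tensor. If the data has to live on the GPU, here are two hedged alternatives that keep self.x a leaf (assuming x is a CPU tensor inside a module's __init__):

# move the data first, then wrap it as a Parameter (a Parameter is always a leaf)
self.x = torch.nn.Parameter(x.cuda(), requires_grad=True)

# or keep it on the CPU here and move the whole module later;
# model.cuda() moves registered parameters in place and keeps them leaves
self.x = torch.nn.Parameter(x, requires_grad=True)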
In my case, what helped was to get rid of the registered buffer in
class PositionalEncodingLearned(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dropout = nn.Dropout(p=config.dropout)
        pe = nn.Parameter(torch.randn(size=(config.bptt, 1, config.emsize)))
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        x = x + self.pe
        return self.dropout(x)
and change that to:
class PositionalEncodingLearned(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dropout = nn.Dropout(p=config.dropout)
        self.pe = nn.Parameter(torch.randn(size=(config.bptt, 1, config.emsize)))

    def forward(self, x: Tensor) -> Tensor:
        x = x + self.pe
        return self.dropout(x)
I used a dirty trick to solve this: save and load the tensor to/from disk, which creates another object:
def get_weights_copy(model):
    weights_path = 'weights_temp.pt'
    torch.save(model.state_dict(), weights_path)
    return torch.load(weights_path)
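A variant of the same trick that avoids touching the disk, serializing to an in-memory buffer instead (a sketch, not part of the original answer):

import io
import torch

def get_weights_copy(model):
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)   # serialize to memory instead of a file
    buffer.seek(0)
    return torch.load(buffer)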
I want to add word dropout to my network so that I can have sufficient training examples for training the embedding of the "unk" token. As far as I'm aware, this is standard practice. Let's assume the index of the unk token is 0, and the index for padding is 1 (we can switch them if that's more convenient).
This is a simple CNN network which implements word dropout the way I would have expected it to work:
class Classifier(nn.Module):
    def __init__(self, params):
        super(Classifier, self).__init__()
        self.params = params
        self.word_dropout = nn.Dropout(params["word_dropout"])
        self.pad = torch.nn.ConstantPad1d(max(params["window_sizes"])-1, 1)
        self.embedding = nn.Embedding(params["vocab_size"], params["word_dim"], padding_idx=1)
        self.convs = nn.ModuleList([nn.Conv1d(1, params["feature_num"], params["word_dim"] * window_size, stride=params["word_dim"], bias=False) for window_size in params["window_sizes"]])
        self.dropout = nn.Dropout(params["dropout"])
        self.fc = nn.Linear(params["feature_num"] * len(params["window_sizes"]), params["num_classes"])

    def forward(self, x, l):
        x = self.word_dropout(x)
        x = self.pad(x)
        embedded_x = self.embedding(x)
        embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"]) # [batch_size, 1, seq_len * word_dim]
        features = [F.relu(conv(embedded_x)) for conv in self.convs]
        pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, self.params["feature_num"]) for feat in features]
        pooled = torch.cat(pooled, 1)
        pooled = self.dropout(pooled)
        logit = self.fc(pooled)
        return logit
Don't mind the padding - PyTorch doesn't have an easy way of using non-zero padding in CNNs, much less trainable non-zero padding, so I'm doing it manually. Dropout also doesn't let me replace dropped values with anything other than zero, and I want to keep the padding token separate from the unk token. I'm keeping it in my example because it's the reason for this question's existence.
This doesn't work because dropout wants Float Tensors so that it can scale them properly, while my input is Long Tensors that don't need to be scaled.
Is there an easy way of doing this in pytorch? I essentially want to use LongTensor-friendly dropout (bonus: better if it will let me specify a dropout constant that isn't 0, so that I could use zero padding).
Actually I would do it outside of your model, before converting your input into a LongTensor.
This would look like this:
import random

def add_unk(input_token_id, p):
    # random.random() gives you a value between 0 and 1
    # to avoid switching your padding to 0 we add 'input_token_id > 1'
    if random.random() < p and input_token_id > 1:
        return 0
    else:
        return input_token_id

# then you have your input token_id
# for this example I take just a random number, let's say 127
input_token_id = 127

# let p be your probability for UNK
p = 0.01

your_input_tensor = torch.LongTensor([add_unk(input_token_id, p)])
Edit:
There are two options that come to my mind which are actually GPU-friendly. In general, both solutions should be much more efficient than the per-token approach above.
Option one - Doing computation directly in forward():
If you're not using torch.utils and don't have plans using it later this is probably the way to go.
Instead of doing the computation beforehand, we just do it in the forward() method of the main PyTorch class. However, I see no (simple) way of doing this in torch 0.3.1, so you would need to upgrade to version 0.4.0:
So imagine x is your input vector:
>>> x = torch.tensor(range(10))
>>> x
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
probs is a vector of uniform random values, which we can later check against our dropout probability:
>>> probs = torch.empty(10).uniform_(0, 1)
>>> probs
tensor([ 0.9793, 0.1742, 0.0904, 0.8735, 0.4774, 0.2329, 0.0074,
0.5398, 0.4681, 0.5314])
Now we apply the dropout probabilities probs on our input x:
>>> torch.where(probs > 0.2, x, torch.zeros(10, dtype=torch.int64))
tensor([ 0, 0, 0, 3, 4, 5, 0, 7, 8, 9])
Note: to see some effect I chose a dropout probability of 0.2 here. In reality you probably want it to be smaller.
You can pick any token / id you like for this; here is an example with 42 as the unknown token id:
>>> unk_token = 42
>>> torch.where(probs > 0.2, x, torch.empty(10, dtype=torch.int64).fill_(unk_token))
tensor([ 0, 42, 42, 3, 4, 5, 42, 7, 8, 9])
torch.where comes with PyTorch 0.4.0:
https://pytorch.org/docs/master/torch.html#torch.where
I don't know about the shapes of your network, but your forward() should look something like this then (when using mini-batching you need to flatten the input before applying dropout):
def forward_train(self, x, l):
    # probabilities
    probs = torch.empty(x.size(0)).uniform_(0, 1)
    # applying word dropout
    x = torch.where(probs > 0.02, x, torch.zeros(x.size(0), dtype=torch.int64))

    # continue like before ...
    x = self.pad(x)
    embedded_x = self.embedding(x)
    embedded_x = embedded_x.view(-1, 1, x.size()[1] * self.params["word_dim"]) # [batch_size, 1, seq_len * word_dim]
    features = [F.relu(conv(embedded_x)) for conv in self.convs]
    pooled = [F.max_pool1d(feat, feat.size()[2]).view(-1, self.params["feature_num"]) for feat in features]
    pooled = torch.cat(pooled, 1)
    pooled = self.dropout(pooled)
    logit = self.fc(pooled)
    return logit
Note: I named the function forward_train(), so you should use another forward() without dropout for evaluation / prediction. Alternatively, you could keep a single forward() and use an if condition on self.training (toggled by train() / eval()), as sketched below.
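A hedged sketch of that single-forward variant, gating the word dropout on self.training and creating the helper tensors on the same device as x (the word_dropout parameter and unk id 0 are taken from the question's assumptions):

def forward(self, x, l):
    if self.training:
        # word dropout only while training; model.eval() switches it off
        probs = torch.rand(x.shape, device=x.device)
        x = torch.where(probs > self.params["word_dropout"],
                        x,
                        torch.full_like(x, 0))      # 0 is the unk index
    # ... continue exactly like forward_train() above ...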
Option two: using torch.utils.data.Dataset:
If you're using the Dataset provided by torch.utils.data, it is very easy to do this kind of pre-processing efficiently. The DataLoader that wraps a Dataset can load samples with multiple worker processes, so the code sample above just has to be executed in the __getitem__ method of your Dataset class.
This could look like this:
def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    ID = self.input_tokens[index]

    # Load data and get label
    # using the add_unk function from the code above
    X = torch.LongTensor([add_unk(ID, p=0.01)])
    y = self.targets[index]

    return X, y
This is a bit out of context and doesn't look very elegant, but I think you get the idea. According to this blog post by Shervine Amidi at Stanford, it should be no problem to do more complex pre-processing steps in this function:
Since our code [the Dataset is meant] is designed to be multicore-friendly, note that you can do more complex operations instead (e.g. computations from source files) without worrying that data generation becomes a bottleneck in the training process.
The linked blog post, "A detailed example of how to generate your data in parallel with PyTorch", also provides a good guide for implementing data generation with Dataset and DataLoader.
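For completeness, a sketch of how such a Dataset could be consumed; TokenDataset and its dummy data are made up for illustration, and num_workers is what turns on the multi-process loading mentioned above:

import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    # hypothetical minimal Dataset; input_tokens and targets are lists of ints
    def __init__(self, input_tokens, targets):
        self.input_tokens = input_tokens
        self.targets = targets

    def __len__(self):
        return len(self.input_tokens)

    def __getitem__(self, index):
        ID = self.input_tokens[index]
        X = torch.LongTensor([add_unk(ID, p=0.01)])   # word dropout happens here, per worker
        y = self.targets[index]
        return X, y

loader = DataLoader(TokenDataset(list(range(2, 100)), [0] * 98),
                    batch_size=16, shuffle=True, num_workers=4)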
I guess you'll prefer option one - only two lines and it should be very efficient. :)
Good luck!