When training the discriminator D(), do we still use the target vectors [1,1,1,...] and [0,0,0,...], respectively, as the positive and negative targets for D() when using bce_logits_loss()?
Intention:
import torch
import torch.nn as nn
bce_logits_loss = nn.BCEWithLogitsLoss()
x = torch.ones(1000)*.5
ones = torch.ones(1000)
zeros = torch.zeros(1000)
err1 = bce_logits_loss(x, ones)
err2 = bce_logits_loss(x, zeros)
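If you are training your discriminator with binary cross entropy, then yes, you should use ones and zeros as targets. It essentially comes down to a two-class classifier whose target is either zero or one. BCEWithLogitsLoss only differs from BCELoss in that it applies the sigmoid internally, so the discriminator should output raw logits. You can refer to the PyTorch DCGAN tutorial, where they use ones and zeros as targets.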
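To make the effect of the two target vectors concrete, here is a minimal pure-Python sketch of the per-element loss (sigmoid followed by binary cross entropy, written in a numerically stable form). The helper name bce_with_logits is made up for illustration and is not part of PyTorch.

```python
import math

def bce_with_logits(logit, target):
    """Per-element binary cross entropy on a raw logit.

    Stable form: max(x, 0) - x * t + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

logit = 0.5  # same constant the question feeds to the loss
err_real = bce_with_logits(logit, 1.0)  # target 1: sample labelled as real
err_fake = bce_with_logits(logit, 0.0)  # target 0: sample labelled as fake
print(round(err_real, 4), round(err_fake, 4))  # prints 0.4741 0.9741
```

Since every element of x is the same, the mean over 1000 elements should equal these per-element values, so err1 and err2 in the question are expected to come out close to them.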
Suppose I need to build a network that takes two inputs:
A patient's information, represented as an array of features
Selected treatment, represented as one-hot encoded array
Now how do I build a network that outputs a 2D probability matrix A where A[i,j] represents the probability the patient will end up at state j under treatment i. Let's say there are n possible states, and under any treatment, the total probability of all n states sums up to 1.
I wanted to do this because I was motivated by a similar network, where the inputs are the same as above, but the output is a 1D array representing the expected lifetime after treatment i is delivered. Such a network is built as follows:
import tensorflow as tf
from tensorflow import keras

def default_dense(feature_shape, n_treatments):
    feature_input = keras.layers.Input(feature_shape)
    treatment_input = keras.layers.Input((n_treatments,))
    hidden_1 = keras.layers.Dense(16, activation='relu')(feature_input)
    hidden_2 = keras.layers.Dense(16, activation='relu')(hidden_1)
    # One predicted lifetime per treatment.
    output = keras.layers.Dense(n_treatments)(hidden_2)
    # Zero out every treatment except those selected in treatment_input.
    output_on_action = keras.layers.multiply([output, treatment_input])
    model = keras.models.Model([feature_input, treatment_input], output_on_action)
    model.compile(optimizer=tf.optimizers.Adam(0.001), loss='mse')
    return model
And the training is simply
model.fit(x = [features, encoded_treatments], y = encoded_treatments * lifetime[:, np.newaxis], verbose = 0)
This is super handy because when predicting, I can use np.ones() as the encoded_treatments, and the network gives expected lifetimes under all treatments, thus choosing the best one is one-step. Certainly I can create multiple networks, each for a treatment, but it would be much less efficient.
Now the question is, can I do the same for a probability output?
I have figured it out myself. The trick is to use RepeatVector() and Permute() layers to generate a matrix mask for treatments.
The output is an element-wise Multiply() of the mask and a Softmax() of same size.
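A rough sketch of that idea, written in plain NumPy rather than Keras so the shapes are easy to follow. It assumes the network produces logits of shape (batch, n_treatments, n_states); the function name masked_state_probabilities is hypothetical, and the comments only indicate which Keras layer each step stands in for.

```python
import numpy as np

def masked_state_probabilities(logits, treatments):
    """logits: (batch, n_treatments, n_states); treatments: (batch, n_treatments)."""
    n_states = logits.shape[-1]
    # Softmax over the state axis, so every treatment row sums to 1.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # RepeatVector(n_states) analogue: (batch, n_states, n_treatments).
    repeated = np.repeat(treatments[:, np.newaxis, :], n_states, axis=1)
    # Permute((2, 1)) analogue: (batch, n_treatments, n_states).
    mask = repeated.transpose(0, 2, 1)
    # Multiply() analogue: keep only the rows of the selected treatments.
    return probs * mask

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 4))  # 2 patients, 3 treatments, 4 states
one_hot = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
selected = masked_state_probabilities(logits, one_hot)
all_rows = masked_state_probabilities(logits, np.ones((2, 3)))
```

With a one-hot treatment only the chosen row is non-zero and sums to 1; with np.ones() as the treatment input every row is a full distribution, which mirrors the prediction trick from the question.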
I want to use Python 3 to build a zero-inflated Poisson model. I found the class statsmodels.discrete.count_model.ZeroInflatedPoisson in the statsmodels library.
I just wonder how to use it. It seems I should do:
ZIFP(Y_train,X_train).fit().
But when I tried to predict using X_test, it told me that the length of X_test doesn't fit X_train.
Or is there another package to fit this model?
Here is the code I used:
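import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.discrete.count_model import ZeroInflatedPoisson

x1 = [random.randint(0, 1) for i in range(200)]
x2 = [random.randint(1, 2) for i in range(200)]
# 100 Poisson draws followed by 100 extra zeros
y = np.random.poisson(lam=2, size=100).tolist()
y.extend([0] * 100)
df = pd.DataFrame()
df['x1'] = x1
df['x2'] = x2
df['y'] = y
df_x = df.iloc[:, :-1]
x_train, x_test, y_train, y_test = train_test_split(df_x, df['y'], test_size=0.3)
clf = ZeroInflatedPoisson(endog=y_train, exog=x_train).fit()
clf.predict(x_test)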
ValueError: operands could not be broadcast together with shapes (140,) (60,)
I also tried:
clf.predict(x_test,exog = np.ones(len(x_test)))
ValueError: shapes(60,) and (1,) not aligned: 60 (dim 0) != 1 (dim 0)
This looks like a bug to me.
As far as I can see:
If there are no explanatory variables, exog_infl, specified for the inflation model, then an array of ones is used to model a constant inflation probability.
However, if exog_infl in predict is None, then it uses the model.exog_infl which is an array of ones with the length equal to the training sample.
As a workaround, specifying a 1-D array of ones of the correct length in predict should work.
Try:
clf.predict(test_x, exog_infl=np.ones(len(test_x)))
I guess the same problem will occur if exposure was used in the model, but is not explicitly specified in predict.
I ran into the same problem, landing me on this thread. As noted by Josef, it seems like you need to provide exog_infl with a 1-D array of ones of correct length to work.
However, the code Josef provided creates an array of shape (n,) rather than a column of shape (n, 1), so the full line required to generate the required array is actually
clf.predict(test_x, exog_infl=np.ones((len(test_x), 1)))
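The difference between the two shapes can be reproduced with plain NumPy. This is only a stand-in for whatever statsmodels does internally (I have not traced its source); params_infl and its value are hypothetical and represent a single constant inflation coefficient.

```python
import numpy as np

n_test = 60
params_infl = np.array([0.3])  # hypothetical: one coefficient for the constant

flat_ones = np.ones(n_test)         # shape (60,)
column_ones = np.ones((n_test, 1))  # shape (60, 1)

# A flat vector of length 60 cannot be dotted with a length-1 vector.
try:
    np.dot(flat_ones, params_infl)
    flat_ok = True
except ValueError:
    flat_ok = False

# A (60, 1) column works and gives one linear predictor per test row.
linear_part = np.dot(column_ones, params_infl)  # shape (60,)
```

The failing case raises the same kind of "not aligned" ValueError quoted in the question, which is consistent with the column shape being what predict needs.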
This release of PyTorch seems to provide PackedSequence for variable-length inputs to recurrent neural networks. However, I found it a bit hard to use correctly.
Using pad_packed_sequence to recover the output of an RNN layer that was fed by pack_padded_sequence, we get a T x B x N tensor of outputs, where T is the maximum number of time steps, B is the batch size and N is the hidden size. I found that for the short sequences in the batch, the outputs after the end of the sequence are all zeros.
Here are my questions.
For a single-output task, where one needs the last output of every sequence, a simple outputs[-1] gives a wrong result, since this tensor contains lots of zeros for the short sequences. One needs to construct indices from the sequence lengths to fetch the individual last output of each sequence. Is there a simpler way to do that?
For a multiple-output task (e.g. seq2seq), one usually adds a linear layer N x O, reshapes the batch outputs T x B x O into TB x O and computes the cross entropy loss against the TB true targets (usually integers in a language model). In this situation, do these zeros in the batch output matter?
Question 1 - Last Timestep
This is the code that I use to get the output of the last timestep. I don't know if there is a simpler solution. If there is, I'd like to know it. I followed this discussion and grabbed the relevant code snippet for my last_timestep method. This is my forward.
class BaselineRNN(nn.Module):
    def __init__(self, **kwargs):
        ...

    def last_timestep(self, unpacked, lengths):
        # Index of the last output for each sequence.
        idx = (lengths - 1).view(-1, 1).expand(unpacked.size(0),
                                               unpacked.size(2)).unsqueeze(1)
        return unpacked.gather(1, idx).squeeze()

    def forward(self, x, lengths):
        embs = self.embedding(x)
        # pack the batch
        packed = pack_padded_sequence(embs, list(lengths.data),
                                      batch_first=True)
        out_packed, (h, c) = self.rnn(packed)
        out_unpacked, _ = pad_packed_sequence(out_packed, batch_first=True)
        # get the outputs from the last *non-masked* timestep for each sentence
        last_outputs = self.last_timestep(out_unpacked, lengths)
        # project to the classes using a linear layer
        logits = self.linear(last_outputs)
        return logits
Question 2 - Masked Cross Entropy Loss
Yes, by default the zero padded timesteps (targets) matter. However, it is very easy to mask them. You have two options, depending on the version of PyTorch that you use.
PyTorch 0.2.0: PyTorch now supports masking directly in CrossEntropyLoss, with the ignore_index argument. For example, in language modeling or seq2seq, where I add zero padding, I mask the zero-padded words (targets) simply like this:
loss_function = nn.CrossEntropyLoss(ignore_index=0)
PyTorch 0.1.12 and older: In the older versions of PyTorch, masking was not supported, so you had to implement your own workaround. A solution that I used was masked_cross_entropy.py, by jihunchoi. You may also be interested in this discussion.
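For intuition, here is a small NumPy sketch of what masking the padded targets amounts to: the loss is averaged only over positions whose target differs from the padding index. It is an illustration under that assumption, not the PyTorch implementation, and the function name masked_cross_entropy is used here only as a label.

```python
import numpy as np

def masked_cross_entropy(logits, targets, ignore_index=0):
    """logits: (N, C) scores; targets: (N,) integer class ids."""
    # Log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Log-probability assigned to the true class at each position.
    picked = log_probs[np.arange(len(targets)), targets]
    # Positions whose target is the padding index contribute nothing.
    keep = targets != ignore_index
    return -(picked * keep).sum() / keep.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))
targets = np.array([2, 1, 3, 0, 0, 0])  # the last three positions are padding
loss = masked_cross_entropy(logits, targets)
unpadded = masked_cross_entropy(logits[:3], targets[:3])
```

Here loss and unpadded agree, so whatever the network outputs at the padded positions has no influence on the result.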
A few days ago, I found this method which uses indexing to accomplish the same task with a one-liner.
I have my dataset batch first ([batch size, sequence length, features]), so for me:
unpacked_out = unpacked_out[np.arange(unpacked_out.shape[0]), lengths - 1, :]
where unpacked_out is the output of torch.nn.utils.rnn.pad_packed_sequence.
I have compared it with the method described here, which looks similar to the last_timestep() method Christos Baziotis is using above (also recommended here), and the results are the same in my case.
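A tiny self-contained check of the indexing one-liner, using a NumPy array in place of the RNN output. The padded tensor below is fabricated for illustration: the value at (b, t, :) is 10*b + t inside each sequence and zero after its end.

```python
import numpy as np

batch, max_len, hidden = 3, 5, 2
lengths = np.array([5, 3, 1])

# Fake padded output, batch first: zeros past each sequence's length.
out = np.zeros((batch, max_len, hidden))
for b, n in enumerate(lengths):
    out[b, :n, :] = (10 * b + np.arange(n))[:, np.newaxis]

# One row per sequence, taken at its own last valid timestep.
last = out[np.arange(batch), lengths - 1, :]  # shape (batch, hidden)

# Naive approach: last column of the padded tensor, mostly padding.
naive = out[:, -1, :]
```

The indexed version picks timesteps 4, 2 and 0 for the three sequences, whereas the naive slice returns zeros for every sequence shorter than the maximum length.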
I want to make a conv network and I wish to use the ReLU activation function. Can someone please give me a clue about the correct way to initialize the weights? (I'm using Theano.)
Thanks
I'm not sure there is a hard and fast best way to initialize weights and bias for a ReLU layer.
Some claim that (a slightly modified version of) Xavier initialization works well with ReLUs. Others recommend small Gaussian random weights plus bias=1 (ensuring the weighted sum of positive inputs will remain positive and thus not end up in the ReLU's zero region).
In Theano, these can be achieved like this (assuming weights post-multiply the input):
import numpy
import theano
w = theano.shared((numpy.random.randn(in_size, out_size) * 0.1).astype(theano.config.floatX))
b = theano.shared(numpy.ones(out_size, dtype=theano.config.floatX))
or
w = theano.shared((numpy.random.randn(in_size, out_size) * numpy.sqrt(2.0 / (in_size + out_size))).astype(theano.config.floatX))
b = theano.shared(numpy.zeros(out_size, dtype=theano.config.floatX))
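If the modified Xavier scheme meant above is the one usually called He initialization (this is an assumption on my part), the only change is that the standard deviation depends on the fan-in alone, sqrt(2 / in_size). A NumPy-only sketch, with the helper name he_normal chosen for illustration:

```python
import numpy as np

def he_normal(in_size, out_size, rng):
    """Gaussian weights with std sqrt(2 / fan_in); weights post-multiply the input."""
    std = np.sqrt(2.0 / in_size)
    return rng.normal(0.0, std, size=(in_size, out_size))

rng = np.random.default_rng(0)
w = he_normal(512, 256, rng)
b = np.zeros(256)  # zero bias, as in the second variant above
```

The resulting array could be wrapped in theano.shared after casting to theano.config.floatX, in the same way as the snippets above.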
I just applied the log loss in sklearn for logistic regression: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
My code looks something like this:
def perform_cv(clf, X, Y, scoring):
    kf = KFold(X.shape[0], n_folds=5, shuffle=True)
    kf_scores = []
    for train, _ in kf:
        X_sub = X[train,:]
        Y_sub = Y[train]
        #Apply 'log_loss' as a loss function
        scores = cross_validation.cross_val_score(clf, X_sub, Y_sub, cv=5, scoring='log_loss')
        kf_scores.append(scores.mean())
    return kf_scores
However, I'm wondering why the resulting logarithmic losses are negative. I'd expect them to be positive since in the documentation (see my link above) the log loss is multiplied by a -1 in order to turn it into a positive number.
Am I doing something wrong here?
Yes, this is supposed to happen. It is not a 'bug' as others have suggested. The actual log loss is simply the positive version of the number you're getting.
SK-Learn's unified scoring API always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.
This is also described in "sklearn GridSearchCV with Pipeline" and in "scikit-learn cross validation, negative values with mean squared error". A similar discussion can be found here.
In this way, a higher score means better performance (less loss).
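The sign convention can be illustrated without scikit-learn. The sketch below computes a plain binary log loss and then negates it, which is the idea behind the scorer; the function names and the example probabilities are made up for illustration.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; always non-negative, lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep the logs finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def neg_log_loss_score(y_true, y_prob):
    """Negated loss, so that a higher score means a better model."""
    return -log_loss(y_true, y_prob)

y_true = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]  # confident and mostly right
bad = [0.6, 0.5, 0.4, 0.5]   # close to guessing

good_score = neg_log_loss_score(y_true, good)
bad_score = neg_log_loss_score(y_true, bad)
```

Both scores are negative, and the better model gets the score closer to zero, so maximizing the score is the same as minimizing the loss.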
I cross-checked the sklearn implementation with several other methods. It seems to be an actual bug within the framework. Instead, consider the following code for calculating the log loss:
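import numpy as np

def llfun(act, pred):
    # NumPy is used here because the sp.maximum / sp.log style aliases
    # are not available in recent SciPy releases.
    epsilon = 1e-15
    # Clip predictions away from 0 and 1 so the logs stay finite.
    pred = np.clip(pred, epsilon, 1 - epsilon)
    ll = np.sum(act * np.log(pred) + (1 - act) * np.log(1 - pred))
    ll = ll * -1.0 / len(act)
    return ll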
Also take into account that act and pred have to be Nx1 column vectors.