I'm trying to add L1 and L2 regularization to my loss function, but I can't get it to work.
My code:
criterion = nn.NLLLoss() + nn.L1Loss()
Ideally it would be something like this:
criterion = nn.NLLLoss() + _lambda * nn.L1Loss()
How can I do it?
You need to instantiate both losses first and then add the results of calling them; each of them expects two arguments:
nll_loss = nn.NLLLoss()
l1_loss = nn.L1Loss()
loss = nll_loss(x, y) + _lambda * l1_loss(x, y)
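For completeness, here is a minimal self-contained sketch of the combined objective (the dummy tensors and the _lambda value are illustrative assumptions; note that NLLLoss expects log-probabilities plus class indices, while L1Loss compares two tensors of the same shape, so in a real model you would pass each loss the arguments appropriate to it):

import torch
import torch.nn as nn

nll_loss = nn.NLLLoss()
l1_loss = nn.L1Loss()
_lambda = 0.5  # arbitrary regularization weight for this sketch

# dummy data standing in for your model's output and targets
log_probs = torch.log_softmax(torch.randn(4, 3, requires_grad=True), dim=1)
y = torch.tensor([0, 2, 1, 0])   # class indices for NLLLoss
target = torch.randn(4, 3)       # same-shape tensor for L1Loss

loss = nll_loss(log_probs, y) + _lambda * l1_loss(log_probs, target)
loss.backward()  # gradients flow through both terms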
Related
Is there a more efficient way to compute the Jacobian? (There must be; this doesn't even run for a single batch.) I want to compute the loss as given in the self-explanatory neural network paper. The input has a shape of (32, 365, 3), where 32 is the batch size. The loss I want to minimize is Equation 3 of the paper.
I believe that I am not using the GradientTape optimally.
def compute_loss_theta(tape, parameter, concept, output, x):
    b = x.shape[0]
    in_dim = (x.shape[1], x.shape[2])
    feature_dim = in_dim[0] * in_dim[1]
    J = tape.batch_jacobian(concept, x)
    grad_fx = tape.gradient(output, x)
    grad_fx = tf.reshape(grad_fx, shape=(b, feature_dim))
    J = tf.reshape(J, shape=(b, feature_dim, feature_dim))
    parameter = tf.expand_dims(parameter, axis=1)
    loss_theta_matrix = grad_fx - tf.matmul(parameter, J)
    loss_theta = tf.norm(loss_theta_matrix)
    return loss_theta
for i in range(10):
    for x, y in train_dataset:
        with tf.GradientTape(persistent=True) as tape:
            tape.watch(x)
            parameter, concept, output = model(x)
            loss_theta = compute_loss_theta(tape, parameter, concept, output, x)
            loss_y = loss_object(y_true=y, y_pred=output)
            loss_value = loss_y + eps * loss_theta
        gradients = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
I want to write a custom loss function based on the y_true values. y_true is a binary value. For each mini-batch, I want to treat y_true==0 and y_true==1 differently. Currently, I have:
import tensorflow as tf
from tensorflow.keras import backend as K

def custom_loss(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    zero = tf.fill(tf.shape(y_true_f), 0.0)
    one = tf.fill(tf.shape(y_true_f), 1.0)
    mask_0 = tf.equal(y_true_f, zero)
    mask_1 = tf.equal(y_true_f, one)
    y_pred_1 = tf.boolean_mask(y_pred_f, mask_1)
    y_pred_0 = tf.boolean_mask(y_pred_f, mask_0)
    y_true_1 = tf.boolean_mask(y_true_f, mask_1)
    y_true_0 = tf.boolean_mask(y_true_f, mask_0)
    loss1 = K.binary_crossentropy(y_true_1, y_pred_1)
    loss0 = K.binary_crossentropy(y_true_0, y_pred_0)
    loss = loss1 + a * loss0  # a is an arbitrary number
    return loss
However, I get a nan loss. I suspect it's because I am training on imbalanced data where only a few cases have y_true==1, so whenever a mini-batch contains no y_true==1 samples the loss becomes nan. I want to add an if condition based on the shape of mask_1. How can I do that?
You can achieve this with the same technique used in the cross-entropy loss function. The combined loss is loss = (y_true * Loss1) + ((1 - y_true) * Loss2), so if y_true = 0, the first term vanishes and loss = (0 * Loss1) + ((1 - 0) * Loss2) = Loss2. If y_true = 1, the second term vanishes and loss = (1 * Loss1) + ((1 - 1) * Loss2) = Loss1.
This way you get two different loss functions depending on whether y_true is 0 or 1, without any masking.
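As a minimal sketch of this trick applied to the custom_loss from the question (assuming binary cross-entropy for both terms and the weight a from the question; any other per-element losses could be substituted):

import tensorflow as tf
from tensorflow.keras import backend as K

a = 0.1  # weight for the y_true == 0 term, as in the question

def custom_loss(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    bce = K.binary_crossentropy(y_true_f, y_pred_f)
    # y_true selects which term applies, so no boolean_mask is needed
    # and an all-zero (or all-one) mini-batch can no longer produce nan
    return K.mean(y_true_f * bce + (1.0 - y_true_f) * a * bce)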
Given a certain text input, I'm trying to create an output that has the same semantic encoding as the input. For that, I trained an autoencoder and kept only the encoder part to compare the sequence embeddings. This is the code that trains the new decoder:
with tf.GradientTape() as gen_tape:
    enc_output, enc_hidden = enc(input_batch, enc_hidden)
    gen_hidden = enc_hidden
    all_outputs = [[tokenizer.word_index[START_TOKEN]] * BATCH_SIZE]
    gen_input = tf.expand_dims([tokenizer.word_index[START_TOKEN]] * BATCH_SIZE, 1)  # first input is a batch of start tokens
    gen_loss = 0
    for t in range(1, input_batch.shape[1]):
        predictions, gen_hidden, _ = gen(gen_input, gen_hidden, enc_output)
        predictions_am = tf.expand_dims(tf.argmax(predictions, 1), 1)  # take the most likely prediction for each row
        all_outputs.append(tf.argmax(predictions, 1))
        gen_input = predictions_am  # predicted IDs are fed back into the model
    all_outputs = tf.stack(all_outputs, 1)  # build the list of full-length predictions
    # get the embedding vectors for the original and the predictions
    e1 = enc(all_outputs, enc.get_def_hidden_state())[0]
    e2 = enc_output
    gen_loss = -tf.keras.losses.cosine_similarity(e1, e2) + 1  # loss based on how similar they are
gen_grads = gen_tape.gradient(gen_loss, gen.trainable_weights)
gen_optimizer.apply_gradients(zip(gen_grads, gen.trainable_weights))
gen_grads always ends up being a list of None values.
Argmax is not differentiable, so you can't have it in the model outputs that are used for the loss calculation. You need to keep the full (soft) prediction vectors as they are until the end.
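A tiny self-contained demonstration of the problem (not from the original post): argmax returns integer indices, so autograd has no path back through it and the gradient comes out as None:

import tensorflow as tf

x = tf.Variable([[0.1, 0.7, 0.2]])
with tf.GradientTape() as tape:
    idx = tf.argmax(x, axis=1)                  # integer indices: breaks the graph
    loss = tf.reduce_sum(tf.cast(idx, tf.float32))
print(tape.gradient(loss, x))                   # -> None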
What is the proper way to clip ReLU activations with a learnable threshold? Here's how I implemented it, however I'm not sure if this is correct:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.act_max = nn.Parameter(torch.Tensor([0]), requires_grad=True)
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.pool = nn.MaxPool2d(2, 2)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(64 * 5 * 5, 10)

    def forward(self, input):
        conv1 = self.conv1(input)
        pool1 = self.pool(conv1)
        relu1 = self.relu(pool1)
        relu1[relu1 > self.act_max] = self.act_max
        conv2 = self.conv2(relu1)
        pool2 = self.pool(conv2)
        relu2 = self.relu(pool2)
        relu2 = relu2.view(relu2.size(0), -1)
        linear = self.linear(relu2)
        return linear
model = Net()
# kaiming_normal_ initializes a single tensor, so apply it per weight matrix
for p in model.parameters():
    if p.dim() > 1:
        torch.nn.init.kaiming_normal_(p)
nn.init.constant_(model.act_max, 1.0)
model = model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
for epoch in range(100):
    for i in range(1000):
        output = model(input)
        loss = nn.CrossEntropyLoss()(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        model.act_max.data = model.act_max.data - 0.001 * model.act_max.grad.data
I had to add the last line because without it the value would not update for some reason.
UPDATE: I am now trying a method to compute the upper bound (act_max) based on the gradients of the activations:
For all activations above the threshold (relu1[relu1 > self.act_max]), look at their gradients: compute the average direction all these gradients point to.
For all positive activations below the threshold, compute the average direction their gradients point to as well.
The sum of these average gradients determines the direction and magnitude of the change for act_max.
There are two problems with that code.
The implementation-level one is that you're using an in-place operation, which generally doesn't work well with autograd. Instead of
relu1[relu1 > self.act_max] = self.act_max
you should use an out-of-place operation like
relu1 = torch.where(relu1 > self.act_max, self.act_max, relu1)
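In context, the forward pass from the question might then look like this (a sketch assuming the same module attributes as in the question):

def forward(self, input):
    relu1 = self.relu(self.pool(self.conv1(input)))
    # out-of-place clip keeps the autograd history intact
    relu1 = torch.where(relu1 > self.act_max, self.act_max, relu1)
    relu2 = self.relu(self.pool(self.conv2(relu1)))
    return self.linear(relu2.view(relu2.size(0), -1))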
The other is more general: neural networks are usually trained with gradient-descent methods, and threshold values can have no gradient - the loss function is not differentiable with respect to thresholds.
In your model you're using a dirty workaround (whether you write it as you did or use torch.where): model.act_max.grad.data is only defined because some elements have their value set to model.act_max. But this gradient knows nothing about why they were set to that value. To make things more concrete, let's define a cutoff operation C(x, t), which indicates whether x is above or below the threshold t
C(x, t) = 1 if x < t else 0
and write your clipping operation as a product
clip(x, t) = C(x, t) * x + (1 - C(x, t)) * t
You can then see that the threshold t has a twofold meaning: it controls when to cut off (inside C) and it controls the value above the cutoff (the trailing t). We can therefore generalize the operation as
clip(x, t1, t2) = C(x, t1) * x + (1 - C(x, t1)) * t2
The problem with your operation is that it is only differentiable with respect to t2, but not t1. Your solution ties the two together so that t1 == t2, but it is still the case that gradient descent acts as if the threshold itself could not be changed, only the above-the-threshold value.
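To make this concrete, here is a small self-contained check (an illustration, not part of the original answer): the gradient with respect to t counts one unit per clipped element, i.e. only t's role as the replacement value; the cutoff condition itself contributes nothing:

import torch

x = torch.tensor([0.5, 1.5, 2.5])
t = torch.tensor(1.0, requires_grad=True)
clipped = torch.where(x > t, t, x)   # the clip(x, t) from above
clipped.sum().backward()
print(t.grad)  # tensor(2.) -- one per clipped element (1.5 and 2.5)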
For this reason, in general, your thresholding operation may not learn the value you hope it learns. This is something to keep in mind when developing your operations, but it is not a guarantee of failure - in fact, if you consider the standard ReLU applied to the biased output of some linear unit, we get a similar picture. We define the cutoff operation H
H(x, t) = 1 if x > t else 0
and ReLU as
ReLU(x + b, t) = (x + b) * H(x + b, t) = (x + b) * H(x, t - b)
where we could again generalize to
ReLU(x, b, t) = (x + b) * H(x, t)
and again we can only learn b, while t implicitly follows b. Yet it seems to work :)
I need to create a neural network that approximates a function given its parameters. I give four parameters to my neural network (A, x0, phi, omega) and I want to obtain, as output,
A sin(omega x + phi) + x0
(I need this net as a part of another network)
However, I am not able to train the network; convergence is very poor. Why is that?
I use a fully connected network with three hidden layers. This is the code:
# imports reconstructed from usage in the snippet
from numpy import tan, sin, pi, linspace, tile, hstack, zeros, random
from keras.layers import Input, Dense
from keras.models import Model

def get_batches(N_batches):
    A = tan(random.uniform(low=0., high=2*pi, size=[N_batches, 1]))
    x0 = random.randn(N_batches, 1) * 10
    omega = random.uniform(low=0., high=10*pi, size=[N_batches, 1])
    phi = random.uniform(low=0., high=2*pi, size=[N_batches, 1])
    x = linspace(0, t_max, n_max)
    x = tile(x, N_batches).reshape(N_batches, n_max)
    return (A * sin(omega * x + phi) + x0, hstack([A, x0, phi, omega]))

N_batches = 80
N_epochs = 50
t_max = 5.0
n_max = 100
n_par = 4

net_layers = []
net_inp = Input(shape=(n_par,))
net_layers.append(Dense(25, input_shape=(n_par,), activation="relu"))
net_layers.append(Dense(25, activation="relu"))
net_layers.append(Dense(25, activation="relu"))
net_layers.append(Dense(n_max, activation="linear"))

net_l = net_inp
for i in range(len(net_layers)):
    net_l = net_layers[i](net_l)

net = Model(net_inp, net_l)
net.compile(loss="mean_squared_error", optimizer="adam")

costs = zeros(N_epochs)
for i in range(N_epochs):
    y_true, y_in = get_batches(N_batches)
    costs[i] = net.train_on_batch(y_in, y_true)
Even if I train for longer, I don't get better results than this picture (approximated function and real function plotted for a test sample):
[plot not shown]
The plot of the cost function is also quite strange:
[plot not shown]
What mistakes did I make? Thank you!