I would like to predict a multi-dimensional array using Long Short-Term Memory (LSTM) networks while imposing restrictions on the shape of the surface of interest.
I thought to accomplish this by setting some elements of the output (regions of the surface) in a functional relationship to others (simple scaling conditions).
Is it possible in Keras to define such a custom activation function for the output layer, one whose arguments are other output nodes?
If not, is there another interface that allows this? Can you point me to any documentation?

The Keras team answered a question on GitHub about how to write a custom activation function.
There is also a question that includes code for a custom activation function.
These pages may help you!
Additional comment
These pages were not enough for this question, so I am adding the comment below.
PyTorch may be easier to customize than Keras. I tried writing such a network, although a very simple one, based on the PyTorch tutorials and "Extending PyTorch with Custom Activation Functions".
I made a custom activation function in which element 1 of each output vector (counting from 0) is set to twice element 0. A very simple one-layer network was used for training. After training, I checked that the condition was satisfied.
import torch
import matplotlib.pyplot as plt
# Define the custom activation function
# reference:
def silu(input):
    # Despite the name (kept from the referenced example), this is not SiLU:
    # it overwrites column 1 with twice column 0, in place.
    input[:, 1] = input[:, 0] * 2
    return input
class SiLU(torch.nn.Module):
    def __init__(self):
        super().__init__()  # init the base class
    def forward(self, input):
        return silu(input)  # apply the custom constraint defined above
# Training
# reference:
k = 10
x = torch.rand([k, 3])
y = x * 2
model = torch.nn.Sequential(
    torch.nn.Linear(3, 3),
    SiLU(),  # custom activation function
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-3
for t in range(2000):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    model.zero_grad()  # clear gradients left over from the previous step
    loss.backward()  # populate param.grad
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
# check the behaviour
yy = model(x)  # predicted
print('ground truth')
# examples for the first five data
colorlist = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
for i in range(5):
    plt.plot(y[i,:].detach().numpy(), linestyle = "solid", label = "ground truth_" + str(i), color=colorlist[i])
    plt.plot(yy[i,:].detach().numpy(), linestyle = "dotted", label = "predicted_" + str(i), color=colorlist[i])
plt.legend()
plt.show()
# check if the custom activation works correctly
plt.plot(yy[:,0].detach().numpy()*2, label = '0th * 2')
plt.plot(yy[:,1].detach().numpy(), label = '1th')
plt.legend()
plt.show()
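The same constraint can also be written without modifying the tensor in place, which some people find easier to reason about with autograd. The following is only a sketch: the class name `TiedOutput` and its `src`, `dst` and `scale` parameters are hypothetical and do not come from the PyTorch tutorials.

```python
import torch


class TiedOutput(torch.nn.Module):
    """Hypothetical layer: forces output[:, dst] = scale * output[:, src]."""

    def __init__(self, src=0, dst=1, scale=2.0):
        super().__init__()
        self.src, self.dst, self.scale = src, dst, scale

    def forward(self, x):
        # Split into columns, replace the tied column, and reassemble,
        # so no tensor is written to in place.
        cols = list(x.unbind(dim=1))
        cols[self.dst] = self.scale * cols[self.src]
        return torch.stack(cols, dim=1)


torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(3, 3), TiedOutput())
x = torch.rand(10, 3)
out = model(x)
out.sum().backward()  # gradients reach the Linear layer through the tied column
```

Because the tied column is computed from the source column, the relationship holds for every forward pass, before and after training.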


How to add a custom loss function to Keras that solves an ODE?

I'm new to Keras, sorry if this is a silly question!
I am trying to get a single-layer neural network to find the solution to a first-order ODE. The neural network N(x) should be the approximate solution to the ODE. I defined the right-hand side function f, and a transformed function g that includes the boundary conditions. I then wrote a custom loss function that only minimises the residual of the approximate solution. I created some empty data for the optimizer to iterate over, and set it going. The optimizer does not seem to be able to adjust the weights to minimize this loss function. Am I thinking about this wrong?
# Define initial condition
A = 1.0
# Define empty training data
x_train = np.empty((10000, 1))
y_train = np.empty((10000, 1))
# Define transformed equation (forced to satisfy boundary conditions)
g = lambda x: N(x.reshape((1000,))) * x + A
# Define rhs function
f = lambda x: np.cos(2 * np.pi * x)
# Define loss function
def OdeLoss(g, f):
    def loss(y_true, y_pred):
        x = np.linspace(0, 1, 1000)
        R = K.sum(((g(x+epsilon)-g(x))/epsilon - f(x))**2)
        return R
    return loss
# Define input tensor
input_tensor = tf.keras.Input(shape=(1,))
# Define hidden layer
hidden = tf.keras.layers.Dense(32)(input_tensor)
# Define non-linear activation layer
activate = tf.keras.activations.relu(hidden)
# Define output tensor
output_tensor = tf.keras.layers.Dense(1)(activate)
# Define neural network
N = tf.keras.Model(input_tensor, output_tensor)
# Compile model
N.compile(loss=OdeLoss(g, f), optimizer='adam')
# Train model
history = N.fit(x_train, y_train, batch_size=1, epochs=1, verbose=1)
The method is based on Lecture 3.2 of MIT course 18.337J by Chris Rackauckas, who does this in Julia. Cheers!
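For reference, here is a minimal sketch of the approach described above, written in PyTorch rather than Keras (an assumption made only to keep the example short and self-contained). The trial solution g(x) = A + x N(x) satisfies g(0) = A by construction, and the residual g'(x) - f(x) is computed with automatic differentiation rather than a NumPy finite difference, so the loss stays connected to the network weights. The network size, the tanh activation, the learning rate and the number of steps are illustrative choices, not taken from the lecture.

```python
import math
import torch

torch.manual_seed(0)
A = 1.0  # initial condition g(0) = A

# Small network N(x); tanh keeps N differentiable everywhere.
N = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)


def g(x):
    """Trial solution that satisfies g(0) = A by construction."""
    return A + x * N(x)


def f(x):
    """Right-hand side of the ODE g'(x) = f(x)."""
    return torch.cos(2 * math.pi * x)


def ode_loss(x):
    x = x.clone().requires_grad_(True)
    gx = g(x)
    # dg/dx for every sample; create_graph=True keeps the derivative
    # differentiable with respect to the network weights.
    dg_dx, = torch.autograd.grad(gx.sum(), x, create_graph=True)
    return ((dg_dx - f(x)) ** 2).mean()


x_train = torch.linspace(0, 1, 100).reshape(-1, 1)
optimizer = torch.optim.Adam(N.parameters(), lr=1e-2)
for step in range(200):
    optimizer.zero_grad()
    loss = ode_loss(x_train)
    loss.backward()
    optimizer.step()
final_loss = ode_loss(x_train).item()
```

Note that the loss depends only on the collocation points `x_train`; no target values are needed, which is why the empty `y_train` in the question carries no information.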

How to compute the parameter importance in pytorch?

I want to develop a lifelong learning system, so I need to prevent important parameters from changing. I read the paper 'Memory Aware Synapses: Learning what (not) to forget', which describes a method for this. I need to calculate the gradient with respect to each parameter for each input image. How should I write this in PyTorch?
You can do this with the standard optimization procedure and the .backward() method on your loss.
First, scaling as defined in your link:
class Scaler:
    def __init__(self, parameters, delta):
        # Materialize the iterator so it can be reused on every step
        self.parameters = list(parameters)
        self.delta = delta
    def step(self):
        """Multiplies gradients in place."""
        for param in self.parameters:
            if param.grad is None:
                raise ValueError("backward() has to be called before running scaler")
            param.grad *= self.delta
It can be used just like optimizer.step() (see the comments):
import torch
EPOCHS = 10  # placeholder value (assumption); choose what suits your task
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
)
scaler = Scaler(model.parameters(), delta=0.001)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
X, y = torch.randn(64, 10), torch.randn(64, 1)  # target shape matches the model output
# Optimization loop
for _ in range(EPOCHS):
    output = model(X)
    loss = criterion(output, y)
    loss.backward()  # Now model has the gradients
    optimizer.step()  # Optimize model's parameters
    scaler.step()  # Scale gradients
    optimizer.zero_grad()  # Zero gradient before next step
After scaler.step(), the scaled gradients are available in param.grad for each parameter (just as they are accessed inside Scaler's step method), so you can do whatever you want with them.
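As a complement, here is a minimal sketch of the per-input computation the question asks about. It assumes the definition used in the Memory Aware Synapses paper, where a parameter's importance is the average, over inputs, of the absolute gradient of the squared L2 norm of the model output with respect to that parameter. The helper name `parameter_importance` and the toy model are hypothetical.

```python
import torch


def parameter_importance(model, inputs):
    """Average |d ||model(x)||^2 / d theta| over the samples in `inputs`."""
    params = [p for p in model.parameters() if p.requires_grad]
    importance = [torch.zeros_like(p) for p in params]
    for x in inputs:
        output = model(x.unsqueeze(0))  # forward pass for a single sample
        score = output.pow(2).sum()     # squared L2 norm of the output
        grads = torch.autograd.grad(score, params)
        for imp, grad in zip(importance, grads):
            imp += grad.abs()           # accumulate gradient magnitudes
    return [imp / len(inputs) for imp in importance]


torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1)
)
X = torch.randn(8, 10)
omega = parameter_importance(model, X)  # one tensor per parameter
```

Processing one sample at a time is slow but keeps the gradients separate per input; no labels are needed because the score is computed from the output alone.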

Measuring uncertainty using MC Dropout on pytorch

I am trying to implement a Bayesian CNN using MC Dropout in PyTorch.
The main idea is that by applying dropout at test time and running many forward passes, you get predictions from a variety of different models.
I found an application of MC Dropout, but I did not understand how they applied the method or how exactly they chose the final prediction from the list of predictions.
Here is the code:
def mcdropout_test(model):
    model.train()  # keep dropout active at test time
    test_loss = 0
    correct = 0
    T = 100
    for data, target in test_loader:
        if args.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output_list = []
        for i in xrange(T):
            output_list.append(torch.unsqueeze(model(data), 0))
        output_mean = torch.cat(output_list, 0).mean(0)
        test_loss += F.nll_loss(F.log_softmax(output_mean), target, size_average=False).data[0]  # sum up batch loss
        pred = output_mean.data.max(1, keepdim=True)[1]  # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()
    test_loss /= len(test_loader.dataset)
    print('\nMC Dropout Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
I have replaced
data, target = Variable(data, volatile=True), Variable(target)
by adding
with torch.no_grad(): at the beginning
And this is how I have defined my CNN
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 192, 5, padding=2)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(192, 192, 5, padding=2)
        self.fc1 = nn.Linear(192 * 8 * 8, 1024)
        self.fc2 = nn.Linear(1024, 256)
        self.fc3 = nn.Linear(256, 10)
        self.dropout = nn.Dropout(p=0.3)
        nn.init.constant_(self.conv1.bias, 0.0)
        nn.init.constant_(self.conv2.bias, 0.0)
        nn.init.constant_(self.fc1.bias, 0.0)
        nn.init.constant_(self.fc2.bias, 0.0)
        nn.init.constant_(self.fc3.bias, 0.0)
    def forward(self, x):
        x = self.pool(F.relu(self.dropout(self.conv1(x))))  # recommended to add the relu
        x = self.pool(F.relu(self.dropout(self.conv2(x))))  # recommended to add the relu
        x = x.view(-1, 192 * 8 * 8)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(self.dropout(x)))
        x = self.fc3(self.dropout(x))  # no activation function needed for the last layer
        return x
Can anyone help me to get the right implementation of the Monte Carlo Dropout method on CNN?
Implementing MC Dropout in PyTorch is easy. All that needs to be done is to set the dropout layers of your model to train mode. This allows a different dropout mask to be used in each forward pass. Below is an implementation of MC Dropout in PyTorch illustrating how the predictions from the forward passes are stacked together and used to compute different uncertainty metrics.
import sys
import numpy as np
import torch
import torch.nn as nn
def enable_dropout(model):
    """ Function to enable the dropout layers during test-time """
    for m in model.modules():
        if m.__class__.__name__.startswith('Dropout'):
            m.train()
def get_monte_carlo_predictions(data_loader,
                                forward_passes,
                                model,
                                n_classes,
                                n_samples):
    """ Function to get the monte-carlo samples and uncertainty estimates
    through multiple forward passes
    data_loader : object
        data loader object from the data loader module
    forward_passes : int
        number of monte-carlo samples/forward passes
    model : object
        PyTorch model
    n_classes : int
        number of classes in the dataset
    n_samples : int
        number of samples in the test set
    """
    dropout_predictions = np.empty((0, n_samples, n_classes))
    softmax = nn.Softmax(dim=1)
    for _ in range(forward_passes):
        predictions = np.empty((0, n_classes))
        model.eval()  # put every layer in eval mode ...
        enable_dropout(model)  # ... then re-enable only the dropout layers
        for image, label in data_loader:
            image = image.to(torch.device('cuda'))
            with torch.no_grad():
                output = model(image)
                output = softmax(output)  # shape (batch_size, n_classes)
            predictions = np.vstack((predictions, output.cpu().numpy()))
        dropout_predictions = np.vstack((dropout_predictions,
                                         predictions[np.newaxis, :, :]))
    # dropout predictions - shape (forward_passes, n_samples, n_classes)
    # Calculating mean across multiple MCD forward passes
    mean = np.mean(dropout_predictions, axis=0)  # shape (n_samples, n_classes)
    # Calculating variance across multiple MCD forward passes
    variance = np.var(dropout_predictions, axis=0)  # shape (n_samples, n_classes)
    epsilon = sys.float_info.min
    # Calculating entropy across multiple MCD forward passes
    entropy = -np.sum(mean*np.log(mean + epsilon), axis=-1)  # shape (n_samples,)
    # Calculating mutual information across multiple MCD forward passes
    mutual_info = entropy - np.mean(np.sum(-dropout_predictions*np.log(dropout_predictions + epsilon),
                                           axis=-1), axis=0)  # shape (n_samples,)
    return mean, variance, entropy, mutual_info
Moving on to the implementation which is posted in the question above, multiple predictions from T different forward passes are obtained by first setting the model to train mode (model.train()). Note that this is not desirable because unwanted stochasticity will be introduced in the predictions if there are layers other than dropout such as batch-norm in the model. Hence the best way is to just set the dropout layers to train mode as shown in the snippet above.
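To make the difference concrete, here is a small self-contained sketch on a toy model that contains both batch-norm and dropout. The layer sizes, the number of passes and the input data are arbitrary assumptions. It also shows one common way to pick the final prediction: take the argmax of the softmax output averaged over the passes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier with both batch-norm and dropout (sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(4, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Dropout(p=0.5), nn.Linear(16, 3),
)

model.eval()  # batch-norm uses running statistics, dropout is off
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.train()  # re-enable only the dropout layers

x = torch.randn(5, 4)
T = 20  # number of stochastic forward passes
with torch.no_grad():
    probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(T)])

mean = probs.mean(dim=0)         # (5, 3) averaged predictive distribution
variance = probs.var(dim=0)      # (5, 3) spread across the passes
prediction = mean.argmax(dim=1)  # (5,) class chosen from the averaged output
```

The batch-norm layer stays in eval mode, so the only source of randomness between passes is the dropout mask.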

mse loss function not compatible with regularization loss (add_loss) on hidden layer output

I would like to code in tf.Keras a Neural Network with a couple of loss functions. One is a standard mse (mean squared error) with a factor loading, while the other is basically a regularization term on the output of a hidden layer. This second loss is added through self.add_loss() in a user-defined class inheriting from tf.keras.layers.Layer. I have a couple of questions (the first is more important though).
1) The error I get when trying to combine the two losses together is the following:
ValueError: Shapes must be equal rank, but are 0 and 1
From merging shape 0 with other shapes. for '{{node AddN}} = AddN[N=2, T=DT_FLOAT](loss/weighted_loss/value, model/new_layer/mul_1)' with input shapes: [], [100].
So it comes from the fact that the tensors which should add up to make one unique loss value have different shapes (and ranks). Still, when I try to print the losses during the training, I clearly see that the vectors returned as losses have shape batch_size and rank 1. Could it be that when the 2 losses are summed I have to provide them (or at least the loss of add_loss) as scalar? I know the mse is usually returned as a vector where each entry is the mse from one sample in the batch, hence having batch_size as shape. I think I tried to do the same with the "regularization" loss. Do you have an explanation for this behavio(u)r?
The sample code which gives me error is the following:
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input
def rate_mse(rate=1e5):
    # @tf.function # also needed for printing
    def loss(y_true, y_pred):
        tmp = rate*K.mean(K.square(y_pred - y_true), axis=-1)
        # tf.print('shape %s and rank %s output in mse'%(K.shape(tmp), tf.rank(tmp)))
        tf.print('shape and rank output in mse',[K.shape(tmp), tf.rank(tmp)])
        tf.print('mse loss:',tmp)  # print when I put tf.function
        return tmp
    return loss
class newLayer(tf.keras.layers.Layer):
    def __init__(self, rate=5e-2, **kwargs):
        super(newLayer, self).__init__(**kwargs)
        self.rate = rate
    # @tf.function # to be commented for NN training
    def call(self, inputs):
        tmp = self.rate*K.mean(inputs*inputs, axis=-1)
        tf.print('shape and rank output in regularizer',[K.shape(tmp), tf.rank(tmp)])
        tf.print('regularizer loss:',tmp)
        self.add_loss(tmp, inputs=True)
        return inputs
tot_n = 10000
xx = np.random.rand(tot_n,1)
yy = np.pi*xx
train_size = int(0.9*tot_n)
xx_train = xx[:train_size]; xx_val = xx[train_size:]
yy_train = yy[:train_size]; yy_val = yy[train_size:]
reg_layer = newLayer()
input_layer = Input(shape=(1,)) # input
hidden = Dense(20, activation='relu', input_shape=(2,))(input_layer) # hidden layer
hidden = reg_layer(hidden)
output_layer = Dense(1, activation='linear')(hidden)
model = Model(inputs=[input_layer], outputs=[output_layer])
model.compile(optimizer='Adam', loss=rate_mse(), experimental_run_tf_function=False)
#model.compile(optimizer='Adam', loss=None, experimental_run_tf_function=False)
model.fit(xx_train, yy_train, epochs=100, batch_size = 100,
          validation_data=(xx_val,yy_val), verbose=1)
#new_xx = np.random.rand(10,1); new_yy = np.pi*new_xx
2) I would also have a secondary question related to this code. I noticed that printing with tf.print inside the function rate_mse only works with tf.function. Similarly, the call method of newLayer is only taken into consideration if the same decorator is commented during training. Can someone explain why this is the case or reference me to a possible solution?
Thanks in advance to whoever can provide me help. I am currently using Tensorflow 2.2.0 and keras version is 2.3.0-tf.
I was stuck on the same problem for a few days. The "standard" loss is already a scalar by the time it is added to the loss from add_loss. The only way I got it working was to add one more axis when calculating the mean, so that the regularization loss is also a scalar:
tmp = self.rate*K.mean(inputs*inputs, axis=[0, -1])
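The effect of the extra axis can be checked outside Keras. The sketch below uses NumPy as a stand-in for the Keras backend (an assumption made to keep it self-contained), with the batch size of 100 and the 20 hidden units from the question.

```python
import numpy as np

rate = 5e-2
inputs = np.random.rand(100, 20)  # (batch_size, hidden_units)

# Reducing over the last axis only leaves one value per sample.
per_sample = rate * np.mean(inputs * inputs, axis=-1)

# Reducing over the batch axis as well yields a single scalar.
scalar = rate * np.mean(inputs * inputs, axis=(0, -1))

print(per_sample.shape)  # (100,)
print(scalar.shape)      # ()
```

The rank-1 result has the shape [100] reported in the error message, while the rank-0 result can be summed with the already reduced MSE term.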

How to use the input gradients as variables within a custom loss function in Keras?

I am using the input gradient as a measure of feature importance and want to compare the feature importance of a training data point with the human-annotated feature importance. I would like to make this comparison differentiable so that it can be learned through backpropagation. For that, I am writing a custom loss function that, in addition to the regular loss (e.g. m.s.e. on the prediction vs. true labels), also checks whether the input gradient is correct (e.g. m.s.e. of the input gradient vs. the human-annotated feature importance).
With the following code I am able to get the input gradient:
from keras import backend as K
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense
def normalize(x):
    # utility function to normalize a tensor by its L2 norm
    return x / (K.sqrt(K.mean(K.square(x))) + 1e-5)
# Amount of training samples
N = 1000
input_dim = 10
# Generate training set make the 1st and 2nd feature same as the target feature
X = np.random.standard_normal(size=(N, input_dim))
y = np.random.randint(low=0, high=2, size=(N, 1))
X[:, 1] = y[:, 0]
X[:, 2] = y[:, 0]
# Create simple model
inputs = Input(shape=(input_dim,))
x = Dense(10, name="dense1")(inputs)
output = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[inputs], outputs=output)
# Compile and fit model
model.compile(optimizer='adam', loss="mse", metrics=['accuracy'])
model.fit([X], y, epochs=100, batch_size=64)
# Get function to get input gradients
gradients = K.gradients(model.output, model.input)[0]
gradient_function = K.function([model.input], [normalize(gradients)])
# Get input gradient values of the training-set
grads_val = gradient_function([X])[0]
This prints the following (you can see that the 1st and the 2nd features have the highest importance):
[[ 1.2629046e-02 2.2765596e+00 2.1479919e+00 2.1558853e-02
4.5277486e-03 2.9851785e-03 9.5279224e-04 -1.0903150e-02
-1.2230731e-02 2.1960819e-02]
[ 1.1318034e-02 2.0402350e+00 1.9250139e+00 1.9320872e-02
4.0577268e-03 2.6752844e-03 8.5390132e-04 -9.7713526e-03
-1.0961102e-02 1.9681118e-02]]
How can I write a custom loss function in which the input gradients are differentiable?
I started with the following loss function.
from keras.losses import mean_squared_error
def custom_loss():
    # human annotated feature importance
    # Let's say that it says to only look at the second feature
    human_feature_importance = []
    for i in range(N):
        human_feature_importance.append([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
    def loss(y_true, y_pred):
        # Get regular loss
        regular_loss_value = mean_squared_error(y_true, y_pred)
        # Somehow get the input gradient of each training sample as a tensor
        # It should be differentiable w.r.t. all of the weights
        gradients = ??
        feature_importance_loss_value = mean_squared_error(gradients, human_feature_importance)
        # Combine both losses
        return regular_loss_value + feature_importance_loss_value
    return loss
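One possible way to fill in the `gradients = ??` step is sketched below in PyTorch rather than Keras (an assumption, chosen because its autograd exposes this directly). The input gradient is computed with `create_graph=True`, which keeps it part of the computation graph so that a loss on it can be backpropagated to the weights. The tanh hidden activation, the batch of 64 samples and the helper name `combined_loss` are illustrative choices.

```python
import torch

torch.manual_seed(0)
N, input_dim = 64, 10
X = torch.randn(N, input_dim)
y = torch.randint(0, 2, (N, 1)).float()
# Hypothetical annotation: only the second feature should matter.
human_importance = torch.zeros(N, input_dim)
human_importance[:, 1] = 1.0

model = torch.nn.Sequential(
    torch.nn.Linear(input_dim, 10), torch.nn.Tanh(),
    torch.nn.Linear(10, 1), torch.nn.Sigmoid(),
)


def combined_loss(model, X, y, target_importance):
    X = X.clone().requires_grad_(True)
    y_pred = model(X)
    regular = torch.nn.functional.mse_loss(y_pred, y)
    # create_graph=True keeps the input gradient differentiable
    # with respect to the model weights.
    input_grad, = torch.autograd.grad(y_pred.sum(), X, create_graph=True)
    importance = torch.nn.functional.mse_loss(input_grad, target_importance)
    return regular + importance, input_grad


optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss, input_grad = combined_loss(model, X, y, human_importance)
optimizer.zero_grad()
loss.backward()  # backpropagates through the input gradient as well
optimizer.step()
```

Summing `y_pred` before differentiating gives one gradient row per sample, because each prediction depends only on its own input row.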
I also found an implementation in TensorFlow that makes the input gradient differentiable:
