Do PyTorch loss() and backpropagation understand lambda layers?

I've been working with a resnet56 model from the code provided here: https://github.com/akamaster/pytorch_resnet_cifar10/blob/master/resnet.py.
I noticed that the implementation is different from many of the other available ResNet examples online, and I was wondering if PyTorch's backpropagation algorithm using loss() can account for the lambda layer and shortcut in the code provided.
If that is the case, can anyone provide insight into how PyTorch is able to interpret the lambda layer for backpropagation (i.e. how does PyTorch know how to differentiate with respect to the layer's operations)?
P.S. I also had to modify the code to fit my own use case, and it seems like my implementation with option == 'A' does not produce great results. This may simply be because option == 'B', which uses convolutional layers instead of padding, is better for my data.
self.shortcut = nn.Sequential()
if stride != 1 or in_planes != planes:
    if option == 'A':
        # number of zero channels padded before/after the existing channels
        top = int((self.expansion * planes - in_planes) / 2)
        bot = (self.expansion * planes - in_planes) - top
        # identity shortcut: subsample spatially with the stride, zero-pad the channels
        self.shortcut = LambdaLayer(lambda x:
            F.pad(x[:, :, ::stride, ::stride], (0, 0, 0, 0, top, bot), "constant", 0))

"I was wondering if PyTorch's backpropagation algorithm using loss() can account for the lambda layer and shortcut in the code provided."
PyTorch has no problem backpropagating through lambda functions. Your LambdaLayer just defines the Module's forward pass as the evaluation of the lambda function, so your question boils down to whether PyTorch can backpropagate through lambda functions.
"If that is the case, can anyone provide insight into how PyTorch is able to interpret the lambda layer for backpropagation (i.e. how does PyTorch know how to differentiate with respect to the layer's operations)?"
The lambda function applies torch.nn.functional.pad to x, which PyTorch can backpropagate through because pad has a defined backward() function.
PyTorch handles lambda functions the same way any autodiff tool handles any function: it breaks the computation into primitive operations and uses the differentiation rules for each primitive to build up the derivative of the entire computation.
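As a minimal sketch of this (my own toy example, not taken from the linked repo; the LambdaLayer wrapper below is the usual forward-through-a-lambda module and the shapes are made up), you can confirm that gradients flow through a lambda-based shortcut by calling backward() and inspecting .grad:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LambdaLayer(nn.Module):
    def __init__(self, lambd):
        super().__init__()
        self.lambd = lambd

    def forward(self, x):
        # the forward pass is just the lambda; autograd records the slicing and
        # F.pad operations, both of which have registered backward rules
        return self.lambd(x)

stride, in_planes, planes = 2, 16, 32
pad = (planes - in_planes) // 2
shortcut = LambdaLayer(lambda x:
    F.pad(x[:, :, ::stride, ::stride], (0, 0, 0, 0, pad, pad), "constant", 0))

x = torch.randn(4, in_planes, 8, 8, requires_grad=True)
loss = shortcut(x).sum()
loss.backward()
print(x.grad.shape)  # torch.Size([4, 16, 8, 8]): gradients flowed back through the lambda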

Related

How pytorch implement forward for a quantized linear layer?

I have a quantized model in PyTorch and I want to extract the parameters of the quantized linear layer and implement the forward pass manually.
I searched the source code but only found this function:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    return torch.ops.quantized.linear(
        x, self._packed_params._packed_params, self.scale, self.zero_point)
But nowhere can I find how torch.ops.quantized.linear is defined.
Can someone give me a hint as to how the forward pass of the quantized linear layer is defined?
In answer to the question of where torch.ops.quantized.linear is defined: I was looking for the same thing but was never able to find it. I believe it's implemented somewhere in ATen (the C++ backend). I did, however, find some useful PyTorch-based implementations in the NVIDIA TensorRT repo below; it's quite possible these are what PyTorch ultimately calls into via compiled extensions. If you're trying to add quantization to a custom layer, these implementations walk you through it.
You can find the docs here and the GitHub page here.
For the linear layer specifically, see the QuantLinear layer here
Under the hood, this calls TensorQuantFunction.apply() for post-training quantization or FakeTensorQuantFunction.apply() for quantization-aware training.
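If the goal is just to reproduce the quantized linear forward pass numerically, here is a rough sketch of my own (it is not the fbgemm/qnnpack kernel PyTorch actually dispatches to, so exact bit-for-bit agreement is not guaranteed): unpack the weight and bias, run a float linear on the dequantized tensors, and requantize with the layer's output scale and zero point.

import torch

def manual_quantized_linear(x_q, qlinear):
    # x_q: a per-tensor quantized input (e.g. from torch.quantize_per_tensor)
    # qlinear: a torch.nn.quantized.Linear module
    w_q = qlinear.weight()   # quantized weight
    b = qlinear.bias()       # float bias
    # do the matmul in float on the dequantized representations
    out_fp = torch.nn.functional.linear(x_q.dequantize(), w_q.dequantize(), b)
    # requantize with the layer's output scale / zero point
    return torch.quantize_per_tensor(out_fp, qlinear.scale, qlinear.zero_point, torch.quint8)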

How to create a custom keras loss function with opencv?

I'm developing a machine learning model using Keras and I notice that the available loss functions are not giving the best results on my test set.
I am using a U-Net architecture, where I input a (16,16,3) image and the net also outputs a (16,16,3) picture (auto-encoder). I think one way to improve the model would be to use a loss function that compares, pixel by pixel, the gradients (Laplacian) of the net output and the ground truth. However, I have not found any tutorial that handles this kind of application, because it would need to apply OpenCV's Laplacian function to each output image from the net.
The loss function would be something like this:
def laplacian_loss(y_true, y_pred):
    # y_true already is the calculated gradients, only needs to compute on the y_pred
    # calculates the gradients for each predicted image
    y_pred_lap = []
    for img in y_pred:
        laplacian = cv2.Laplacian(np.float64(img), cv2.CV_64F)
        y_pred_lap.append(laplacian)
    y_pred_lap = np.array(y_pred_lap)
    # mean squared error, according to keras losses documentation
    return K.mean(K.square(y_pred_lap - y_true), axis=-1)
Has anyone done something like that for loss calculation?
Given the code above, it seems this would be equivalent to using a Lambda() layer as the output layer that applies that transformation to the image, before computing the mean squared error.
Regardless of whether it is implemented as a Lambda() layer or in the loss function, the transformation needs to be something TensorFlow knows how to differentiate. The simplest way to do this would probably be to reimplement the cv2.Laplacian computation using TensorFlow math operations.
In order to use the cv2 library directly, you would need to define the gradients for what happens inside the cv2 call yourself, which seems significantly more error prone.
Gradient descent optimisation relies on being able to compute gradients from the inputs to the loss and back. Any operation in the middle must be differentiable, and TensorFlow must understand the math operations for auto-differentiation to work; otherwise you need to add the gradients manually.
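A minimal sketch of that suggestion, assuming channels-last float32 images with a known channel count (as in the (16,16,3) case above); the Laplacian is applied per channel with a depthwise convolution built only from TensorFlow ops, so autodiff works end to end:

import tensorflow as tf
from tensorflow.keras import backend as K

_lap = tf.constant([[1.,  1., 1.],
                    [1., -8., 1.],
                    [1.,  1., 1.]])

def laplacian_loss(y_true, y_pred):
    channels = y_pred.shape[-1]
    # one 3x3 Laplacian filter per channel (depthwise convolution)
    kernel = tf.tile(_lap[:, :, None, None], [1, 1, channels, 1])
    y_pred_lap = tf.nn.depthwise_conv2d(y_pred, kernel,
                                        strides=[1, 1, 1, 1], padding='SAME')
    # y_true is assumed to already hold the target Laplacian, as in the question
    return K.mean(K.square(y_pred_lap - y_true), axis=-1)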
I managed to reach an easy solution. The key point is that the gradient calculation is actually a 2D filter; for more information about it, please follow the link about the Laplacian kernel. So the output of my network needs to be filtered by the Laplacian kernel. For that, I created an extra convolutional layer with fixed weights set exactly to the Laplacian kernel. After that, the network has two outputs (one being the desired image, the other being the gradient image), so it is also necessary to define both losses.
To make it clearer, here is an example. At the end of the network you'll have something like:
channels = 3  # number of channels of the network output
# net_output is the final image layer of your network; give it name='enhanced'
# so it matches the loss dictionary below
lap = Conv2D(channels, (3, 3), padding='same', name='laplacian')(net_output)
model = Model(inputs=[net_input], outputs=[net_output, lap])
Define how you want to calculate the losses for each output:
# losses for the enhanced output and the laplacian output
losses = {
    "enhanced": "mse",
    "laplacian": "mse"
}
lossWeights = {"enhanced": 1.0, "laplacian": 0.6}
Compile the model:
model.compile(optimizer=Adam(), loss=losses, loss_weights=lossWeights)
Define the Laplacian kernel, apply its values to the weights of the convolutional layer above, and set trainable to False (so it won't be updated).
# laplacian kernel, applied independently to each channel
# (Conv2D weights have shape (kernel_h, kernel_w, in_channels, out_channels))
lap = np.asarray([[1,  1, 1],
                  [1, -8, 1],
                  [1,  1, 1]], dtype=np.float32)
l = np.zeros((3, 3, channels, channels), dtype=np.float32)
for c in range(channels):
    l[:, :, c, c] = lap  # output channel c is the laplacian of input channel c
bias = np.zeros(channels, dtype=np.float32)

wl = [l, bias]
model.get_layer('laplacian').set_weights(wl)
model.get_layer('laplacian').trainable = False
When training, remember that you need two sets of ground-truth values, keyed by the output names:
model.fit(x=X, y={"enhanced": y_out, "laplacian": y_lap})
Observation: Do not utilize the BatchNormalization layer! In case you use it, the weights in the laplacian layer will be updated!

Normalization of input data in Keras

One common task in DL is that you normalize input samples to zero mean and unit variance. One can "manually" perform the normalization using code like this:
mean = np.mean(X, axis = 0)
std = np.std(X, axis = 0)
X = [(x - mean)/std for x in X]
However, one must then keep the mean and std values around to normalize the testing data, in addition to the trained Keras model. Since the mean and std are parameters estimated from the data, perhaps Keras can learn them? Something like this:
m = Sequential()
m.add(SomeKerasLayerForNormalizing(...))
m.add(Conv2D(20, (5, 5), input_shape = (21, 100, 3), padding = 'valid'))
# ... rest of network
m.add(Dense(1, activation = 'sigmoid'))
I hope you understand what I'm getting at.
Add BatchNormalization as the first layer and it works as expected, though not exactly like the OP's example. You can see the detailed explanation here.
Both the OP's example and batch normalization use a learned mean and standard deviation of the input data during inference. But the OP's example uses a simple mean that gives every training sample equal weight, while the BatchNormalization layer uses a moving average that gives recently-seen samples more weight than older samples.
Importantly, batch normalization works differently from the OP's example during training. During training, the layer normalizes its output using the mean and standard deviation of the current batch of inputs.
A second distinction is that the OP's code produces an output with a mean of zero and a standard deviation of one. Batch Normalization instead learns a mean and standard deviation for the output that improves the entire network's loss. To get the behavior of the OP's example, Batch Normalization should be initialized with the parameters scale=False and center=False.
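A minimal sketch of that setup (my own, reusing the (21, 100, 3) input shape from the question; the Flatten layer is only there so the toy model compiles):

from tensorflow import keras
from tensorflow.keras.layers import BatchNormalization, Conv2D, Flatten, Dense

m = keras.Sequential([
    # normalizes the inputs; scale=False and center=False drop the learned affine,
    # so the output stays approximately zero mean / unit variance as in the OP's code
    BatchNormalization(scale=False, center=False, input_shape=(21, 100, 3)),
    Conv2D(20, (5, 5), padding='valid'),
    # ... rest of network
    Flatten(),
    Dense(1, activation='sigmoid'),
])
m.compile(optimizer='adam', loss='binary_crossentropy')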
There's now a Keras layer for this purpose, Normalization. At time of writing it is in the experimental module, keras.layers.experimental.preprocessing.
https://keras.io/api/layers/preprocessing_layers/core_preprocessing_layers/normalization/
Before you use it, you call the layer's adapt method with the data X you want to derive the scale from (i.e. mean and standard deviation). Once you do this, the scale is fixed (it does not change during training). The scale is then applied to the inputs whenever the model is used (during training and prediction).
from keras.layers.experimental.preprocessing import Normalization
norm_layer = Normalization()
norm_layer.adapt(X)
model = keras.Sequential()
model.add(norm_layer)
# ... Continue as usual.
Maybe you can use sklearn.preprocessing.StandardScaler to scale your data.
This object lets you save the scaling parameters,
and you can then use mixed-data inputs to your model, say:
Your_model
[param1_scaler, param2_scaler]
Here is a link https://www.pyimagesearch.com/2019/02/04/keras-multiple-inputs-and-mixed-data/
https://keras.io/getting-started/functional-api-guide/
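A small sketch of that idea (placeholder data and file name; the scaler is fitted on flattened training data and persisted next to the model so the exact same scaling is reused at test time):

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 21, 100, 3)  # placeholder training data
scaler = StandardScaler()
# StandardScaler expects 2D input, so flatten each sample before fitting
X_train_scaled = scaler.fit_transform(X_train.reshape(len(X_train), -1)).reshape(X_train.shape)
joblib.dump(scaler, 'input_scaler.pkl')    # keep this file alongside the Keras model

# ... at inference time ...
scaler = joblib.load('input_scaler.pkl')
X_test = np.random.rand(10, 21, 100, 3)
X_test_scaled = scaler.transform(X_test.reshape(len(X_test), -1)).reshape(X_test.shape)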
There's BatchNormalization, which learns the mean and standard deviation of the input. I haven't tried using it as the first layer of the network, but as I understand it, it should do something very similar to what you're looking for.

How to use sample weights in custom loss function in Keras?

I use a custom loss function in Keras, and now I want to use sample weights with it.
I've searched on Google and some articles suggest model.fit(X, y, sample_weight=custom_weights).
But I want to use the sample weights directly in the custom loss function. My custom loss function is quite complex and for some reason I need to process the sample weights directly.
For example:
custom_weights = np.array([1,2,3,4,5,6,7,8,9,10])

# my failed attempt
def custom_loss_function(y_true, y_pred, custom_weights):
    return K.mean(K.abs(y_pred - y_true) * custom_weights, axis=-1)
Note: my real custom_loss_function is very complex. In this question I use MAE as an example to simplify the problem, so we can focus on answering "how to use sample weights in custom_loss_function".
How do I do this correctly?
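For reference, one common workaround (a sketch of my own, not taken from this thread; the tiny model and names are illustrative) is to append the per-sample weight as an extra column of y_true and split it back out inside the custom loss, so the weight is available directly:

import numpy as np
from tensorflow import keras
from tensorflow.keras import backend as K

def weighted_mae(y_true_and_weight, y_pred):
    y_true = y_true_and_weight[:, :-1]  # real targets
    weight = y_true_and_weight[:, -1:]  # per-sample weight
    return K.mean(K.abs(y_pred - y_true) * weight, axis=-1)

model = keras.Sequential([
    keras.layers.Dense(8, activation='relu', input_shape=(4,)),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss=weighted_mae)

X = np.random.rand(10, 4).astype('float32')
y = np.random.rand(10, 1).astype('float32')
custom_weights = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='float32').reshape(-1, 1)

# append the weight column to the targets before fitting
model.fit(X, np.concatenate([y, custom_weights], axis=-1), epochs=1)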

why do keras-rl examples always choose linear activation in the output layer?

I'm a complete newbie to reinforcement learning, and I have a question about the choice of activation function for the output layer of the keras-rl agents. All the examples provided by keras-rl (https://github.com/matthiasplappert/keras-rl/tree/master/examples) use a linear activation function in the output layer. Why is this? What effect should I expect if I go with a different activation function? For example, for an OpenAI environment with a discrete action space of 5, should I consider using softmax in the output layer for an agent?
Thanks much in advance.
For some of the agents in keras-rl a linear activation function is used even though the agents work with discrete action spaces (for example, DQN and DDQN). CEM, on the other hand, uses a softmax activation for discrete action spaces (which is what one would expect).
The reason behind the linear activation for DQN and DDQN is the exploration policy, which is part of the agent. If we take the exploration policy class used for both of them and look at its select_action method, we see the following:
class BoltzmannQPolicy(Policy):
    def __init__(self, tau=1., clip=(-500., 500.)):
        super(BoltzmannQPolicy, self).__init__()
        self.tau = tau
        self.clip = clip

    def select_action(self, q_values):
        assert q_values.ndim == 1
        q_values = q_values.astype('float64')
        nb_actions = q_values.shape[0]

        exp_values = np.exp(np.clip(q_values / self.tau, self.clip[0], self.clip[1]))
        probs = exp_values / np.sum(exp_values)
        action = np.random.choice(range(nb_actions), p=probs)
        return action
When an action has to be chosen, the raw Q-values coming out of the linear activation of the last dense layer are transformed by the Boltzmann exploration policy into probabilities in the range [0, 1], and the action is selected by sampling from that distribution. That's why softmax is not used in the output layer.
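As a quick standalone illustration (made-up Q-values, mirroring what select_action does above): the network's unbounded outputs only become probabilities inside the policy, not in the output layer itself.

import numpy as np

q_values = np.array([1.2, -0.3, 2.5, 0.0, 0.7])  # hypothetical Q-values for 5 actions
tau = 1.0
exp_values = np.exp(np.clip(q_values / tau, -500., 500.))
probs = exp_values / np.sum(exp_values)  # Boltzmann (softmax) over the Q-values
action = np.random.choice(len(q_values), p=probs)
print(probs.round(3), action)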
You can read more about different exploration strategies and their comparison here:
https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-7-action-selection-strategies-for-exploration-d3a97b7cceaf
