Are the weight matrices of the residual blocks already set to 0, or do we need to train them to be close to 0?
In which cases do we backpropagate through the weight matrices of the residual block, and when do we skip them and take the alternate route?
Backpropagation happens through both paths. If you have a concat layer that concatenates a block B, which is right above it, with a layer A, which is concatenated as the residue, then the gradient to A comes from both the concat layer and the layer after A. Such skip connections are made to counter vanishing gradients in deep networks (the backpropagated gradient gets smaller as it passes through more layers).
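A minimal sketch of why the gradient reaches the input through both paths. It assumes a toy scalar residual block y = x + w*x; the names residual_block and numeric_grad are illustrative and not from any library. Even when the block weight w is exactly 0, the skip path still passes a gradient of 1.

```python
def residual_block(x, w):
    """Toy residual block: identity skip path plus a weighted path."""
    return x + w * x


def numeric_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Analytically, dy/dx = 1 (skip path) + w (weighted path).
grad_zero_weight = numeric_grad(lambda x: residual_block(x, 0.0), 2.0)
grad_half_weight = numeric_grad(lambda x: residual_block(x, 0.5), 2.0)
print(grad_zero_weight, grad_half_weight)
```

The two contributions simply add up, so there is no case in which one path is "skipped" during backpropagation.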
I'm having some trouble understanding the purpose of causal convolutions. Suppose I'm doing time-series classification using a convolutional network with 1 layer and kernel=2, stride=1, dilation=0. Isn't it the same thing as shifting my output back by 1?
For larger networks, it would be a little more involved to take into account the parameters of all the layers to get the resulting receptive field to do a proper output shift. To me it seems, if there is some leak, you could always account for the leak by shifting the output back.
For example, if at time step $t_2$, a non-causal CNN sees $x_0, x_1, x_2, x_3, x_4$, then you'd use the target associated with $t_4$, i.e. $y_4$
Edit: I've seen diagrams for causal CNNs where all the arrows are right-aligned. I get that it's meant to illustrate that $y_t$ aligns to $x_t$, but couldn't you just as easily draw them like this:
[Figure: non-causal CNN drawn right-aligned]
The point of causal convolutions is to avoid seeing 'future' data. This is important in real-time sequential analysis because new information is not available before it happens, although it typically is during training (because we have the whole training sequence). Therefore, a causal convolution starts at t-(k-1) and ends at t (where t = current timestep and k = kernel size), whereas a typical convolution starts at t-k//2 and ends at t+k//2. This can be imagined as a one-sided kernel: instead of the target pixel/sample being in the centre of the kernel, it is the rightmost element (reading left to right).
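The difference can be sketched in plain Python. The helper names conv1d_causal and conv1d_centered are made up for this illustration, and zero padding is an assumption. Changing a future sample leaves the causal output at the present step untouched, while the centred output changes.

```python
def conv1d_causal(x, kernel):
    """Output at t uses only x[t-k+1 .. t]; missing past values are zero-padded."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def conv1d_centered(x, kernel):
    """Output at t uses x[t-k//2 .. t+k//2] (odd k), zero-padded on both sides."""
    k = len(kernel)
    half = k // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_future_changed = [1.0, 2.0, 3.0, 4.0, 50.0]  # only the last sample differs
kernel = [1.0, 1.0, 1.0]

causal_a = conv1d_causal(x, kernel)
causal_b = conv1d_causal(x_future_changed, kernel)
centered_a = conv1d_centered(x, kernel)
centered_b = conv1d_centered(x_future_changed, kernel)

# Index 3 is "the present"; index 4 is the future sample that was changed.
print(causal_a[3], causal_b[3])      # identical: no leak from the future
print(centered_a[3], centered_b[3])  # different: the centred kernel saw x[4]
```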
Using your example, if the top orange dot in the following picture is $t_n$, then $t_n$ has a receptive field stemming from $t_{n-4}$ to $t_n$ due to it having a kernel size of 2 and 4 layers.
Compare that to a non-causal convolution (ignore the dilated convolution on the right), where the receptive field stems from $t_{n-3}$ to $t_{n+3}$ due to it being a double-sided kernel:
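The receptive-field sizes quoted above can be checked with a short calculation for stacked stride-1 convolutions. The function name receptive_field is illustrative. The picture is not available here, so the non-causal stack is assumed to be three layers with kernel size 3.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


# Causal: 4 layers, kernel size 2 -> 5 steps, i.e. t_{n-4} .. t_n
causal_rf = receptive_field([2, 2, 2, 2])

# Non-causal (assumed): 3 layers, kernel size 3 -> 7 steps, i.e. t_{n-3} .. t_{n+3}
noncausal_rf = receptive_field([3, 3, 3])

# Dilated causal stack with dilations 1, 2, 4, 8 and kernel size 2
dilated_rf = receptive_field([2, 2, 2, 2], dilations=[1, 2, 4, 8])

print(causal_rf, noncausal_rf, dilated_rf)
```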
To implement a CNN model for image classification we need to use the sigmoid and ReLU functions, but I am confused about what they are used for.
If you are working with a conventional CNN for image classification, the output layer has N neurons, where N is the number of image classes you want to identify. You want each output neuron to represent the probability that you have observed each image class. The sigmoid function is good for representing a probability. Its domain is all real numbers, but its range is 0 to 1.
For network layers that are not output layers, you could also use the sigmoid. In theory, any non-linear transfer function will work in the inner layers of a neural network. However, there are practical reasons not to use the sigmoid. Some of those reasons are:
Sigmoid requires a fair amount of computation.
The slope of the sigmoid function is very shallow when the input is far from zero, which slows gradient descent learning down.
Modern neural networks have many layers, and if you have several layers in a neural network with sigmoid functions between them, it's quite possible to end up with gradients that are effectively zero, so the early layers stop learning.
The ReLU function solves many of sigmoid's problems. It is easy and fast to compute. Whenever the input is positive, ReLU has a slope of 1, which provides a strong gradient to descend. ReLU is not limited to the range 0-1, though, so if you used it in your output layer, it would not be guaranteed to be able to represent a probability.
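A rough sketch of the gradient argument, ignoring the weights entirely (a simplifying assumption) and looking only at the activation slopes multiplied along a chain of 10 layers. The helper names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 for x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


depth = 10
# Best case for sigmoid: every pre-activation is 0, where the slope is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# ReLU with positive pre-activations passes the gradient through unchanged.
relu_chain = relu_grad(1.0) ** depth

print(sigmoid_chain, relu_chain)
```

Even in the sigmoid's best case the product shrinks by a factor of 4 per layer, whereas the ReLU chain stays at 1 as long as the units are active.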
This may be a very basic/trivial question.
For Negative Inputs,
Output of ReLu Activation Function is Zero
Output of Sigmoid Activation Function is Zero
Output of Tanh Activation Function is -1
My questions are:
Why do all of the above activation functions saturate for negative input values?
Is there any activation function we can use if we want to predict a negative target value?
Thank you.
True - ReLU is designed to result in zero for negative values. (It can be dangerous with big learning rates, bad initialization or very few units: all neurons can get stuck at zero and the model freezes.)
False - Sigmoid results in zero for "very negative" inputs, not for "negative" inputs. If your inputs are between -3 and +3, you will see a very pleasant result between 0 and 1.
False - The same comment as Sigmoid. If your inputs are between -2 and 2, you will see nice results between -1 and 1.
So, the saturation problem only exists for inputs whose absolute values are too big.
By definition, the outputs are:
ReLU: 0 <= y < inf (not centred; minimum at 0)
Sigmoid: 0 < y < 1 (with center in 0.5)
TanH: -1 < y < 1 (with center in 0)
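These ranges are easy to check numerically. The sketch below uses only the standard library and prints the three activations for a few inputs; the sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


for x in (-10.0, -3.0, -2.0, 0.0, 2.0, 3.0, 10.0):
    print(f"x={x:+6.1f}  relu={relu(x):5.1f}  "
          f"sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):+.4f}")
```

Moderately negative inputs such as -3 still give a sigmoid output clearly above zero; only inputs of large magnitude push sigmoid and tanh into their flat regions.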
You might want to use a BatchNormalization layer before these activations to avoid having big values and avoid saturation.
For predicting negative outputs, tanh is the only one of the three that is capable of doing that.
You could invent a negative sigmoid, though, it's pretty easy:
import keras
from keras.layers import Activation

def neg_sigmoid(x):
    return -keras.backend.sigmoid(x)  # negated sigmoid: output lies in (-1, 0)

# use it as a layer:
Activation(neg_sigmoid)
In short, negative/positive doesn't matter for these activation functions.
Sigmoid and tanh both saturate for positive and negative values. As stated in the comments, they are symmetrical around input 0. ReLU only saturates for negative values, but I'll explain why that doesn't matter in the answer to the next question.
The answer is that an activation function doesn't need to 'predict' a negative value. The point of the activation function is not to give an equation that predicts your final value, but to give non-linearity to the middle layers of your neural network. You then use some appropriate function at the last layer to get the wanted output values, e.g. softmax for classification or just a linear output for regression.
So because these activation functions are in the middle, it really doesn't matter if the activation function only outputs positive values even when your 'wanted' values are negative, since the model will make the weights of the next layers negative. (Hence the phrase 'wanted values are negative' doesn't mean anything at this point.)
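A tiny illustration of that point, with hand-picked (not trained) weights and an illustrative function name: the hidden ReLU unit only ever outputs non-negative values, yet the linear output layer produces a negative prediction because its weight is negative.

```python
def relu(x):
    return max(0.0, x)


def tiny_network(x, w_hidden, w_out, b_out):
    """One ReLU hidden unit followed by a linear output unit."""
    hidden = relu(w_hidden * x)    # always >= 0
    return w_out * hidden + b_out  # linear output can be any real number


prediction = tiny_network(x=2.0, w_hidden=1.5, w_out=-2.0, b_out=0.5)
print(prediction)
```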
So, ReLU being saturated on the negative side is no different from it being saturated on the positive side. There are activation functions that don't saturate, such as leaky ReLU, so you may want to check those out. But the point stands: positive/negative doesn't matter for activation functions.
The key idea behind introducing the ReLU activation function was to address the issue of vanishing gradients in deeper networks. However, depending on the initialization, when the weights grow above 1 the gradient values can explode. Another key idea behind ReLU was to introduce sparsity into the network: put simply, it zeroes out the units whose inputs are negative, which in effect prunes the connections deemed unimportant. We therefore have to be careful about the distribution of the weights we initialize, or the network can end up too sparse and unable to learn more information.
Sigmoid - The key problem with sigmoid for gradient-based learning rules is that its derivative goes to 0 for inputs of very large magnitude, which causes vanishing gradients. This is not specific to negative values: the sigmoid saturates for large positive inputs just as it does for large negative ones.
Tanh - The idea behind tanh is to avoid the sparsity enforced by ReLU and, like the sigmoid, to use richer network dynamics for learning. Put simply, tanh tries to use the entire network's capacity to learn. It mitigates the vanishing gradient problem to some extent (its derivative peaks at 1, versus 0.25 for the sigmoid), although unlike ReLU it still saturates for inputs of large magnitude. Having negative outputs in the network acts as a kind of dynamic regularizer (strongly negative values are pulled towards -1 and values near 0 stay near 0), which is useful for binary classification or classification problems with few classes.
This link has some good information that would be helpful for you.
In An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, the authors state that TCN networks, a specific type of 1D CNNs applied to sequential data, "can also take in inputs of arbitrary lengths by sliding the 1D convolutional kernels", just like Recurrent Nets. I am asking myself how this can be done.
For an RNN, it is straight-forward that the same function would be applied as often as is the input length. However, for CNNs (or any feed-forward NN in general), one must prespecify the number of input neurons. So the only way I can see TCNs dealing with arbitrary length inputs is by specifying a fixed length input neuron space and then adding zero padding to the arbitrary length inputs.
Am I correct in my understanding?
If you have a fully convolutional neural network, there is no reason to have a fully specified input shape. You definitely need a fixed rank, and the last dimension probably should be the same but otherwise, you can definitely specify an input shape which in tensorflow would look like Input((None, 10)) in the case of 1D-CNNs.
Indeed the shape of the convolution kernel doesn't depend on the length of the input in the temporal dimension (it can depend on the last dimension though typically in convolutional neural networks), and you can apply it to any input with the same rank (and same last dimension).
For example let's say that you are applying only a single 1D convolution, with a kernel that's doing the sum of 2 neighbouring elements (kernel = (1, 1)). This operation could be applied to any input length given it's always 1D.
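The example can be written out in a few lines of plain Python. The helper name conv1d_valid is illustrative, and no padding is assumed. The same two-element kernel works on inputs of any length; only the output length changes.

```python
def conv1d_valid(x, kernel):
    """Slide `kernel` over `x` with stride 1 and no padding."""
    k = len(kernel)
    return [sum(kernel[j] * x[t + j] for j in range(k))
            for t in range(len(x) - k + 1)]


kernel = [1, 1]  # sums each pair of neighbouring elements

short_out = conv1d_valid([1, 2, 3], kernel)
long_out = conv1d_valid([1, 2, 3, 4, 5, 6], kernel)

print(short_out)  # [3, 5]
print(long_out)   # [3, 5, 7, 9, 11]
```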
However, when being confronted with a sequence-to-label task and requiring further operations in the stack such as a fully-connected layer, the inputs must be of fixed length (or must be made so through zero padding).
In Keras, if you want to add an LSTM layer with 10 units, you use model.add(LSTM(10)). I've heard that number 10 referred to as the number of hidden units here and as the number of output units (line 863 of the Keras code here).
My question is, are those two things the same? Is the dimensionality of the output the same as the number of hidden units? I've read a few tutorials (like this one and this one), but none of them state this explicitly.
The answers seem to refer to multi-layer perceptrons (MLPs), in which the hidden layer can be of a different size and often is. For LSTMs, the hidden dimension is the same as the output dimension by construction:
Here h is the output for a given timestep, and the cell state c is bound to the hidden size by the element-wise multiplications. The addition of terms to compute the gates requires that both the input kernel W and the recurrent kernel U map to the same dimension. This is certainly the case for the Keras LSTM as well, and is why you only provide a single units argument.
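For reference, the standard LSTM update can be written as follows. This is the usual textbook formulation, reconstructed here rather than copied from the answer. $W$ and $U$ are the input and recurrent kernels; the per-gate subscripts, the biases $b$, and $\odot$ (element-wise product) are notation assumed for illustration.

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

The gates, the cell state $c_t$ and the output $h_t$ are all combined element-wise, so they must all have the same length, and that length is the hidden size.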
To get a good intuition for why this makes sense, remember that the LSTM's job is to encode a sequence into a vector (maybe a gross oversimplification, but it's all we need). The size of that vector is specified by hidden_units, and the output is:
  seq vector          RNN weights
(1 X input_dim) * (input_dim X hidden_units),
which has shape 1 X hidden_units (a row vector representing the encoding of your input sequence). Thus the names are used synonymously in this case.
Of course, RNNs require more than one multiplication, and Keras implements RNNs as a sequence of matrix-matrix multiplications instead of the vector-matrix multiplication shown above.
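The shape bookkeeping can be sketched without any framework. The function names below are illustrative. The 4 * units layout assumes the four gate kernels (input, forget, cell, output) are stacked side by side, which is how Keras stores them, and the count assumes one bias vector per gate.

```python
def lstm_shapes(input_dim, units):
    """Shapes of the stacked gate weights and of the per-timestep output."""
    return {
        "kernel_W": (input_dim, 4 * units),
        "recurrent_kernel_U": (units, 4 * units),
        "bias": (4 * units,),
        "output_h": (1, units),
    }


def lstm_param_count(input_dim, units):
    """Trainable parameters: 4 gates, each with W, U and a bias."""
    return 4 * (input_dim * units + units * units + units)


shapes = lstm_shapes(input_dim=8, units=10)
n_params = lstm_param_count(input_dim=8, units=10)
print(shapes["output_h"], n_params)
```

Only units appears in the output shape; input_dim affects the kernel size but not the size of the encoded vector.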
The number of hidden units is not the same as the number of output units.
The number 10 controls the dimension of the output hidden state (the source code for the LSTM constructor method can be found here; 10 specifies the units argument). In one of the tutorials you have linked to (colah's blog), the units argument would control the dimension of the vectors $h_{t-1}$, $h_t$, and $h_{t+1}$: RNN image.
If you want to control the number of LSTM blocks in your network, you need to specify this through the input to the LSTM layer. The input shape to the layer is (nb_samples, timesteps, input_dim) (see the Keras documentation). timesteps controls how many LSTM blocks your network contains. Referring to the tutorial on colah's blog again, in RNN image, timesteps would control how many green blocks the network contains.
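To make the roles of timesteps and units concrete, here is a sketch that unrolls a plain tanh recurrent cell over a sequence. This is a simplification and not a full LSTM; the function name and the weight values are made up for illustration. The cell is applied once per timestep, while the size of the resulting hidden state depends only on units.

```python
import math


def simple_rnn_last_state(sequence, W, U):
    """Unroll a plain tanh RNN cell over `sequence`; return the last hidden state.

    sequence: list of timesteps, each a list of input_dim floats
    W: input_dim x units input kernel, U: units x units recurrent kernel
    """
    units = len(U)
    h = [0.0] * units
    for x_t in sequence:  # one cell application ("block") per timestep
        h = [
            math.tanh(
                sum(x_t[i] * W[i][j] for i in range(len(x_t)))
                + sum(h[k] * U[k][j] for k in range(units))
            )
            for j in range(units)
        ]
    return h


input_dim, units = 2, 3
W = [[0.1] * units for _ in range(input_dim)]
U = [[0.05] * units for _ in range(units)]

short_seq = [[1.0, 2.0]] * 4   # timesteps = 4
long_seq = [[1.0, 2.0]] * 20   # timesteps = 20

h_short = simple_rnn_last_state(short_seq, W, U)
h_long = simple_rnn_last_state(long_seq, W, U)
print(len(h_short), len(h_long))
```

Both calls return a vector of length units, even though the cell was applied 4 times in one case and 20 times in the other.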