What is the meaning of the two Dense layers in this code?
self.model.add(Flatten())
self.model.add(Dense(512))
self.model.add(Activation('relu'))
self.model.add(Dropout(0.5))
self.model.add(Dense(10))
self.model.add(Activation('softmax'))
self.model.summary()
Dense is the only layer in that model that has trainable weights; Flatten, Activation and Dropout only reshape or transform their input.
A Dense layer feeds all outputs from the previous layer to all its neurons, each neuron providing one output to the next layer.
It's the most basic layer in neural networks.
A Dense(10) has ten neurons. A Dense(512) has 512 neurons.
Furthermore, a dense layer applies a non-linear transform:
f(W.X + b)
As to the effect: when W is a 2D tensor (the weight matrix) and X is an input vector, W.X + b is a vector and f is an element-wise non-linearity such as tanh, so the result is just a vector with one entry per neuron.
From the keras docs:
Dense implements the operation: output = activation(dot(input, kernel) + bias)
where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created
by the layer, and bias is a bias vector created by the layer (only
applicable if use_bias is True).
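The formula from the docs can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, not how Keras implements it; the function name dense_forward, the batch size of 4 and the tanh activation are assumptions chosen for the example.

```python
import numpy as np

def dense_forward(x, kernel, bias=None, activation=np.tanh):
    """Compute activation(dot(x, kernel) + bias) for a batch of inputs."""
    z = np.dot(x, kernel)      # (batch, input_dim) @ (input_dim, units)
    if bias is not None:
        z = z + bias           # bias broadcasts over the batch axis
    return activation(z)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))        # 4 samples with 512 features each
kernel = rng.normal(size=(512, 10))  # one column of weights per neuron
bias = np.zeros(10)                  # one bias value per neuron

out = dense_forward(x, kernel, bias)
print(out.shape)  # (4, 10): one output per neuron, per sample
```

So a Dense(10) fed by a Dense(512) holds a 512 x 10 kernel plus 10 biases.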
Related
In a convolutional layer with n neurons, trained on inputs of dimension h x w x c (height x width x channels), with c usually being 3 (RGB), one trains n x c kernels of size k x k (and n bias values). So for each neuron i in the layer and each channel j in the input, we have a weight matrix of size k x k, which we call weights_ij. The output of each neuron i = 1, ..., n (for input X) is as follows:
out_i = sigma ( tmp_i + bias_i)
with tmp_i = sum_{j=1,...,c} conv(X_j, weights_ij), where X_j is channel j of X.
The output is then h_new x w_new x n. So basically the depth of the output coincides with the number of neurons in the first layer. h_new and w_new depend on padding and stride in the convolution.
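The computation described above can be sketched in NumPy as follows. It assumes no padding ("valid") and stride 1, and it slides the kernel without flipping it (cross-correlation), which is what Keras Conv2D computes. The function name conv_layer, the tanh activation and the example sizes are assumptions for illustration; the weight layout (k, k, c, n) follows the Keras kernel layout.

```python
import numpy as np

def conv_layer(X, weights, bias, sigma=np.tanh):
    """X: (h, w, c), weights: (k, k, c, n), bias: (n,). Valid padding, stride 1."""
    h, w, c = X.shape
    k, _, _, n = weights.shape
    h_new, w_new = h - k + 1, w - k + 1
    out = np.zeros((h_new, w_new, n))
    for i in range(n):                      # one output channel per neuron
        for r in range(h_new):
            for s in range(w_new):
                patch = X[r:r + k, s:s + k, :]  # (k, k, c) window of the input
                # summing over k, k and c is the sum over channels j of tmp_i
                out[r, s, i] = np.sum(patch * weights[:, :, :, i]) + bias[i]
    return sigma(out)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8, 3))           # h = w = 8, c = 3
weights = rng.normal(size=(3, 3, 3, 5))  # k = 3, n = 5
bias = rng.normal(size=5)

out = conv_layer(X, weights, bias)
print(out.shape)  # (6, 6, 5): h_new x w_new x n
```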
This makes sense to me, and I also checked it by coding the convolution and the summation myself and comparing the result with that of a keras model consisting of only this one layer. Now my actual question: when we add a second convolutional layer, my understanding was that the output from the first layer is now a "picture" with n channels, and we do exactly the same as before but with c=n (and a new number n2 of neurons in our 2nd layer).
But I also coded that and compared it with the prediction of a keras model with 2 convolutional layers and now the result is not the same. So does anyone know how the 2nd convolutional layer treats the output of the first?
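Under that understanding (second layer sees c = n input channels), the number of trainable parameters per layer can be checked with a small helper. The helper name conv2d_param_count and the sizes k = 3, n = 16, n2 = 32 are hypothetical values chosen for illustration.

```python
def conv2d_param_count(k, c_in, n_filters, use_bias=True):
    """Trainable parameters of a conv layer with k x k kernels."""
    kernels = k * k * c_in * n_filters  # one k x k matrix per (channel, neuron) pair
    biases = n_filters if use_bias else 0
    return kernels + biases

# First layer: RGB input (c = 3), n = 16 neurons.
first = conv2d_param_count(k=3, c_in=3, n_filters=16)
# Second layer: its input has c = n = 16 channels, n2 = 32 neurons.
second = conv2d_param_count(k=3, c_in=16, n_filters=32)
print(first, second)  # 448 4640
```

These counts can be compared against the "Param #" column of model.summary().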
Ok, I solved my problem.
Actually the problem was already present for just one layer and by stacking 2 layers the errors accumulated.
I thought that when using stride=2 in the convolutional layer, one applies the convolution to the sections [0:N_k,0:N_k], [2:2+N_k,2:2+N_k], [4:4+N_k,4:4+N_k],... of the input, but keras actually applies the convolution to [1:1+N_k,1:1+N_k], [3:3+N_k,3:3+N_k],...
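Where the windows start depends on the padding mode as well as the stride. The sketch below computes the window start indices along one axis, assuming TensorFlow's documented rule for padding="same" (output size ceil(size / stride), with any odd padding amount placed at the end). The post does not say which padding was used, so whether this accounts for the offset described above is an assumption; the function name window_starts is hypothetical.

```python
import math

def window_starts(size, k, stride, padding="valid"):
    """Start index of each window along one axis, in input coordinates."""
    if padding == "valid":
        out = (size - k) // stride + 1
        pad_before = 0
    else:  # "same"
        out = math.ceil(size / stride)
        pad_total = max((out - 1) * stride + k - size, 0)
        pad_before = pad_total // 2   # the extra padding cell, if any, goes at the end
    # a negative start means the window begins inside the zero padding
    return [i * stride - pad_before for i in range(out)]

print(window_starts(7, 3, 2, "valid"))  # [0, 2, 4]
print(window_starts(7, 3, 2, "same"))   # [-1, 1, 3, 5]
print(window_starts(8, 3, 2, "same"))   # [0, 2, 4, 6]
```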
My question
I'm using Keras to build a convolutional neural network. I ran across the following:
model = tf.keras.Sequential()
model.add(layers.Dense(10*10*256, use_bias=False, input_shape=(100,)))
I'm curious - what exactly mathematically is going on here?
My best guess
My guess is that for input of size [100,N], the network will be evaluated N times, once for each training example. The Dense layer created by layers.Dense contains (10*10*256) * (100) parameters that will be updated during backpropagation.
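The parameter count in that guess can be checked with plain NumPy. One detail to keep in mind: Keras puts the batch on the first axis, so N examples form an array of shape (N, 100), and the whole batch goes through a single matrix product. The batch size N = 3 below is an arbitrary value for illustration.

```python
import numpy as np

input_dim = 100
units = 10 * 10 * 256          # 25600 outputs per example

# With use_bias=False the layer holds only the kernel.
kernel = np.zeros((input_dim, units), dtype=np.float32)
print(kernel.size)             # 2560000 trainable parameters

N = 3                          # arbitrary batch size for illustration
x = np.ones((N, input_dim), dtype=np.float32)
out = x @ kernel               # (N, 100) @ (100, 25600)
print(out.shape)               # (3, 25600)
```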
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Note: If the input to the layer has a rank greater than 2, then it is
flattened prior to the initial dot product with kernel.
Example:
# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)
# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(Dense(32))
Arguments:
> units: Positive integer, dimensionality of the output space.
> activation: Activation function to use. If you don't specify anything,
> no activation is applied (i.e. "linear" activation: a(x) = x).
> use_bias: Boolean, whether the layer uses a bias vector.
> kernel_initializer: Initializer for the kernel weights matrix.
> bias_initializer: Initializer for the bias vector.
> kernel_regularizer: Regularizer function applied to the kernel weights matrix.
> bias_regularizer: Regularizer function applied to the bias vector.
> activity_regularizer: Regularizer function applied to the output of the layer (its "activation").
> kernel_constraint: Constraint function applied to the kernel weights matrix.
> bias_constraint: Constraint function applied to the bias vector.
Input shape:
N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).
Output shape:
N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).
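The input and output shapes above can be reproduced with a NumPy matrix product, assuming the kernel is applied along the last axis as those shapes imply. The sizes 8, 5, 16 and 32 are arbitrary values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=(16, 32))   # (input_dim, units)
bias = np.zeros(32)

x2d = rng.normal(size=(8, 16))       # (batch_size, input_dim)
x3d = rng.normal(size=(8, 5, 16))    # (batch_size, ..., input_dim)

out2d = x2d @ kernel + bias          # the common 2D case
out3d = x3d @ kernel + bias          # same kernel applied at every inner position

print(out2d.shape)  # (8, 32)
print(out3d.shape)  # (8, 5, 32)
```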
The previous layer is an embedding of size (V classes, K output dim). I want to introduce a weight matrix of size K x T. The weights will be trainable (as will the embeddings). Together they generate a V x T matrix that will be used downstream.
1) How might I go about this?
2) Will this mess with the gradients?
It's basically a vector x matrix product.
Example: embedding vocab = 10, dim K = 4. So for a particular member of the vocabulary, my embedding weights are a vector of size (1,4) (think row vector).
Each row vector I want to multiply by a weight matrix of size 4x10, yielding a 1 x 10 vector (or layer). The weight matrix is common to all members of the vocabulary.
This 1 x 10 vector will be input for the next layer.
What you want is a Dense layer, just without a bias. A Dense layer internally has a matrix that is common for all inputs, it does not vary with the input.
So this can be implemented as:
x = Dense(10, use_bias=False)(some_input_tensor)
No activation function is needed since you just want the matrix multiplication.
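In NumPy terms, the bias-free Dense layer is just the following matrix product. The sizes follow the example in the question (vocab = 10, K = 4, T = 10); the random values are stand-ins for the trainable weights.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, T = 10, 4, 10

embeddings = rng.normal(size=(V, K))  # one row vector per vocabulary member
W = rng.normal(size=(K, T))           # shared kernel of Dense(T, use_bias=False)

row = embeddings[3:4] @ W             # a single (1, 4) row gives a (1, 10) vector
projected = embeddings @ W            # all rows at once give the V x T matrix

print(row.shape, projected.shape)     # (1, 10) (10, 10)
```

Because the product is differentiable in both factors, gradients reach the embeddings as well as W.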
I intend to feed the outputs of all timesteps from an LSTM to a fully-connected layer. However, the following code fails. How can I reduce the 3D output of the LSTM to 2D by concatenating the output of each timestep?
X = LSTM(units=128,return_sequences=True)(input_sequence)
X = Dropout(rate=0.5)(X)
X = LSTM(units=128,return_sequences=True)(X)
X = Dropout(rate=0.5)(X)
X = Concatenate()(X)
X = Dense(n_class)(X)
X = Activation('softmax')(X)
You can use the Flatten layer to flatten the 3D output of LSTM layer to a 2D shape.
As a side note, it is better to use dropout and recurrent_dropout arguments of LSTM layer instead of using Dropout layer directly with recurrent layers.
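What Flatten does to the LSTM output can be pictured as a reshape that keeps the batch axis and concatenates the timesteps. The sketch below uses NumPy only; the batch size 2 and the 7 timesteps are arbitrary values for illustration, and it assumes the number of timesteps is fixed, since the following Dense layer needs a known input size.

```python
import numpy as np

batch, timesteps, units = 2, 7, 128
# stand-in for the (batch, timesteps, units) output of LSTM(..., return_sequences=True)
lstm_out = np.arange(batch * timesteps * units, dtype=np.float32)
lstm_out = lstm_out.reshape(batch, timesteps, units)

flat = lstm_out.reshape(batch, -1)   # timesteps are laid end to end per sample
print(flat.shape)                    # (2, 896)
```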
In addition to @today's answer:
It seems like you want to use return_sequences just to concatenate the outputs into a dense layer. If you have not already tried it with return_sequences=False, I would recommend doing so. The main purpose of return_sequences is to stack LSTMs or to make seq2seq predictions. In your case it should be enough to just use the LSTM's final output.
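For comparison, with return_sequences=False the LSTM returns only the output of the last timestep, which corresponds to slicing the full sequence output as below. This is a NumPy illustration with random stand-in values, not a Keras call.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, timesteps, units = 2, 7, 128
seq_out = rng.normal(size=(batch, timesteps, units))  # return_sequences=True shape

last = seq_out[:, -1, :]   # what return_sequences=False would hand to Dense
print(last.shape)          # (2, 128)
```

The result is already 2D, so it can go straight into Dense(n_class).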
I am implementing a custom loss function in keras. The model is an autoencoder. The first layer is an Embedding layer, which embeds an input of size (batch_size, sentence_length) into (batch_size, sentence_length, embedding_dimension). Then the model compresses the embedding into a vector of a certain dimension, and finally must reconstruct the embedding (batch_size, sentence_length, embedding_dimension).
But the embedding layer is trainable, and the loss must use the weights of the embedding layer (I have to sum over all word embeddings of my vocabulary).
For example, suppose I want to train on the toy example "the cat". The sentence_length is 2; suppose embedding_dimension is 10 and the vocabulary size is 50, so the embedding matrix has shape (50,10). The Embedding layer's output X has shape (1,2,10). It then passes through the model, and the output X_hat also has shape (1,2,10). The model must be trained to maximize the probability that the vector X_hat[0] representing 'the' is the most similar to the vector X[0] representing 'the' in the Embedding layer, and likewise for 'cat'. But the loss is such that I have to compute the cosine similarity between X and X_hat, normalized by the sum of the cosine similarities between X_hat and every embedding in the embedding matrix (50 of them, since the vocabulary size is 50), which are the rows of the weights of the embedding layer.
But how can I access the weights of the embedding layer at each iteration of the training process?
Thank you!
It looks a bit crazy, but it seems to work: instead of creating a custom loss function that I would pass to model.compile, the network computes the loss (Eq. 1 from arxiv.org/pdf/1708.04729.pdf) in a function that I call with Lambda:
loss = Lambda(lambda x: similarity(x[0], x[1], x[2]))([X_hat, X, embedding_matrix])
The network then has two outputs, X_hat and loss, but I give X_hat a weight of 0 and loss all of the weight:
model = Model(input_sequence, [X_hat, loss])
model.compile(loss=mean_squared_error,
              optimizer=optimizer,
              loss_weights=[0., 1.])
When I train the model:
for i in range(epochs):
    for j in range(num_data):
        input_embedding = model.layers[1].get_weights()[0][[data[j:j+1]]]
        y = [input_embedding, 0] #The embedding of the input
        model.fit(data[j:j+1], y, batch_size=1, ...)
That way, the model is trained to drive loss toward 0, and when I want to use the trained model's prediction I use the first output, which is the reconstruction X_hat.
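The similarity function itself is not shown above. As a rough NumPy sketch of the arithmetic it might perform for a single sentence: cosine similarity between each reconstructed vector and its true embedding, normalized over the similarities to every row of the embedding matrix. The exponential (softmax) form, the temperature tau, the helper unit and the dropped batch axis are all assumptions made here for illustration; inside a Keras Lambda the same steps would have to be written with backend tensor operations.

```python
import numpy as np

def unit(v, eps=1e-8):
    """Scale each row of v to (approximately) unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def similarity(X_hat, X, embedding_matrix, tau=1.0):
    """Mean negative log-probability that each X_hat row matches its own X row."""
    true_sim = np.sum(unit(X_hat) * unit(X), axis=-1) / tau   # (T,)
    all_sim = unit(X_hat) @ unit(embedding_matrix).T / tau    # (T, V)
    m = all_sim.max(axis=-1, keepdims=True)                   # for numerical stability
    log_norm = m[:, 0] + np.log(np.exp(all_sim - m).sum(axis=-1))
    return float(np.mean(log_norm - true_sim))

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(50, 10))  # vocabulary of 50, dimension 10
X = embedding_matrix[[3, 7]]                  # hypothetical token ids for "the cat"
X_hat = X.copy()                              # a perfect reconstruction

loss = similarity(X_hat, X, embedding_matrix)
print(loss)
```

With a perfect reconstruction the value stays above 0, because the other 49 embeddings still contribute to the normalizer.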