What is an Embedding in Keras? - keras

Keras documentation isn't clear on what this actually is. I understand we can use this to compress the input feature space into a smaller one. But how is this done from a neural design perspective? Is it an autoencoder, an RBM?

As far as I know, the Embedding layer behaves like a simple matrix multiplication that transforms words into their corresponding word embeddings.
The weights of the Embedding layer have the shape (vocabulary_size, embedding_dimension). For each training sample, the input is a sequence of integers, each representing a certain word. The integers are in the range of the vocabulary size. The Embedding layer transforms each integer i into the ith row of the embedding weights matrix.
To view this as a matrix multiplication, think of the input integers not as a list of integers but as a one-hot matrix. The input shape is then (nb_words, vocabulary_size) with one non-zero value per row. If you multiply this by the embedding weights, you get the output in the shape
(nb_words, vocab_size) x (vocab_size, embedding_dim) = (nb_words, embedding_dim)
So with a simple matrix multiplication you transform all the words in a sample into the corresponding word embeddings.

The Keras Embedding layer is not performing any matrix multiplication but it only:
1. creates a weight matrix of (vocabulary_size)x(embedding_dimension) dimensions
2. indexes this weight matrix
It is always useful to have a look at the source code to understand what a class does. In this case, we will have a look at the class Embedding which inherits from the base layer class called Layer.
(1) - Creating a weight matrix of (vocabulary_size)x(embedding_dimension) dimensions:
This happens in the build function of Embedding:
def build(self, input_shape):
    self.embeddings = self.add_weight(
        shape=(self.input_dim, self.output_dim),
        initializer=self.embeddings_initializer,
        name='embeddings',
        regularizer=self.embeddings_regularizer,
        constraint=self.embeddings_constraint,
        dtype=self.dtype)
    self.built = True
If you have a look at the base class Layer you will see that the function add_weight above simply creates a matrix of trainable weights (in this case of (vocabulary_size)x(embedding_dimension) dimensions):
def add_weight(self,
               name,
               shape,
               dtype=None,
               initializer=None,
               regularizer=None,
               trainable=True,
               constraint=None):
    """Adds a weight variable to the layer.
    # Arguments
        name: String, the name for the weight variable.
        shape: The shape tuple of the weight.
        dtype: The dtype of the weight.
        initializer: An Initializer instance (callable).
        regularizer: An optional Regularizer instance.
        trainable: A boolean, whether the weight should
            be trained via backprop or not (assuming
            that the layer itself is also trainable).
        constraint: An optional Constraint instance.
    # Returns
        The created weight variable.
    """
    initializer = initializers.get(initializer)
    if dtype is None:
        dtype = K.floatx()
    weight = K.variable(initializer(shape),
                        dtype=dtype,
                        name=name,
                        constraint=constraint)
    if regularizer is not None:
        with K.name_scope('weight_regularizer'):
            self.add_loss(regularizer(weight))
    if trainable:
        self._trainable_weights.append(weight)
    else:
        self._non_trainable_weights.append(weight)
    return weight
(2) - Indexing this weight matrix
This happens in the call function of Embedding:
def call(self, inputs):
    if K.dtype(inputs) != 'int32':
        inputs = K.cast(inputs, 'int32')
    out = K.gather(self.embeddings, inputs)
    return out
This function returns the output of the Embedding layer, which is K.gather(self.embeddings, inputs). What tf.keras.backend.gather does is index the weights matrix self.embeddings (see the build function above) according to the inputs, which should be lists of non-negative integers.
These lists can be retrieved for example if you pass your text/words inputs to the one_hot function of Keras which encodes a text into a list of word indexes of size n (this is NOT one hot encoding - see also this example for more info: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/).
That's all. There is no matrix multiplication.
On the contrary, the Keras Embedding layer is useful precisely because it avoids performing a matrix multiplication and hence saves computational resources.
Otherwise, you could just use a Keras Dense layer with no bias term (after you have one-hot encoded your input data) to get a matrix of trainable weights (of (vocabulary_size)x(embedding_dimension) dimensions) and then simply do the multiplication to get the output, which will be exactly the same as the output of the Embedding layer.
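The equivalence between indexing and multiplying by a one-hot matrix can be checked on a toy example. The sketch below is plain Python rather than Keras; the matrix W, the indices ids and the two helper functions are made up for illustration. It shows that picking rows of the weight matrix gives the same result as the one-hot multiplication, while doing far less arithmetic.

```python
# Toy embedding matrix: vocabulary_size = 4, embedding_dimension = 2.
# The values are arbitrary and only serve the illustration.
W = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]
vocab_size, embedding_dim = len(W), len(W[0])

ids = [2, 0, 3]  # one sample made of three word indices


def embed_by_lookup(weights, indices):
    """Pick row i of the weight matrix for every index i (what a gather does)."""
    return [weights[i] for i in indices]


def embed_by_matmul(weights, indices):
    """One-hot encode the indices, then multiply by the weight matrix."""
    one_hot = [[1.0 if j == i else 0.0 for j in range(vocab_size)] for i in indices]
    return [
        [sum(row[k] * weights[k][d] for k in range(vocab_size)) for d in range(embedding_dim)]
        for row in one_hot
    ]


looked_up = embed_by_lookup(W, ids)
multiplied = embed_by_matmul(W, ids)
print(looked_up)
print(looked_up == multiplied)
```

The lookup touches len(ids) rows, whereas the multiplication performs vocab_size multiply-adds per output value, most of them with zeros.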

In Keras, the Embedding layer is NOT a simple matrix multiplication layer, but a look-up table layer (see call function below or the original definition).
def call(self, inputs):
    if K.dtype(inputs) != 'int32':
        inputs = K.cast(inputs, 'int32')
    out = K.gather(self.embeddings, inputs)
    return out
It maps each known integer n in inputs to a trainable feature vector W[n], whose dimension is the so-called embedded feature length.

In simple terms (from a functional point of view), it acts like a one-hot encoder followed by a fully-connected layer. The layer weights are trainable.

Related

How can I set binary weights values (0,1) or (-1,1) to the layer in Keras?

I would like to ask if I can set the weights initializer in (any) Keras layer to binary values, for example so that the weights of a simple Dense layer are 0 and 1 only. This would be helpful, for instance, in the case of the Conv1D layer to reduce the computational time.
Thank you,
J
Yes, this is possible by creating a custom initializer:
import tensorflow as tf

def binary_weights(shape, dtype=tf.float32):
    """This function generates weights of random 0s and 1s based on the provided shape"""
    # build logits matrix (assumes a 2-D shape such as a Dense kernel):
    logits = tf.fill((shape[0], 2), 0.5)
    # uniformly pick the class.
    return tf.cast(tf.random.categorical(tf.math.log(logits), shape[1]), dtype=dtype)
Then when you specify the layer:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units, kernel_initializer=binary_weights, input_shape=[num_features,]),
    ...
])
To check the generated weights:
print(model.layers[0].get_weights()[0])
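The question also asks about (-1, 1) weights, which the initializer above does not cover. As a framework-free sketch (plain Python; the helper names random_binary_matrix and to_signed are hypothetical), one option is to draw 0/1 values and map them with 2*b - 1. The same mapping could presumably be applied to the tensor a custom initializer returns.

```python
import random


def random_binary_matrix(rows, cols, seed=None):
    """Return a rows x cols matrix whose entries are 0.0 or 1.0 with equal probability."""
    rng = random.Random(seed)
    return [[float(rng.randint(0, 1)) for _ in range(cols)] for _ in range(rows)]


def to_signed(matrix):
    """Map 0 -> -1 and 1 -> +1 via 2*b - 1."""
    return [[2.0 * b - 1.0 for b in row] for row in matrix]


weights01 = random_binary_matrix(3, 4, seed=0)
weights_pm1 = to_signed(weights01)
print(weights01)
print(weights_pm1)
```

Note that an initializer only sets the starting values; the weights stay binary during training only if something else enforces it.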

How to understand hidden_states of the returns in BertModel?(huggingface-transformers)

Returns last_hidden_state (torch.FloatTensor of shape (batch_size,
sequence_length, hidden_size)): Sequence of hidden-states at the
output of the last layer of the model.
pooler_output (torch.FloatTensor: of shape (batch_size, hidden_size)):
Last layer hidden-state of the first token of the sequence
(classification token) further processed by a Linear layer and a Tanh
activation function. The Linear layer weights are trained from the
next sentence prediction (classification) objective during
pre-training.
This output is usually not a good summary of the semantic content of
the input, you’re often better with averaging or pooling the sequence
of hidden-states for the whole input sequence.
hidden_states (tuple(torch.FloatTensor), optional, returned when
config.output_hidden_states=True): Tuple of torch.FloatTensor (one for
the output of the embeddings + one for the output of each layer) of
shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the
initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when
config.output_attentions=True): Tuple of torch.FloatTensor (one for
each layer) of shape (batch_size, num_heads, sequence_length,
sequence_length).
Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
This is from https://huggingface.co/transformers/model_doc/bert.html#bertmodel. Although the description in the document is clear, I still don't understand the hidden_states of returns. There is a tuple, one for the output of the embeddings, and the other for the output of each layer.
Please tell me how to distinguish them, or what they mean. Thanks very much!
hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
For a given token, its input representation is constructed by summing the corresponding token embedding, segment embedding, and position embedding. This input representation is called the initial embedding output which can be found at index 0 of the tuple hidden_states.
This figure explains how the embeddings are calculated.
For a 12-layer model such as BERT-base, the remaining 12 elements in the tuple contain the output of the corresponding hidden layer. E.g. the output of the last hidden layer can be found at index 12, which is the 13th item in the tuple. The dimension of both the initial embedding output and the hidden states is [batch_size, sequence_length, hidden_size]. It would be useful to compare the indexing of hidden_states bottom-up with this image from the BERT paper.
last_hidden_state contains the hidden representations for each token in each sequence of the batch. So the size is (batch_size, seq_len, hidden_size).
You can refer to Difference between CLS hidden state and pooled_output for more clarification.
I found the answer in the length of this tuple: it is (1 + num_layers). The tuple holds the initial embedding output plus the output of each layer, so the output of the last layer is a different entry from the embedding output. :D
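To make the indexing concrete, here is a toy stand-in written in plain Python. It is not the transformers implementation: toy_layer, toy_forward and all numbers are invented for illustration, and the batch dimension is omitted. It only mimics how the tuple is assembled, with the summed embeddings at index 0 followed by one entry per layer.

```python
NUM_LAYERS = 12  # BERT-base has 12 encoder layers


def toy_layer(hidden, layer_index):
    """Stand-in for an encoder layer: adds a layer-specific constant to every value."""
    return [[value + layer_index for value in token] for token in hidden]


def toy_forward(token_emb, segment_emb, position_emb):
    """Return (last_hidden_state, hidden_states) for one sequence of token vectors."""
    # Initial embedding output: per-token sum of the three embeddings.
    # (The real model additionally applies layer normalization and dropout.)
    embedding_output = [
        [t + s + p for t, s, p in zip(tok, seg, pos)]
        for tok, seg, pos in zip(token_emb, segment_emb, position_emb)
    ]
    hidden_states = [embedding_output]  # index 0
    hidden = embedding_output
    for layer_index in range(1, NUM_LAYERS + 1):
        hidden = toy_layer(hidden, layer_index)
        hidden_states.append(hidden)  # index i holds the output of layer i
    return hidden, tuple(hidden_states)


# Two tokens with hidden_size = 3; the numbers are arbitrary.
token_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
segment_emb = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
position_emb = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

last_hidden_state, hidden_states = toy_forward(token_emb, segment_emb, position_emb)
print(len(hidden_states))
print(hidden_states[0])
print(hidden_states[-1] == last_hidden_state)
```

In this sketch, as in the description quoted above, the final tuple entry and last_hidden_state refer to the same values.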

Keras Average Layer with Masking

The Average layer that comes with Keras already has support for masking, however, looking at the source code of Average Layer, it is not clear to me how and if the masking is applied.
I have a list of inputs, each with its own masking (coming from an embedding layer, for example). The average layer I want should take the average of those inputs that have not been masked. In other words, if an input is masked, it should not have any say in the calculated mean. If all the inputs are masked, then the output is masked and passed along to the next layers.
A related question is, the Average Layer that comes with the library only supports merge functions of a list of inputs. Is there library support for merging a tensor along a particular dimension? Is it possible to slice a tensor into a list of inputs to feed into the average layer? If not, how to take the average of tensor along some dimension in presence of masking?
I am inclined towards writing a custom average layer that computes the masks and consumes them in calculating the output, but from the documentation it is not clear how to do so.
Any pointers or code samples are highly appreciated.
If you look at the source code of the Average layer, it actually subclasses the "_Merge" layer. Since the Average layer doesn't override the "compute_mask" function, it inherits the "compute_mask" function of the "_Merge" layer, which is as follows:
def compute_mask(self, inputs, mask=None):
    if mask is None:
        return None
    if not isinstance(mask, list):
        raise ValueError('`mask` should be a list.')
    if not isinstance(inputs, list):
        raise ValueError('`inputs` should be a list.')
    if len(mask) != len(inputs):
        raise ValueError('The lists `inputs` and `mask` '
                         'should have the same length.')
    if all(m is None for m in mask):
        return None
    masks = [array_ops.expand_dims(m, axis=0) for m in mask if m is not None]
    return K.all(K.concatenate(masks, axis=0), axis=0, keepdims=False)
The last 4 lines say: if all the input masks are None, return None. Otherwise, the output mask is computed by concatenating all the masks that are not None and applying an "all" operation. The resulting mask is therefore masked (False) wherever any of the input masks is masked (False), and True (not masked) only where all the input masks are not masked (True).
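The difference between this inherited mask rule and the behaviour the question asks for can be sketched in plain Python. Nothing below is a Keras API: merged_mask restates the element-wise AND of the quoted compute_mask, and masked_average is a hypothetical illustration of averaging only the unmasked inputs, using one scalar per timestep to keep it short.

```python
def merged_mask(masks):
    """Output mask as in the quoted compute_mask: element-wise AND over non-None masks."""
    present = [m for m in masks if m is not None]
    if not present:
        return None
    return [all(flags) for flags in zip(*present)]


def masked_average(inputs, masks):
    """Hypothetical alternative: average only the unmasked inputs at each position.

    Returns (values, mask). A position is masked in the output only when
    every input is masked there; its value is then set to 0.0.
    """
    values, out_mask = [], []
    for position in range(len(inputs[0])):
        kept = [x[position] for x, m in zip(inputs, masks) if m is None or m[position]]
        out_mask.append(bool(kept))
        values.append(sum(kept) / len(kept) if kept else 0.0)
    return values, out_mask


a, mask_a = [1.0, 2.0, 3.0], [True, True, False]
b, mask_b = [3.0, 4.0, 5.0], [True, False, False]

and_mask = merged_mask([mask_a, mask_b])
avg_values, avg_mask = masked_average([a, b], [mask_a, mask_b])
print(and_mask)
print(avg_values, avg_mask)
```

At the second position only one input is unmasked: the AND rule marks the output as masked there, while the masked average keeps the position and uses the single unmasked value.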

Keras ImageDataGenerator sample_weight with data augmentation

I have a question about the use of the sample_weight parameter in the context of data augmentation in Keras with the ImageDataGenerator. Let's say I have a series of simple images with just one class of objects. So, for each image, I will have a corresponding mask with pixels = 0 for the background and 1 for where the object is labeled.
However, this dataset is unbalanced because a significant number of these images are empty, meaning their masks contain only 0.
If I understood correctly, the 'sample_weight' parameter of the flow method of ImageDataGenerator is there to put the focus on the samples of my dataset that I find more interesting, i.e. where my object is present.
My question is: what is the concrete influence of this sample_weight parameter on the training of my model? Does it influence the data augmentation? If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
Here is the part of my code my question refers to:
data_gen_args = dict(rotation_range=90,
                     width_shift_range=0.4,
                     height_shift_range=0.4,
                     zoom_range=0.4,
                     horizontal_flip=True,
                     fill_mode='reflect',
                     rescale=1. / 255,
                     validation_split=0.2,
                     data_format='channels_last'
                     )
image_datagen = ImageDataGenerator(**data_gen_args)
imf = image_datagen.flow(
    x=stacked_images_channel,
    y=stacked_masks_channel,
    batch_size=batch_size,
    shuffle=False,
    seed=seed,
    subset='training',
    sample_weight=sample_weight,
    save_to_dir='traindir',
    save_prefix='train_'
)
valf = image_datagen.flow(
    x=stacked_images_channel,
    y=stacked_masks_channel,
    batch_size=batch_size,
    shuffle=False,
    seed=seed,
    subset='validation',
    sample_weight=sample_weight,
    save_to_dir='valdir',
    save_prefix='val_'
)
STEP_SIZE_TRAIN = imf.n // imf.batch_size
STEP_SIZE_VALID = valf.n // valf.batch_size
model = unet.UNet2(numberOfClasses, imshape, '', learningRate, depth=4)
history = model.fit_generator(generator=imf,
                              steps_per_epoch=STEP_SIZE_TRAIN,
                              epochs=epochs,
                              validation_data=valf,
                              validation_steps=STEP_SIZE_VALID,
                              verbose=2
                              )
Thank you in advance for your attention.
As of Keras 2.2.5 with preprocessing at 1.1.0, the sample_weight is passed along with the samples and applied during processing. When calling .fit_generator, the model is trained on batches, each batch using sample weights:
model.train_on_batch(x, y,
                     sample_weight=sample_weight,
                     class_weight=class_weight)
In the source code of .train_on_batch, the documentation states: "sample_weight: Optional array of the same length as x, containing weights to apply to the model's loss for each sample. (...)". The actual application of weights happens when calculating loss on each batch. When compiling a model, Keras generates a "weighted loss" function out of the desired loss function. The weighted computation is stated in the code as:
def weighted(y_true, y_pred, weights, mask=None):
    """Wrapper function.
    # Arguments
        y_true: `y_true` argument of `fn`.
        y_pred: `y_pred` argument of `fn`.
        weights: Weights tensor.
        mask: Mask tensor.
    # Returns
        Scalar tensor.
    """
    # score_array has ndim >= 2
    score_array = fn(y_true, y_pred)
    if mask is not None:
        # Cast the mask to floatX to avoid float64 upcasting in Theano
        mask = K.cast(mask, K.floatx())
        # mask should have the same shape as score_array
        score_array *= mask
        # the loss per batch should be proportional
        # to the number of unmasked samples.
        score_array /= K.mean(mask) + K.epsilon()
    # apply sample weighting
    if weights is not None:
        # reduce score_array to same ndim as weight array
        ndim = K.ndim(score_array)
        weight_ndim = K.ndim(weights)
        score_array = K.mean(score_array,
                             axis=list(range(weight_ndim, ndim)))
        score_array *= weights
        score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
    return K.mean(score_array)
This wrapper shows that it first calculates the desired loss (the call to fn(y_true, y_pred)), then applies weighting if weights were passed (either with sample_weight or class_weight).
With this context in mind:
what is the concrete influence of this sample_weight parameter on the training of my model.
The loss is basically multiplied by the weights (and normalized). So samples with "heavy" weights (more than 1) contribute more loss and therefore larger gradients. "Light" weights reduce the importance of the sample and lead to smaller gradients.
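As a rough illustration of that arithmetic, the sketch below mirrors the weighting branch of the quoted wrapper in plain Python, for per-sample losses that are already one number per sample and with no mask. The function name weighted_mean_loss and the numbers are made up for the example.

```python
def weighted_mean_loss(per_sample_loss, weights):
    """Mirror the weighting branch of the quoted wrapper for 1-D inputs without a mask."""
    # Multiply each per-sample loss by its weight.
    scaled = [loss * w for loss, w in zip(per_sample_loss, weights)]
    # Normalize by the fraction of samples with a non-zero weight
    # (like the original, this divides by zero if every weight is 0).
    nonzero_fraction = sum(1.0 for w in weights if w != 0) / len(weights)
    scaled = [s / nonzero_fraction for s in scaled]
    # Final scalar: mean over the batch.
    return sum(scaled) / len(scaled)


losses = [1.0, 2.0, 3.0, 4.0]
uniform = weighted_mean_loss(losses, [1.0, 1.0, 1.0, 1.0])
reweighted = weighted_mean_loss(losses, [1.0, 0.0, 2.0, 1.0])
print(uniform)
print(reweighted)
```

With a zero weight the second sample drops out entirely, and the third sample's doubled weight pulls the batch loss towards its value.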
Does it influence the data augmentation?
It depends on what you mean. Here is what I can say from experience, where I perform augmentation before feeding a Keras data generator (doing so as there were issues in preprocessing, as far as I know still existing in Preprocessing 1.1.0):
When feeding already augmented data to the generator, the .flow call will require a sample weights list as long as the input data. So the influence of weighting on augmentation depends on how the weights are chosen. For a data point augmented N times, one may assign the same weight to each augmentation, or 1/N of it, depending on the intent.
The default behaviour in Keras seems to assign the same weight to each augmentation (transform) performed by Keras. The code looks pretty clear, although I have never relied on it.
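For the case where augmentation is done before the generator, the two weighting choices (same weight for every copy, or 1/N of it) might look like the plain-Python sketch below. The helper expand_weights is hypothetical, and it assumes the augmented copies of each sample are stored next to each other in the input array.

```python
def expand_weights(sample_weights, n_augmentations, share=False):
    """Repeat each sample's weight once per augmented copy.

    share=False: every copy keeps the original weight.
    share=True:  every copy gets weight / n_augmentations, so the copies of
                 one sample together carry the original weight.
    """
    factor = 1.0 / n_augmentations if share else 1.0
    return [w * factor for w in sample_weights for _ in range(n_augmentations)]


original = [1.0, 4.0]
same_weight = expand_weights(original, 2)
shared_weight = expand_weights(original, 2, share=True)
print(same_weight)
print(shared_weight)
```

Either list has the same length as the augmented data, which is what the .flow call expects for sample_weight.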
If I use the 'validation_split' parameter, does it influence the way validation sets are generated?
The sample_weight parameter does not seem to interfere with validation_split. I have not looked into the code specifically, but splitting basically gets the input data and keeps a split for validation, whatever the data is. When sample_weight is added, what changes is each data point: without weight, data is (x, y); with weight, data becomes (x, y, weight).

Graph disconnected: cannot obtain value for tensor Tensor

I have to train a GAN network with Generator and Discriminator. My Generator Network is as below.
def Generator(image_shape=(512,512,3)):
    inputs = Input(image_shape)
    # 5 convolution Layers
    # 5 Deconvolution Layers along with concatenation
    # output shape is (512,512,3)
    model=Model(inputs=inputs,outputs=outputs, name='Generator')
    return model, outputs
My Discriminator Network is as below. The first step in Discriminator network is that I have to concatenate the input of discriminator with output of Generator.
def Discriminator(Generator_output, image_shape=(512,512,3)):
    inputs=Input(image_shape)
    concatenated_input=concatenate([Generator_output, inputs], axis=-1)
    # Now start applying Convolution Layers on concatenated_input
    # Deconvolution Layers
    return Model(inputs=inputs,outputs=outputs, name='Discriminator')
Initiating the Architectures
G, Generator_output=Generator(image_shape=(512,512,3))
G.summary()
D=Discriminator(Generator_output, image_shape=(512,512,3))
D.summary()
My Problem is when I pass concatenated_input to convolution layers it gets me the following error.
Graph disconnected: cannot obtain value for tensor Tensor("input_1:0", shape=(?, 512, 512, 3), dtype=float32) at layer "input_1". The following previous layers were accessed without issue: []
If I remove the concatenation layer it works perfectly, but why does it not work with the concatenation layer, even though inputs and Generator_output have the same shape, i.e. (512,512,3)?
The key insight that will help you here is that Models are just like layers in Keras, but self-contained. So to connect one model's output to another, you need to declare that the second model receives an input of matching shape rather than directly passing that tensor:
def Discriminator(gen_output_shape, image_shape=(512,512,3)):
    inputs=Input(image_shape)
    gen_output=Input(gen_output_shape)
    concatenated_input=concatenate([gen_output, inputs], axis=-1)
    # Now start applying Convolution Layers on concatenated_input
    # Deconvolution Layers
    return Model(inputs=[inputs, gen_output],outputs=outputs, name='Discriminator')
And then you can use it like a layer:
G, _ = Generator(image_shape=(512,512,3))  # Generator from the question returns (model, output tensor)
D=Discriminator((512,512,3), image_shape=(512,512,3))
some_other_image_input = Input((512,512,3))
# model is used like a layer: a model with two inputs is called on a list,
# so the output of G is connected to the second input of D
discriminator_output = D([some_other_image_input, G.output])
D.summary()
gan = Model(inputs=[all,your,inputs], outputs=[outputs,for,training])
# you can still use G and D like separate models, save them, train them etc
To train them together, you can create another Model that has all the required inputs and calls the generator and discriminator. Think of it as a lock-and-key idea: every model has some inputs, and you can use models like layers in another Model so long as you provide the correct inputs.

Resources