In the embedding example here:
https://www.tensorflow.org/text/guide/word_embeddings
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
TensorShape([2, 3, 5])
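For context, the layer in the tutorial is presumably created a few lines earlier roughly like this (treat the exact arguments as an assumption; 1000 would be the vocabulary size and 5 the embedding dimensionality):

import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(1000, 5)
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
print(result.shape)   # (2, 3, 5): 2 sequences, 3 tokens each, 5-dim embeddings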
Then it explains:
When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the simplest.
The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.
Then the code:
embedding_dim=16
model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dense(1)
])
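For concreteness, here is a minimal sketch (my own, not from the tutorial) of what GlobalAveragePooling1D does to a (2, 3, 5) embedding output:

import tensorflow as tf

# stand-in for the embedding output: 2 examples, 3 tokens each, 5-dim embeddings
result = tf.random.uniform((2, 3, 5))

pooled = tf.keras.layers.GlobalAveragePooling1D()(result)
print(pooled.shape)                      # (2, 5): one 5-dim vector per example

# equivalent to averaging over the sequence (token) axis
manual = tf.reduce_mean(result, axis=1)
print(tf.reduce_all(tf.abs(pooled - manual) < 1e-6).numpy())   # True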
The GlobalAveragePooling1D layer should calculate a single averaged value for each word's embedding of dimension n. I don't understand this part:
This allows the model to handle input of variable length, in the simplest way possible.
Similarly:
To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches.
In each embedding layer, the input length is already fixed by the 'input_length' parameter, and truncation and padding are used to ensure that fixed length. So what does it mean to say that GlobalAveragePooling1D is used to convert this sequence of variable length into a fixed representation? What does 'variable length' mean here?
Related
Returns:
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)): Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training. This output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
This is from https://huggingface.co/transformers/model_doc/bert.html#bertmodel. Although the description in the document is clear, I still don't understand the hidden_states return value. It is a tuple with one element for the output of the embeddings and one for the output of each layer.
Please tell me how to distinguish them, or what they mean. Thanks very much!
hidden_states (tuple(torch.FloatTensor), optional, returned when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
For a given token, its input representation is constructed by summing the corresponding token embedding, segment embedding, and position embedding. This input representation is called the initial embedding output which can be found at index 0 of the tuple hidden_states.
The input-representation figure from the BERT paper illustrates how these embeddings are calculated.
The remaining 12 elements in the tuple contain the outputs of the corresponding hidden layers. E.g., the output of the last hidden layer can be found at index 12, which is the 13th item in the tuple. The dimensions of both the initial embedding output and the hidden states are [batch_size, sequence_length, hidden_size]. It is useful to compare the indexing of hidden_states bottom-up with the architecture figure from the BERT paper.
last_hidden_state contains the hidden representations for each token in each sequence of the batch. So the size is (batch_size, seq_len, hidden_size).
You can refer to Difference between CLS hidden state and pooled_output for more clarification.
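To see this concretely, here is a minimal sketch (my own, not from the linked docs; the model name, example sentence, and the recent transformers output-object API are assumptions) that prints the structure of hidden_states for bert-base-uncased:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("I embed stuff", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# On recent transformers versions outputs is a ModelOutput; on older ones it is a plain tuple.
hidden_states = outputs.hidden_states

print(len(hidden_states))          # 13 for bert-base: 1 embedding output + 12 layer outputs
print(hidden_states[0].shape)      # initial embedding output: (batch_size, seq_len, 768)
print(hidden_states[12].shape)     # output of the last (12th) layer: (batch_size, seq_len, 768)
print(torch.equal(hidden_states[-1], outputs.last_hidden_state))   # True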
I found the answer by looking at the length of this tuple: it is (1 + num_layers). So the output of the last layer (the final element) is distinct from the embedding output (the element at index 0), because the tuple contains the layer outputs plus the initial embedding output. :D
As we know:
keras.layers.Embedding turns positive integers (indexes) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]].
I want to know how I can see or print this dense vector output.
Or:
how can I see the output of a tensor object?
You can take a look here: https://keras.io/getting-started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer
In a few words:
Create a new model from your trained model, with the layer whose output you are interested in as the output, then use the predict method.
from keras.models import Model

layer_name = 'my_layer'
intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict(data)
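For example, with an Embedding layer like the one in the question, a minimal end-to-end sketch (toy sizes, made-up data, Keras 2-style API) could look like this:

import numpy as np
from keras.models import Sequential, Model
from keras.layers import Embedding

# Toy model: vocabulary of 10 indices, each mapped to a 2-dimensional vector
model = Sequential()
model.add(Embedding(10, 2, input_length=3, name="embedding"))
model.compile('rmsprop', 'mse')

# A batch of 2 samples, each a sequence of 3 word indices
data = np.array([[4, 1, 7],
                 [2, 2, 0]])

intermediate_layer_model = Model(inputs=model.input,
                                 outputs=model.get_layer("embedding").output)
print(intermediate_layer_model.predict(data))   # shape (2, 3, 2): the dense vectors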
Below is the RNN model.
In Keras, the output shape of the code below is (N, 10).
model=Sequential()
model.add(RNN(10, input_shape=(1, look_back)))
I know that N in the output shape (N,10) is the batch_size.
What I want to know is: what is the '10'?
Yes, 10 is D_h. Basically, RNN(units) means that the layer returns the vector h_t from the final timestep of the RNN, with its size being the number you specify for the units parameter, which in this case is 10. In other words, the units parameter is the dimensionality of the output vector of that particular layer.
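As an illustration (using SimpleRNN as a concrete RNN layer and a made-up look_back value):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN

look_back = 4   # hypothetical number of features per timestep

model = Sequential()
model.add(SimpleRNN(10, input_shape=(1, look_back)))   # units = 10 = D_h
print(model.output_shape)   # (None, 10): None is the batch size N, 10 is units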
I have a series of processed audio files I am using as input into a CNN using Keras. Does the Keras 1D Convolutional layer support variable sequence lengths? The Keras documentation makes this unclear.
https://keras.io/layers/convolutional/
At the top of the documentation it mentions you can use (None, 128) for variable-length sequences of 128-dimensional vectors. Yet at the bottom it declares that the input shape must be a
3D tensor with shape: (batch_size, steps, input_dim)
Given the following example, how should I input sequences of variable length into the network?
Let's say I have two examples (a and b), each containing some number of one-dimensional vectors of length 100, that I want to feed into the 1D Conv layer as input:
a.shape = (100, 100)
b.shape = (200, 100)
Can I use an input shape of (2, None, 100)? Or do I need to concatenate these tensors into c, where
c.shape = (300, 100)
and then reshape it to something like
c_reshape.shape = (3, 100, 100)
where 3 is the batch size, 100 is the number of steps, and the second 100 is the input size? The documentation on the input shape is not very clear.
Keras supports variable lengths by using None in the respective dimension when defining the model.
Notice that often input_shape refers to the shape without the batch size.
So, the 3D tensor with shape (batch_size, steps, input_dim) suits perfectly a model with input_shape=(steps, input_dim).
All you need to make this model accept variable lengths is use None in the steps dimension:
input_shape=(None, input_dim)
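For the (None, 100) case in the question, a minimal sketch (the filter count, kernel size, and final sigmoid Dense layer are arbitrary choices of mine) could be:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
# steps dimension is None -> sequences of any length, each step a 100-dim vector
model.add(Conv1D(32, kernel_size=5, activation='relu', input_shape=(None, 100)))
model.add(GlobalMaxPooling1D())   # collapses the variable steps dimension (see below)
model.add(Dense(1, activation='sigmoid'))
model.summary()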
Numpy limitation
Now, there is a numpy limitation about variable lengths. You cannot create a numpy array with a shape that suits variable lengths.
A few solutions are available:
Pad your sequences with dummy values until they all reach the same size, so you can put them into a numpy array of shape (batch_size, length, input_dim). Use Masking layers to ignore the dummy values.
Train with separate numpy arrays of shape (1, length, input_dim), each array having its own length.
Group your samples by length into smaller arrays.
Be careful with layers that don't support variable sizes
In convolutional models using variable sizes you can't, for instance, use Flatten: if that were possible, the result of the flatten would itself have a variable size, and the Dense layers that follow would not be able to have a constant number of weights.
So, instead of Flatten, you should start using GlobalMaxPooling1D or GlobalAveragePooling1D layers.
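Continuing the sketch above, training with separate arrays (the second option from the list) for the two variable-length samples in the question might look like this (the data and labels are made up):

import numpy as np

model.compile(optimizer='adam', loss='binary_crossentropy')

# Hypothetical variable-length samples, as in the question
a = np.random.rand(100, 100)   # 100 steps of 100-dim vectors
b = np.random.rand(200, 100)   # 200 steps of 100-dim vectors
labels = [0, 1]                # made-up binary targets

# Train sample by sample, each as a batch of size 1 with its own length
for x, y in zip([a, b], labels):
    model.train_on_batch(x[np.newaxis, ...], np.array([y]))

# Prediction also accepts any sequence length
print(model.predict(a[np.newaxis, ...]).shape)   # (1, 1)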
Suppose I have the following dataset X with 2 features and labels Y.
X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]
Y = [0, 1, 0]
# split into input (X) and output (Y) variables
X = dataset[:, 0:2]  # X features are the first two columns
Y = dataset[:, 2]
model = Sequential()
model.add(Embedding(2, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y)
It works, but I wanted to know more about parameter_1, parameter_2, and parameter_3 that go in
Embedding(parameter_1, parameter_2, input_length=parameter_3)
P.S. I just put in random stuff and don't know what I am doing.
What would be the proper parameters to fill in Embedding() given the data set I described above?
Alright, following the more precise questions in the comments, here is the explanation.
An embedding layer is usually used to embed words, so I will use words as the running example, but you can think of them as categorical features.
The embedding layer is indeed useful for representing words (categorical features) as vectors in a continuous vector space.
When you have a text, you tokenize the words and assign each one a number. They then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3], where a dictionary maps each index to a word: {1: "embed", 2: "I", 3: "stuff", 4: "some_other_words", 0: "<pad>"}.
When you use a neural network, or any continuous mathematical framework, those discrete objects (= categories) are unordered: there is no sense in which 2 > 1 when you talk about your words; they are not numerical values, they are categories. So you want to turn them into numbers, that is, embed them in a vector space.
This is precisely what the Embedding() layer does: it maps each index to a vector. To do that, there are three main parameters to define:
How many indices you want to use in total. This is the number of words you have in your vocabulary, or the number of categories of the categorical feature you want to encode. This is the input_dim parameter. In our little example, we have 5 words in the vocabulary (indices from 0 to 4), so input_dim = 5. The reason it is called a "dimension" is that, under the hood, Keras is effectively treating each index as a one-hot vector whose dimension is the number of different elements. For example, the word "stuff", which has index 3, corresponds to the 5-dimensional one-hot vector [0 0 0 1 0] before being embedded. This is why your inputs should be integers: they are indices indicating where the 1 is in the one-hot vector.
How big you want your output vectors to be. This is the size of the vector space your features will live in. The parameter is output_dim. If you don't have many words in your vocabulary (many different categories for your feature), this number should be low; in our case we will set output_dim = 2, so our 5 words will live in a 2D space.
As embedding layers are often the first layer in a neural network, you need to specify the number of words per sample. This will be input_length. Our sample was a 3-word phrase, so input_length = 3.
The reason why you usually have the embedding layer as the first layer is that it takes integer inputs, whereas the other layers in a neural network return real values, so it wouldn't work anywhere else.
So, to summarize: what goes into the layer is a sequence of indices, [2, 1, 3] in our example, and what comes out is the embedded vector corresponding to each index. This might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
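Putting the numbers from this running example into code (a sketch of mine; the actual embedding values are random until the layer is trained):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
# input_dim=5 words in the vocabulary, output_dim=2 dimensions per vector,
# input_length=3 words per sample
model.add(Embedding(input_dim=5, output_dim=2, input_length=3))
model.compile('rmsprop', 'mse')

sample = np.array([[2, 1, 3]])    # "I embed stuff" as indices
print(model.predict(sample))      # shape (1, 3, 2): one 2-dim vector per index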
And to come back to your example, the input should be a list of samples, each sample being a list of indices. There is no point in embedding features that are already real values: those already have a mathematical meaning that the model can use directly, as opposed to categorical values.
Is it clearer now? :)