Embedding layer in neural machine translation with attention - pytorch

I am trying to understand how to implement a seq-to-seq model with attention from this website.
My question: does nn.Embedding just return some IDs for each word, so that the embedding for each word stays the same during the whole training? Or do the embeddings change during training?
My second question: after training, is the output of nn.Embedding something like word2vec word embeddings or not?
Thanks in advance

According to the PyTorch docs:
A simple lookup table that stores embeddings of a fixed dictionary and size.
This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
In short, nn.Embedding maps a sequence of vocabulary indices into an embedding space. You can indeed roughly understand this as a word2vec-style mechanism.
As a dummy example, let's create an embedding layer for a vocabulary of 10 entries (i.e. the input data contains only 10 unique tokens) that returns word vectors living in 5-dimensional space. In other words, each word is represented as a 5-dimensional vector. The dummy data is a sequence of 3 words with indices 1, 2, and 3, in that order.
>>> import torch
>>> import torch.nn as nn
>>> embedding = nn.Embedding(10, 5)
>>> embedding(torch.tensor([1, 2, 3]))
tensor([[-0.7077, -1.0708, -0.9729, 0.5726, 1.0309],
[ 0.2056, -1.3278, 0.6368, -1.9261, 1.0972],
[ 0.8409, -0.5524, -0.1357, 0.6838, 3.0991]],
grad_fn=<EmbeddingBackward>)
You can see that each of the three words is now represented as a 5-dimensional vector. We also see that the output carries a grad_fn, which means it is part of the autograd graph: gradients flow back into the weights of this layer, so they are adjusted through backprop whenever they are handed to an optimizer. This answers your question of whether embedding layers are trainable: the answer is yes. And indeed this is the whole point of embedding: we expect the embedding layer to learn meaningful representations, the classic example being king - man + woman ≈ queen.
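To check that the looked-up rows really change during training, here is a minimal sketch. The toy loss (sum of squares of the looked-up vectors) and the learning rate are assumptions chosen only for illustration; they are not part of the tutorial in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 5)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

# Keep a copy of the weights before the update.
before = embedding.weight.detach().clone()
indices = torch.tensor([1, 2, 3])

# Toy objective (an assumption for this sketch): shrink the looked-up vectors.
loss = embedding(indices).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Only the rows that took part in the forward pass receive a non-zero gradient.
after = embedding.weight.detach()
changed_rows = (before != after).any(dim=1).nonzero().flatten().tolist()
print(changed_rows)
```

With plain SGD, only the rows for indices 1, 2, and 3 are updated; the rows of words that never appear in a batch stay at their initial values.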
Edit
The embedding layer is, as the documentation states, a simple lookup table from a matrix. You can see this by doing
>>> embedding.weight
Parameter containing:
tensor([[-1.1728, -0.1023, 0.2489, -1.6098, 1.0426],
[-0.7077, -1.0708, -0.9729, 0.5726, 1.0309],
[ 0.2056, -1.3278, 0.6368, -1.9261, 1.0972],
[ 0.8409, -0.5524, -0.1357, 0.6838, 3.0991],
[-0.4569, -1.9014, -0.0758, -0.6069, -1.2985],
[ 0.4545, 0.3246, -0.7277, 0.7236, -0.8096],
[ 1.2569, 1.2437, -1.0229, -0.2101, -0.2963],
[-0.3394, -0.8099, 1.4016, -0.8018, 0.0156],
[ 0.3253, -0.1863, 0.5746, -0.0672, 0.7865],
[ 0.0176, 0.7090, -0.7630, -0.6564, 1.5690]], requires_grad=True)
You will see that the rows at indices 1, 2, and 3 of this matrix (the second, third, and fourth rows, since indexing starts at 0) correspond to the result that was returned in the example above. In other words, for a vocabulary entry whose index is n, the embedding layer simply looks up row n of its weight matrix and returns that row vector; hence the lookup table.
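The lookup behaviour can be verified directly. This is a small sketch using the same layer sizes as above; the variable names are only illustrative.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 5)
indices = torch.tensor([1, 2, 3])

# Calling the layer ...
looked_up = embedding(indices)
# ... gives the same values as indexing the weight matrix by row.
indexed = embedding.weight[indices]

is_same = torch.equal(looked_up, indexed)
print(is_same)
```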

Related

fasttext produces a different vector after training

Here's my training:
import fasttext
model = fasttext.train_unsupervised('data.txt', model='skipgram')
Now, let's observe the first vector (omitted the full output for readability)
model.get_input_vector(0)
# array([-0.1988439 , 0.40966552, 0.47418243, 0.148709 , 0.5891477
On the other hand, let's input the first string into our model:
model[data.iloc[0]]
# array([ 0.10782535, 0.3055557 , 0.19097836, -0.15849613, 0.14204402
We get a different vector.
Why?
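You should have explained more about the structure of your data. In any case, model[data.iloc[0]] is equivalent to model.get_word_vector(data.iloc[0]), so it expects a single word.
On the other hand, model.get_input_vector(0) does not take text at all: as far as I know, it returns row 0 of the model's input matrix, i.e. the raw vector of the first vocabulary entry (model.words[0]), without the subword n-gram vectors that get_word_vector averages in. So the two results are only expected to match if data.iloc[0] is exactly model.words[0] and subwords are disabled (maxn=0). If data.iloc[0] is a sentence, use model.get_sentence_vector(data.iloc[0]) rather than model[...]; otherwise, take the first word of your data, pass it to the model, and compare against model.get_input_vector(model.get_word_id(word)).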
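A minimal sketch of why the two vectors differ, assuming (as I understand the fastText Python bindings) that get_word_vector averages the input-matrix rows listed by get_subwords. The helper name reconstruct_word_vector is hypothetical, and model stands for any trained fastText model.

```python
import numpy as np

def reconstruct_word_vector(model, word):
    """Average the input-matrix rows that fastText associates with `word`.

    Assumes `model` exposes get_subwords(word) -> (subwords, indices)
    and get_input_vector(index), as the fastText Python bindings do.
    """
    _subwords, indices = model.get_subwords(word)
    rows = [model.get_input_vector(int(i)) for i in indices]
    return np.mean(rows, axis=0)
```

If that assumption holds, the result should be close to model.get_word_vector(word), while model.get_input_vector(model.get_word_id(word)) is just one of the rows being averaged.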

Why do the items have to be made sequential, starting at zero, before embedding?

I am learning collaborative filtering from this blog, Deep Learning With Keras: Recommender Systems.
The tutorial is good, and the code works well. Here is my code.
One thing confuses me. The author said:
The user/movie fields are currently non-sequential integers representing some unique ID for that entity. We need them to be sequential starting at zero to use for modeling (you'll see why later).
user_enc = LabelEncoder()
ratings['user'] = user_enc.fit_transform(ratings['userId'].values)
n_users = ratings['user'].nunique()
But he didn't seem to mention the reason, and I don't know why this is needed. Can someone explain it to me?
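An Embedding layer expects its inputs to be sequential indices starting at zero.
The first argument of Embedding is the input dimension, i.e. the number of rows in the lookup table.
Embedding therefore assumes that the maximum value in the input is input dimension - 1 (indices start from 0).
An index that is equal to or larger than the input dimension has no row in the table, so no valid embedding is produced for it. To my knowledge, depending on the device, the lookup either raises an error or returns zeros.
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding?hl=ja
I think it is clearest to explain this with TensorFlow. As an example, the following code can only produce embeddings for the input [4, 3]; the input [7, 8] is out of range since the input dimension is 5.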
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
model = Sequential()
model.add(Embedding(5, 1, input_length=2))
input_array = np.array([[4,3], [7,8]])
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
If you increase the input dimension to 9, you will get embeddings for both inputs.
You could set the input dimension to the maximum ID + 1 in the original data set, but this is not efficient: the table would contain a row for every unused ID in between.
It is similar to one-hot encoding, where re-indexing the IDs sequentially saves a great amount of memory.
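The memory argument can be made concrete with a small pure-Python sketch. The raw IDs below are made up for illustration, and the dictionary-based mapping only mimics what LabelEncoder does in the tutorial (it also assigns indices in sorted order).

```python
# Hypothetical raw user IDs: non-sequential, with large gaps.
raw_user_ids = [1001, 42, 987654, 42, 1001]

# Map every distinct ID to a sequential index starting at zero.
id_to_index = {uid: i for i, uid in enumerate(sorted(set(raw_user_ids)))}
encoded = [id_to_index[uid] for uid in raw_user_ids]

# Rows the embedding table needs in each case.
rows_needed_raw = max(raw_user_ids) + 1   # one row per possible raw ID
rows_needed_encoded = len(id_to_index)    # one row per distinct user

print(encoded)
print(rows_needed_raw, rows_needed_encoded)
```

Here the raw IDs would force a table with 987655 rows, almost all of them unused, while the encoded IDs need only 3.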

Some diverging issues of Word2Vec in Gensim using high alpha values

I am training word2vec in gensim on a corpus stored as nested lists (a list of sentences, each a list of tokenized words) with 408226 sentences (lists) and a total of 3150546 words or tokens.
I am getting meaningful results (in terms of the similarity between two words using model.wv.similarity) with size set to 200, window to 15, min_count to 5, iter to 10 and alpha to 0.5. All words are lemmatized, and the resulting model vocabulary has 32716 entries.
The results obtained with the default alpha, size and window are meaningless to me, judging by the similarity values computed on my data. A higher alpha of 0.5, however, gives meaningful similarity scores between two words. But when I compute the top n similar words, the results are again meaningless. Do I need to change all the parameters used in the initial training process?
I still cannot work out why the model behaves well with such a high alpha value when computing the similarity between two words of the corpus, whereas the top n similar words with scores for an input word are meaningless. Why is this the case?
Is it diverging from the optimal solution? How can I check this?
Any idea why this is the case is deeply appreciated.
Note: I'm using Python 3.7 on a Windows machine with Anaconda Prompt, and the input to the model is read from a file.
This is what I have tried.
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.models import Word2Vec
import ast
path = "F:/Folder/"
def load_data():
    global Sentences
    Sentences = []
    for file in ['data_d1.txt','data_d2.txt']:
        with open(path + file, 'r', encoding = 'utf-8') as f1:
            Sentences.extend(ast.literal_eval(*f1.readlines()))
load_data()
def initialize_word_embedding():
    model = Word2Vec(Sentences, size = 200, window = 15, min_count = 5, iter = 10, workers = 4)
    print(model)
    print(len(model.wv.vocab))
    print(model.wv.similarity(w1 = 'structure', w2 = '_structure_'))
    similarities = model.wv.most_similar('system')
    for word, score in similarities:
        print(word , score)
initialize_word_embedding()
The example of Sentences list is as follows:
[['scientist', 'time', 'comet', 'activity', 'sublimation', 'carbon', 'dioxide', 'nears', 'ice', 'system'], ['inconsistent', 'age', 'system', 'year', 'size', 'collision'], ['intelligence', 'system'], ['example', 'application', 'filter', 'image', 'motion', 'channel', 'estimation', 'equalization', 'example', 'application', 'filter', 'system']]
data_d1.txt and data_d2.txt each contain a nested list (a list of lists of lemmatized, tokenized words). I preprocessed the raw data and saved it to these files, which are now given as input. For lemmatizing the tokens, I used the popular WordNet lemmatizer.
I need the word-embedding model to calculate the similarity between two words and to compute the most similar words for a given input word. I am getting some meaningful scores from the model.wv.similarity() method, whereas for the most_similar() words of a word (say, system, as shown above) I am not getting the desired results.
I am guessing that the model is diverging from the global minimum because of the high alpha value.
I am also unsure what the dimension size and window should be to get meaningful results, as there are no rules for choosing the size and window.
Any suggestion is appreciated. The total numbers of sentences and words are specified above in the question.
Results that I am getting without setting alpha = 0.5:
Edit to Recent Comment:
Results:
Word2Vec(vocab=32716, size=200, alpha=0.025)
The similarity between set and _set_ is : 0.000269373188960656
which is meaningless to me, as the score is very low. But I get 71% by setting alpha to 0.5, which seems meaningful to me, as the word set is the same in both domains.
Explanation: the word set should be the same in both domains (I am comparing the data of two domains containing the same word). Don't be confused by the word _set_: it is the same word as set, but I have added the character _ at the start and end to distinguish the two domains.
The top 10 words along with scores of _set_ are:
_niche_ 0.6891741752624512
_intermediate_ 0.6883598566055298
_interpretation_ 0.6813371181488037
_printer_ 0.675414502620697
_finer_ 0.6625382900238037
_pertinent_ 0.6620787382125854
_respective_ 0.6619025468826294
_converse_ 0.6610435247421265
_developed_ 0.659270167350769
_tent_ 0.6588765382766724
Whereas, the top 10 words for set are:
cardinality 0.633270263671875
typereduction 0.6233855485916138
zdzisław 0.619156002998352
crisp 0.6165326833724976
equivalenceclass 0.605925977230072
pawlak 0.6058803200721741
straight 0.6045454740524292
culik 0.6040038466453552
rin 0.6038737297058105
multisets 0.6035065650939941
Why is the cosine similarity value 0.00 for the word set across the two different data sets?

How to learn the embeddings in Pytorch and retrieve it later

I am building a recommendation system where I predict the best item for each user given their purchase history. I have userIDs and itemIDs and how much of each itemID was purchased by each userID. I have millions of users and thousands of products. Not all products have been purchased (there are some products that no one has bought yet). Since the numbers of users and items are large, I don't want to use one-hot vectors. I am using PyTorch and I want to create and train embeddings so that I can make predictions for each user-item pair. I followed this tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's an accurate assumption that the embedding layer is being trained, do I retrieve the learned weights through the model.parameters() method, or should I use the embedding.weight.data option?
model.parameters() returns all the parameters of your model, including the embeddings.
So all the parameters of your model are handed over to the optimizer (line below) and will be trained later when calling optimizer.step(). So yes, your embeddings are trained along with all other parameters of the network. (You can also freeze certain layers by setting e.g. embedding.weight.requires_grad = False, but this is not the case here.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can see that your embedding weights are also of type Parameter by doing so:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, or do you want the whole matrix in order to save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502]],
grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020],
[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502],
[-0.5829, -0.1918, -0.8079, 0.6922, -0.2627]])
I hope this answers your question. You can also take a look at the documentation, where you can find some useful examples as well.
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
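If "retrieve" means reusing the learned vectors later, one option is to copy the weight matrix out and load it into a new, frozen layer. This is only a sketch: the layer sizes are arbitrary, and the freshly initialised layer called trained stands in for an embedding that has actually been trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
trained = nn.Embedding(5, 3)  # stand-in for an already trained layer

# Copy the learned matrix out of the layer, detached from autograd.
weights = trained.weight.detach().clone()

# Rebuild a lookup-only layer from the saved matrix.
restored = nn.Embedding.from_pretrained(weights, freeze=True)

indices = torch.tensor([0, 4])
same_vectors = torch.equal(trained(indices), restored(indices))
still_trainable = restored.weight.requires_grad
print(same_vectors, still_trainable)
```

The copied matrix can also be written to disk with torch.save(weights, path) and read back with torch.load(path) before calling from_pretrained.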

how to convert Word to vector using embedding layer in Keras

I have a word embedding file as shown below (click here to see the complete file on GitHub). I would like to know the procedure for generating word embeddings so that I can generate word embeddings for my personal dataset.
in -0.051625 -0.063918 -0.132715 -0.122302 -0.265347
to 0.052796 0.076153 0.014475 0.096910 -0.045046
for 0.051237 -0.102637 0.049363 0.096058 -0.010658
of 0.073245 -0.061590 -0.079189 -0.095731 -0.026899
the -0.063727 -0.070157 -0.014622 -0.022271 -0.078383
on -0.035222 0.008236 -0.044824 0.075308 0.076621
and 0.038209 0.012271 0.063058 0.042883 -0.124830
a -0.060385 -0.018999 -0.034195 -0.086732 -0.025636
The 0.007047 -0.091152 -0.042944 -0.068369 -0.072737
after -0.015879 0.062852 0.015722 0.061325 -0.099242
as 0.009263 0.037517 0.028697 -0.010072 -0.013621
Google -0.028538 0.055254 -0.005006 -0.052552 -0.045671
New 0.002533 0.063183 0.070852 0.042174 0.077393
with 0.087201 -0.038249 -0.041059 0.086816 0.068579
at 0.082778 0.043505 -0.087001 0.044570 0.037580
over 0.022163 -0.033666 0.039190 0.053745 -0.035787
new 0.043216 0.015423 -0.062604 0.080569 -0.048067
I was able to convert each word in a dictionary to the above format by following the steps below:
initially, represent each word in the dictionary by a unique integer
take each integer one by one, wrap it as np.array([[integer]]), and give it as the input array in the code below
then store the word corresponding to the integer and its output vector in a JSON file (I used output_array.tolist() to store the vector in JSON format)
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding
model = Sequential()
model.add(Embedding(dictionary_size_here, sizeof_embedding_vector, input_length= input_length_here))
input_array = np.array([[integer]]) #each integer is fed one by one using a loop
model.compile('rmsprop', 'mse')
output_array = model.predict(input_array)
Reference
How does Keras 'Embedding' layer work?
It is important to understand that there are multiple ways to generate an embedding for words. The popular word2vec, for example, can generate word embeddings using CBOW or Skip-grams.
Hence, one could have multiple "procedures" to generate word embeddings. One of the easier methods to understand (albeit with its drawbacks) is Singular Value Decomposition (SVD). The steps are briefly described below.
Create a term-document matrix, i.e. terms as rows and the documents they appear in as columns.
Perform SVD.
Truncate the output vector for each term to n dimensions. In your example above, n = 5.
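The steps above can be sketched with NumPy as follows. The three tiny documents are made up for illustration, n is set to 2 because the toy corpus is too small for 5 dimensions, and scaling the left singular vectors by the singular values is one common choice rather than the only one.

```python
import numpy as np

# Toy corpus (an assumption for this sketch): three tokenized documents.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["cat", "and", "dog"]]
vocab = sorted({word for doc in docs for word in doc})

# Step 1: term-document matrix, terms as rows and documents as columns.
term_doc = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc:
        term_doc[vocab.index(word), j] += 1

# Step 2: singular value decomposition.
U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)

# Step 3: keep the first n dimensions of each term vector.
n = 2
embeddings = U[:, :n] * S[:n]
word_vectors = dict(zip(vocab, embeddings))
print(word_vectors["cat"])
```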
You can have a look at this link for a more detailed description of using word2vec's skip-gram model to generate an embedding: Word2Vec Tutorial - The Skip-Gram Model.
For more information on SVD, you can look at this and this.

Resources