How to learn embeddings in PyTorch and retrieve them later - pytorch

I am building a recommendation system that predicts the best item for each user given their purchase history. I have userIDs, itemIDs, and how many times each itemID was purchased by each userID. There are millions of users and thousands of products, and not all products have been purchased (some have not been bought by anyone yet). Since the numbers of users and items are large, I don't want to use one-hot vectors. I am using PyTorch and want to create and train embeddings so that I can make predictions for each user-item pair. I followed this tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. Assuming the embedding layer is being trained, do I retrieve the learned weights through the model.parameters() method, or should I use the embedding.weight.data option?

model.parameters() returns all the parameters of your model, including the embeddings.
All of these parameters are handed over to the optimizer (line below) and are updated when optimizer.step() is called, so yes, your embeddings are trained along with all other parameters of the network. (You can also freeze certain layers, e.g. by setting embedding.weight.requires_grad = False, but that is not the case here.)
# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)
You can check that your embedding weights are also of type Parameter like this:
import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))
This will output the type of the weights, which is Parameter:
<class 'torch.nn.parameter.Parameter'>
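To check directly that the embeddings are being trained, one option is to snapshot the weights, run a single optimizer step, and compare. The sketch below assumes a hypothetical dot-product recommender (the `DotRecommender` class, its sizes, and the batch are all made up for illustration); it is not the asker's model.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

class DotRecommender(nn.Module):
    """Hypothetical model: score = dot(user vector, item vector)."""
    def __init__(self, n_users, n_items, dim):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=1)

model = DotRecommender(n_users=100, n_items=20, dim=8)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Made-up batch: (user ID, item ID, purchase count)
users = torch.tensor([0, 1, 2, 3])
items = torch.tensor([5, 5, 7, 9])
counts = torch.tensor([1.0, 3.0, 0.0, 2.0])

# Snapshot the user embedding matrix before the update
before = model.user_emb.weight.detach().clone()

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(users, items), counts)
loss.backward()
optimizer.step()

# Rows that differ after one step of plain SGD
after = model.user_emb.weight.detach()
changed_rows = (before != after).any(dim=1).nonzero().flatten().tolist()
print(changed_rows)  # only rows for users that appeared in the batch
```

With plain SGD, only the rows that were looked up in the batch receive a non-zero gradient, which is why the other user vectors stay untouched.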
I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, getting the whole matrix to save it, or something else?
embedding_matrix = torch.nn.Embedding(5, 5)
# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))
# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))
# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)
Output:
Getting a single vector:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
tensor([[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502]],
grad_fn=<EmbeddingBackward>)
Getting weights:
tensor([[-0.0144, -0.6245, 1.3611, -1.0753, 0.5020],
[ 0.9277, -0.1879, -1.4999, 0.2895, 0.8367],
[-0.1167, -2.2139, 1.6918, -0.3483, 0.3508],
[ 2.3763, -1.3408, -0.9531, 2.2081, -1.5502],
[-0.5829, -0.1918, -0.8079, 0.6922, -0.2627]])
I hope this answers your question. You can also take a look at the documentation, where you will find some useful examples as well:
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
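If "retrieve" means keeping the learned matrix around for later use, a minimal sketch could look like the following. The small `embedding` layer here merely stands in for a trained one, and the variable names are illustrative.

```python
import torch
from torch import nn

embedding = nn.Embedding(5, 3)  # stands in for a trained layer

# Full matrix as a plain tensor, detached from autograd
matrix = embedding.weight.detach().clone()

# Same values as a NumPy array, e.g. for a nearest-neighbour search
matrix_np = matrix.cpu().numpy()

# Vector for a single ID, without recording gradients
with torch.no_grad():
    vec = embedding(torch.tensor([2]))[0]

# Rebuild an equivalent, frozen layer later from the saved matrix
restored = nn.Embedding.from_pretrained(matrix, freeze=True)
```

Using detach() rather than .data keeps autograd aware of what is happening, and the matrix can be persisted with torch.save like any other tensor.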

Related

Embedding layer in neural machine translation with attention

I am trying to understand how to implement a seq-to-seq model with attention from this website.
My first question: does nn.Embedding just return some IDs for each word, so that the embedding for each word stays the same during the whole training? Or do the embeddings change during training?
My second question: after training, is the output of nn.Embedding something like word2vec word embeddings or not?
Thanks in advance.
According to the PyTorch docs:
A simple lookup table that stores embeddings of a fixed dictionary and size.
This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.
In short, nn.Embedding embeds a sequence of vocabulary indices into a new embedding space. You can indeed roughly understand this as a word2vec style mechanism.
As a dummy example, let's create an embedding layer for a vocabulary of 10 tokens (i.e. the input data contains only 10 unique tokens) that returns embedded word vectors living in 5-dimensional space. In other words, each word is represented as a 5-dimensional vector. The dummy data is a sequence of 3 words with indices 1, 2, and 3, in that order.
>>> embedding = nn.Embedding(10, 5)
>>> embedding(torch.tensor([1, 2, 3]))
tensor([[-0.7077, -1.0708, -0.9729, 0.5726, 1.0309],
[ 0.2056, -1.3278, 0.6368, -1.9261, 1.0972],
[ 0.8409, -0.5524, -0.1357, 0.6838, 3.0991]],
grad_fn=<EmbeddingBackward>)
You can see that each of the three words is now represented as a 5-dimensional vector. The output also carries a grad_fn, which means it is part of the autograd graph, so gradients can flow back to the layer's weights and they will be adjusted through backprop. This answers your question of whether embedding layers are trainable: the answer is yes. Indeed, this is the whole point of embeddings: we expect the layer to learn meaningful representations, the classic example being king - man + woman ≈ queen.
Edit
The embedding layer is, as the documentation states, a simple lookup table from a matrix. You can see this by doing
>>> embedding.weight
Parameter containing:
tensor([[-1.1728, -0.1023, 0.2489, -1.6098, 1.0426],
[-0.7077, -1.0708, -0.9729, 0.5726, 1.0309],
[ 0.2056, -1.3278, 0.6368, -1.9261, 1.0972],
[ 0.8409, -0.5524, -0.1357, 0.6838, 3.0991],
[-0.4569, -1.9014, -0.0758, -0.6069, -1.2985],
[ 0.4545, 0.3246, -0.7277, 0.7236, -0.8096],
[ 1.2569, 1.2437, -1.0229, -0.2101, -0.2963],
[-0.3394, -0.8099, 1.4016, -0.8018, 0.0156],
[ 0.3253, -0.1863, 0.5746, -0.0672, 0.7865],
[ 0.0176, 0.7090, -0.7630, -0.6564, 1.5690]], requires_grad=True)
You will see that the rows at indices 1, 2, and 3 of this matrix (the second, third, and fourth rows) correspond to the result returned in the example above. In other words, for a vocabulary item whose index is n, the embedding layer simply looks up the row at index n in its weight matrix and returns that row vector; hence the lookup table.
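The lookup-table behaviour can be checked in a few lines. This sketch compares the layer's output with direct row indexing into the weight matrix, and with the equivalent (but wasteful) one-hot matrix product; the variable names are illustrative.

```python
import torch
from torch import nn

embedding = nn.Embedding(10, 5)
indices = torch.tensor([1, 2, 3])

# Calling the layer and indexing the weight matrix select the same rows
looked_up = embedding(indices)
row_indexed = embedding.weight[indices]

# Equivalent one-hot formulation: each one-hot row picks out one weight row
one_hot = nn.functional.one_hot(indices, num_classes=10).float()
via_matmul = one_hot @ embedding.weight

print(torch.equal(looked_up, row_indexed))
print(torch.allclose(looked_up, via_matmul))
```

The one-hot product gives the same vectors, which is why an embedding layer is often described as a linear layer applied to one-hot inputs without ever materialising them.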

Using pretrained word2vec model

I am trying to use a pretrained word2vec model to create word embeddings, but I am getting the following error when I try to create a weight matrix from the word2vec gensim model:
Code:
import gensim
import numpy as np
w2v_model = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz", binary=True)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
EMBEDDING_DIM = 300
# Function to create weight matrix from word2vec gensim model
def get_weight_matrix(model, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = model[word]
    return weight_matrix
embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)
I'm getting the following error:
[image: Error]
As a note: it's better to paste the full error as formatted text than as an image of text. (See Why not upload images of code/errors when asking a question? for a full list of the reasons why.)
But regarding your question:
If you get a KeyError: word 'didnt' not in vocabulary error, you can trust that the word you've requested is not in the set of word vectors you've requested it from. (In this case, the GoogleNews vectors that Google trained and released back around 2013.)
You could check before looking it up with 'didnt' in w2v_model, which would return False, and then do something else. Or you could use a Python try: ... except: ... formulation to let the error happen, and then do something else when it does.
But it's up to you what your code should do if the model you've provided doesn't have the word-vectors you were hoping for.
(Note: the GoogleNews vectors do include a vector for "didn't", the contraction with its internal apostrophe. So in this one case, the issue may be that your tokenization is stripping such internal-punctuation-marks from contractions, but Google chose not to when making those vectors. But your code should be ready for handling missing words in any case, unless you're sure through other steps that can never happen.)
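One way to handle missing words is to leave their rows at zero and collect them for inspection. The sketch below uses a plain dict (`toy_vectors`) as a hypothetical stand-in for the loaded KeyedVectors, since both support `in` and `[]`, and a made-up `word_index` in place of tokenizer.word_index; the tiny dimension is for illustration only.

```python
import numpy as np

EMBEDDING_DIM = 4  # tiny, for illustration; the GoogleNews vectors use 300

# Hypothetical stand-in for the loaded KeyedVectors (supports `in` and [])
toy_vectors = {
    "good": np.ones(EMBEDDING_DIM),
    "didn't": np.full(EMBEDDING_DIM, 2.0),
}
# Hypothetical stand-in for tokenizer.word_index
word_index = {"good": 1, "didnt": 2}

def get_weight_matrix(model, vocab, dim):
    """Build the weight matrix, leaving zeros for words the model lacks."""
    weight_matrix = np.zeros((len(vocab) + 1, dim))
    missing = []
    for word, i in vocab.items():
        if word in model:
            weight_matrix[i] = model[word]
        else:
            missing.append(word)
    return weight_matrix, missing

matrix, missing = get_weight_matrix(toy_vectors, word_index, EMBEDDING_DIM)
print(missing)
```

Returning the list of missing words makes it easy to spot systematic tokenization mismatches such as stripped apostrophes. Whether zeros, random vectors, or a shared "unknown" vector is the right fallback depends on the downstream model.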

How to tune weights in Voting Classifier (Sklearn)

I am trying to do the following:
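from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
vc = VotingClassifier(estimators=[('gbc', GradientBoostingClassifier()),
                                  ('rf', RandomForestClassifier()),
                                  ('svc', SVC(probability=True))],
                      voting='soft', n_jobs=-1)
params = {'weights': [[1, 2, 3], [2, 1, 3], [3, 2, 1]]}
grid_Search = GridSearchCV(param_grid=params, estimator=vc)
grid_Search.fit(X_new, y)
print(grid_Search.best_score_)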
Here I want to tune the weights parameter. With GridSearchCV this takes a lot of time, since it needs to refit the model for each iteration, which I guess should not be required. It would be better to use something like the prefit option of SelectFromModel from sklearn.feature_selection.
Is there any other option, or am I misinterpreting something?
The following code (in my repo) would do this.
It contains a class VotingClassifierCV. It first makes cross-validated predictions for all classifiers. Then loops over all weights, choosing the best combination, and using pre-calculated predictions.
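The idea described above can be sketched with cross_val_predict. This is not the repo's VotingClassifierCV class, only a minimal illustration of the same approach; the synthetic dataset, the three fast classifiers, and the weight grid are assumptions chosen so the example runs quickly.

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

estimators = {
    "lr": LogisticRegression(max_iter=1000),
    "nb": GaussianNB(),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Fit each classifier once per fold and keep its out-of-fold probabilities
probas = [
    cross_val_predict(est, X, y, cv=5, method="predict_proba")
    for est in estimators.values()
]

# Scoring a weight vector is now just a weighted average, with no refitting
results = {}
for weights in itertools.product([1, 2, 3], repeat=len(probas)):
    blended = np.average(probas, axis=0, weights=weights)
    results[weights] = accuracy_score(y, blended.argmax(axis=1))

best_weights = max(results, key=results.get)
print(best_weights, results[best_weights])
```

Each classifier is fitted five times in total regardless of how many weight combinations are tried. Because the weights are selected on the same out-of-fold predictions, the best score is somewhat optimistic and is worth confirming on held-out data.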
A more compute-friendly way would be to first tune the parameters of each classifier separately on your training data, then weight each classifier in proportion to your target metric (say accuracy_score) on your validation data.
# parameter tune (GridSearchCV takes the estimator first, then the param grid)
models = {
    'rf': GridSearchCV(RandomForestClassifier(), rf_params).fit(X_train, y_train),
    'svc': GridSearchCV(SVC(), svc_params).fit(X_train, y_train),
}
# relative weights
model_scores = {
    name: sklearn.metrics.accuracy_score(
        y_validate,
        model.predict(X_validate),
        normalize=True
    )
    for name, model in models.items()
}
total_score = sum(model_scores.values())
# combine the parts, passing the tuned estimators so the grid search is not re-run
combined_model = VotingClassifier(
    [(name, model.best_estimator_) for name, model in models.items()],
    weights=[
        model_scores[name] / total_score
        for name in models.keys()
    ]
).fit(X_learn, y_learn)
Finally, you can fit the combined model on your learning (train + validate) data and evaluate it on your test data.

Default value in SVM prediction (scikit-learn)

I am using scikit-learn for SVM classification.
I need a classifier that returns a default value when a given test item doesn't match any of the training-set items, i.e. when the distance is very high. Is that possible?
For example, let's say my training set is
X= [[0.5,0.5,2],[4, 4,16],[16, 16,64]]
and labels
y=[0,1,2]
then I run training
clf = svm.SVC()
clf.fit(X, y)
then I run prediction
clf.predict([[-100,-100,-200]])
As we can see, the test item [-100,-100,-200] is far away from every training item. In this case the prediction yields [2], which corresponds to the item [16, 16,64]. Is there any way to make it return something else (a value not from the training set)?
I think you can create a label for those far-away values and add it to your training set:
X= [[0.5,0.5,2],[4, 4,16],[16, 16,64],[-100,-100,-200]]
y=[0,1,2,100]
and give it a try.
SVM is supervised learning, which means the 'OUTPUT' has to be specified. If you are not certain about the 'OUTPUT', do some unsupervised clustering (k-means, for example) to get a rough idea of how many possible 'OUTPUT' values to expect.
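A different technique from the one suggested above, closer to the "when the distance is very high" wording of the question, is to wrap the classifier and fall back to a default label when a sample is far from every training point. The helper name `predict_with_default`, the threshold of 50, and the default label -1 are all assumptions made for illustration.

```python
import numpy as np
from sklearn import svm

X = np.array([[0.5, 0.5, 2], [4, 4, 16], [16, 16, 64]])
y = np.array([0, 1, 2])

clf = svm.SVC().fit(X, y)

def predict_with_default(clf, X_train, samples, max_distance, default=-1):
    """Return clf's label, or `default` when a sample is far from all training points."""
    samples = np.atleast_2d(samples)
    # Pairwise Euclidean distances, shape (n_samples, n_train)
    dists = np.linalg.norm(samples[:, None, :] - X_train[None, :, :], axis=2)
    labels = clf.predict(samples)
    return np.where(dists.min(axis=1) > max_distance, default, labels)

result = predict_with_default(
    clf, X, [[-100, -100, -200], [4, 4, 16]], max_distance=50.0
)
print(result)
```

The threshold has to be chosen from the data, for instance from the distribution of nearest-neighbour distances within the training set. For larger training sets a dedicated novelty detector such as sklearn.svm.OneClassSVM may be a better fit than brute-force distances.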

How to use hidden layer activations to construct loss function and provide y_true during fitting in Keras?

Assume I have a model like this. M1 and M2 are two layers linking left and right sides of the model.
The example model: Red lines indicate backprop directions
During training, I hope M1 can learn a mapping from L2_left activation to L2_right activation. Similarly, M2 can learn a mapping from L3_right activation to L3_left activation.
The model also needs to learn the relationship between two inputs and the output.
Therefore, I should have three loss functions for M1, M2, and L3_left respectively.
I probably can use:
model.compile(optimizer='rmsprop',
loss={'M1': 'mean_squared_error',
'M2': 'mean_squared_error',
'L3_left': 'mean_squared_error'})
But during training, we need to provide y_true, for example:
model.fit([input_1,input_2], y_true)
In this case, the y_true is the hidden layer activations and not from a dataset.
Is it possible to build this model and train it using its hidden layer activations?
If you have only one output, you must have only one loss function.
If you want three loss functions, you must have three outputs, and, of course, three Y vectors for training.
If you want loss functions in the middle of the model, you must take outputs from those layers.
Creating the graph of your model: (if the model is already defined, see the end of this answer)
#Here, all "SomeLayer(blabla)" could be replaced by a "SomeModel" if necessary
#Example of using a layer or a model:
#M1 = SomeLayer(blablabla)(L12)
#M1 = SomeModel(L12)
from keras.models import Model
from keras.layers import *
inLef = Input((shape1))
inRig = Input((shape2))
L1Lef = SomeLayer(blabla)(inLef)
L2Lef = SomeLayer(blabla)(L1Lef)
M1 = SomeLayer(blablaa)(L2Lef) #this is an output
L1Rig = SomeLayer(balbla)(inRig)
conc2Rig = Concatenate(axis=?)([L1Rig,M1]) #Or Add, or Multiply, however you're joining the models
L2Rig = SomeLayer(nlanlab)(conc2Rig)
L3Rig = SomeLayer(najaljd)(L2Rig)
M2 = SomeLayer(babkaa)(L3Rig) #this is an output
conc3Lef = Concatenate(axis=?)([L2Lef,M2])
L3Lef = SomeLayer(blabla)(conc3Lef) #this is an output
Creating your model with three outputs:
Now you've got your graph ready and you know what the outputs are, you create the model:
model = Model([inLef,inRig], [M1,M2,L3Lef])
model.compile(loss='mse', optimizer='rmsprop')
If you want different losses for each output, then you create a list:
#example of custom loss function, if necessary
def lossM1(yTrue,yPred):
return keras.backend.sum(keras.backend.abs(yTrue-yPred))
#compiling with three different loss functions
model.compile(loss = [lossM1, 'mse','binary_crossentropy'], optimizer =??)
But you've got to have three different yTraining too, for training with:
model.fit([input_1,input_2], [yTrainM1,yTrainM2,y_true], ....)
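For a concrete, runnable version of the three-output setup, here is a minimal sketch assuming TensorFlow's bundled Keras. The layer sizes, input shapes, and zero or random targets are placeholders chosen for illustration, not values from the question.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

in_left = keras.Input(shape=(4,))
in_right = keras.Input(shape=(3,))

# Left branch up to L2, then M1 links it to the right branch
l2_left = layers.Dense(8, activation="relu")(in_left)
m1 = layers.Dense(8, name="M1")(l2_left)              # output 1

l1_right = layers.Dense(8, activation="relu")(in_right)
l2_right = layers.Dense(8, activation="relu")(
    layers.Concatenate()([l1_right, m1])
)
m2 = layers.Dense(8, name="M2")(l2_right)             # output 2

l3_left = layers.Dense(1, name="L3_left")(
    layers.Concatenate()([l2_left, m2])
)                                                     # output 3

model = keras.Model([in_left, in_right], [m1, m2, l3_left])
model.compile(optimizer="rmsprop", loss=["mse", "mse", "mse"])

n = 16
x_left, x_right = np.random.rand(n, 4), np.random.rand(n, 3)
# Placeholder targets: one array per output, shaped like that output
y_m1, y_m2, y_out = np.zeros((n, 8)), np.zeros((n, 8)), np.random.rand(n, 1)

history = model.fit([x_left, x_right], [y_m1, y_m2, y_out], epochs=1, verbose=0)
preds = model.predict([x_left, x_right], verbose=0)
print([p.shape for p in preds])
```

The order of the loss list and of the target list follows the order of the outputs passed to keras.Model, so each output is paired with its own loss and its own target array.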
If your model is already defined and you don't create its graph like I did:
Then you have to find in yourModel.layers[i] which layers are M1 and M2, and create a new model like this:
M1 = yourModel.layers[indexForM1].output
M2 = yourModel.layers[indexForM2].output
newModel = Model([inLef,inRig], [M1,M2,yourModel.output])
If you want two outputs to be equal:
In this case, just subtract the two outputs in a Lambda layer, and make that Lambda layer an output of your model, with expected values = 0.
Using the exact same vars as before, we'll just create two additional layers to subtract outputs:
diffM1L1Rig = Lambda(lambda x: x[0] - x[1])([L1Rig,M1])
diffM2L2Lef = Lambda(lambda x: x[0] - x[1])([L2Lef,M2])
Now your model should be:
newModel = Model([inLef,inRig],[diffM1L1Rig,diffM2L2Lef,L3Lef])
And training will expect those two differences to be zero:
yM1 = np.zeros((shapeOfM1Output))
yM2 = np.zeros((shapeOfM2Output))
newModel.fit([input_1,input_2], [yM1,yM2,y_true], ...)
Trying to answer to the last part: how to make gradients only affect one side of the model.
...well.... at first that sounds unfeasible to me. But, if that is similar to "train only a part of the model", then it's totally ok by defining models that only go to a certain point and making part of the layers untrainable.
By doing that, nothing will affect those layers. If that's what you want, then you can do it:
#using the previous vars to define other models
modelM1 = Model([inLef,inRig],diffM1L1Rig)
This model above ends in diffM1L1Rig. Before compiling, you must set L2Right untrainable:
modelM1.layers[??].trainable = False
#to find which layer is the right one, you may define them using the "name" parameter, or check the shapes, types etc. in modelM1.summary()
modelM1.compile(.....)
modelM1.fit([input_1, input_2], yM1)
This suggestion makes you train only a single part of the model. You can repeat the procedure for M2, locking the layers you need before compiling.
You can also define a full model taking all layers, and lock only the ones you want. But you won't be able (I think) to make half gradients pass by one side and half the gradients pass by the other side.
So I suggest you keep three models, the fullModel, the modelM1, and the modelM2, and you cycle them in training. One epoch each, maybe....
That should be tested....

Resources