pytorch: loading pretrained weights from a json file
I downloaded a json file of word-embeddings. The file contents look like this:
{
"in":[0.052956,0.065460,0.066195,0.047072,0.052221,-0.082009,-0.061415,-0.116210,0.015629,0.099293,-0.085686,-0.028133,0.052221,0.058840,-0.077596,-0.073550,0.033282,0.077228,-0.045785,-0.027214,-0.034201,0.035672,-0.090835,-0.048175,0.001701,0.027949,-0.002195,0.088628,0.046521,0.048175,0.061047,-0.051853,-0.016089,0.041556,-0.064357,0.051853,-0.096351,-0.025007,0.074286,0.132391,0.083480,-0.026110,-0.035488,-0.006390,0.027030,0.077596,0.020318,-0.021605,-0.003861,0.080170,0.045050,0.070976,0.025375,-0.020410,-0.070976,0.000776,-0.036407,0.025926,0.061047,-0.085318,-0.066931,0.027030,-0.109590,-0.183876,-0.046337,0.039901,0.042843,0.135333,0.045969,0.065460,0.093409,-0.030340,0.017009,0.133862,-0.022341,-0.022341,0.088260,0.023444,-0.072447,0.050014,0.003540,-0.060311,0.047440,-0.015538,-0.041188,-0.102235,-0.047808,0.062886,-0.048175,0.016181,0.058105,-0.027949,-0.025375,-0.138275,-0.054795,0.011952,0.070241,-0.046337,-0.010711,-0.002597,0.008366,-0.119152,-0.012871,0.004666,-0.006574,-0.060679,-0.011492,-0.066195,0.002620,-0.012136,-0.009286,0.073550,-0.105177,-0.064724,-0.020226,0.040637,0.100028,0.084951,0.091202,0.064357,-0.005355,0.033649,-0.109590,-0.002413,-0.088628,-0.049279,0.053692,-0.070976,-0.022801,0.090467,0.060311,-0.071344,-0.122094,-0.058473,0.015997,-0.061415,0.002965,-0.118416,-0.073918,0.029972,0.029604,-0.006849,0.077596,0.051117,-0.032178,0.047808,-0.036959,0.015721,-0.125771,0.070241,0.070608,0.005172,0.040453,0.039533,-0.018388,-0.024455,-0.046337,-0.004183,0.072447,0.028501,0.009194,-0.033098,-0.005631,0.079434,0.015354,0.109590,0.061782,0.004344,0.003448,-0.069873,-0.104441,-0.043211,-0.038798,-0.098557,-0.105177,-0.015446,-0.020410,0.024639,0.079067,-0.001758,-0.017009,0.000379,-0.083480,0.063989,-0.097822,-0.013147,-0.000270,0.081273,0.066931,0.033649,0.018939,0.017928,0.061047,0.017836,-0.082744,0.004045,-0.013331,-0.025559,-0.024823,-0.123565,0.072079,-0.013791,0.003999,-0.025926,-0.033282,-0.050014,-0.013515,-0.022341,-0.005723,-0.038614,-0.040820,0.067299,-0.054059,0.011492,-0.062150,-0.023904,0.026846,-0.015997,-0.044682,-0.009837,0.035304,0.017376,0.015813,-0.059208,-0.006068,0.014710,-0.004183,0.031259,0.020962,0.010251,0.026110,-0.137539,0.090467,0.055898,-0.030891,-0.007493,0.032362,-0.005493,0.092673,0.043395,-0.040269,-0.024272,-0.006849,-0.035120,0.033098,-0.038246,0.051853,0.002252,-0.003149,-0.033282,0.055530,-0.009608,0.050750,0.004735,0.056634,-0.028501,0.003678,0.033649,-0.050750,0.007309,0.003563,0.015446,0.053692,0.128713,0.130920,0.041924,0.068770,-0.028133,0.037511,-0.029604,0.033282,0.047072,0.036591,-0.040085,0.036775,-0.098557,-0.021789,-0.027214,-0.045785,-0.043211,0.092673,-0.062150,-0.008964,0.094144,0.001023,0.048175,-0.080170,-0.108119,-0.031811,0.018112,-0.127242,-0.066931,-0.060679,0.048911,0.046153,-0.035672,-0.044314,-0.035856,0.010895,-0.047072],
"for":[-0.008512,-0.034224,0.032284,0.045868,-0.013143,-0.046221,-0.000948,-0.052219,0.046574,0.062451,-0.122785,-0.028756,0.051513,-0.018700,0.013143,0.098792,0.104438,-0.024345,-0.070566,-0.086796,-0.057511,0.045162,-0.048338,0.053630,0.016407,0.024169,-0.130547,0.037576,0.010012,0.067038,0.002536,-0.006571,-0.070213,0.049043,-0.006351,0.031931,-0.096675,-0.071977,0.023992,0.020200,0.112200,-0.012790,0.010320,-0.079387,-0.061745,-0.052924,-0.017818,0.124902,0.044633,0.064568,-0.017553,0.102321,-0.023816,0.019847,-0.112200,0.005689,-0.051160,0.031578,0.004344,-0.040399,-0.106555,0.020552,-0.095970,-0.127724,-0.065979,-0.036694,-0.018788,-0.107260,-0.058217,0.108672,-0.031402,0.057158,0.023992,0.065274,0.016407,-0.045162,0.118551,0.062098,-0.008953,0.141838,-0.044986,0.016230,-0.021787,0.015348,0.002404,-0.040046,-0.052924,0.021523,0.035989,0.012614,0.075506,0.028050,0.061392,-0.179238,0.050102,-0.107966,0.042163,0.069155,-0.024169,0.045515,0.015436,-0.105143,0.038811,-0.065626,-0.018347,0.032813,0.003837,-0.083621,-0.014113,0.087502,0.023287,0.068449,-0.046574,0.016407,0.087149,0.043574,0.087149,0.035283,0.067391,0.048338,0.021170,-0.024698,-0.080445,0.038635,-0.018524,0.012878,0.044986,-0.018700,0.105143,0.045162,0.077975,-0.117845,-0.070566,-0.076564,-0.061745,-0.064215,0.073036,-0.057511,0.006086,0.017377,0.094558,0.037047,0.058923,0.067743,-0.042340,-0.069860,-0.020464,-0.105143,-0.106555,0.105143,-0.012702,0.023816,-0.061745,-0.007939,-0.026815,-0.009879,0.025933,-0.005954,0.036341,-0.068449,0.034577,0.014995,0.022140,0.093853,0.038106,0.013584,-0.012702,0.025227,0.013231,-0.007145,-0.133370,-0.064921,-0.020993,-0.043927,-0.037047,-0.001709,0.047985,-0.059628,-0.028932,0.069507,-0.111494,-0.110789,0.020464,0.009482,0.021611,-0.008777,-0.069860,0.017906,0.139721,0.009394,0.017465,-0.025933,0.071272,-0.069860,-0.144660,-0.009967,0.062098,-0.057864,-0.127724,-0.126313,0.003705,-0.025227,-0.039517,0.067743,-0.067391,-0.008644,-0.000408,0.070566,0.017906,-0.028756,0.007057,0.085385,0.018612,0.088913,0.046574,0.051160,0.021170,-0.035812,-0.056453,0.020905,0.032990,-0.031049,0.018700,-0.037400,0.101615,0.003087,-0.027344,0.019847,0.043398,0.020464,0.020288,-0.026462,0.094558,-0.000070,-0.050102,-0.015966,0.049043,-0.016848,-0.011070,-0.042163,0.044104,0.000466,0.002889,-0.051513,0.066332,0.018965,0.014466,0.025580,-0.041810,-0.021434,0.019758,0.018171,0.043574,0.095264,-0.003153,0.001974,0.043222,0.071272,-0.066332,-0.033166,-0.012614,0.027697,-0.013849,0.033519,0.034577,0.070919,-0.029108,0.068096,-0.025051,-0.030520,0.050807,-0.009879,0.076917,0.011908,0.095264,-0.001224,-0.006130,-0.103026,-0.033695,-0.079387,0.059275,-0.029638,-0.013672,0.063509,-0.002029,0.172181,-0.034048,-0.016583,0.029461,0.021170,-0.016318,0.002690,-0.059628,0.058923,0.005733,0.000345,0.013319,0.051513,-0.025227,0.017465],
"that":[-0.012361,-0.022230,0.065540,0.039477,-0.086620,0.024913,-0.011163,-0.070522,0.092369,0.092752,-0.056341,-0.060557,-0.054042,0.060557,-0.108850,0.005102,0.008624,-0.011881,-0.000755,-0.023763,-0.000124,0.030087,-0.018972,-0.036028,0.074355,-0.043310,-0.050975,0.004791,0.000671,0.048676,-0.042735,0.011067,0.017439,-0.035261,0.087386,-0.030279,0.040244,0.019739,0.013319,0.049442,0.108083,0.106550,0.051359,-0.050592,-0.018876,-0.010492,-0.029129,0.003378,-0.012361,0.014948,0.085087,0.035070,-0.035261,-0.074738,0.068223,0.064390,0.005366,-0.103484,0.002144,-0.059407,0.017631,0.134912,-0.038136,0.030087,-0.069373,-0.013510,0.017152,0.105017,0.008384,0.039094,0.029895,-0.004120,0.048101,-0.039286,-0.083170,0.043693,0.121115,0.134146,0.037752,0.099651,0.064007,-0.079721,0.034495,-0.010636,-0.105017,-0.123414,0.019068,0.164041,-0.080104,-0.073589,0.038136,0.059024,0.002767,-0.096968,-0.018972,-0.001036,0.030087,0.005965,0.013894,0.034303,-0.077038,-0.045610,0.011067,0.032195,-0.027787,-0.018014,-0.102717,-0.113449,0.022709,-0.096202,-0.055958,-0.005605,-0.075888,0.045993,0.081637,0.020697,0.005941,0.028362,0.031620,0.041394,-0.160208,-0.026254,-0.022805,0.024913,-0.096968,-0.052892,0.012456,-0.067839,0.009821,-0.049442,-0.094669,0.018397,-0.103484,-0.092752,-0.009534,-0.086237,0.074738,-0.032962,0.014373,0.040627,0.011738,-0.124947,-0.017056,-0.004024,0.028171,-0.002383,-0.061324,-0.040244,-0.005821,0.068606,-0.018780,0.034686,-0.089303,0.016864,-0.003006,-0.034111,-0.081637,-0.145644,-0.035261,0.035261,-0.034878,0.014948,-0.016481,0.010588,0.011977,-0.023859,0.036603,0.080487,-0.010875,0.006468,-0.041394,0.015427,-0.059791,-0.070522,0.034495,0.006228,0.009917,-0.085087,-0.014564,-0.082021,-0.119581,-0.062090,-0.022613,-0.014660,0.076271,-0.006564,-0.027787,0.005917,0.045610,0.064390,0.022613,0.040052,0.002491,-0.014564,0.011738,-0.057108,-0.026829,0.034495,-0.038327,-0.126480,0.020122,0.028746,-0.000121,-0.000988,-0.031237,-0.025296,-0.012361,0.047718,0.076271,-0.011786,-0.026446,-0.012025,0.003665,0.025871,-0.064390,0.083554,0.121115,0.006899,-0.094285,0.048101,0.045993,0.030470,-0.012552,-0.034495,0.094285,-0.059024,0.098118,0.027596,0.057108,0.068606,0.016577,-0.057874,0.027021,-0.073972,0.009103,-0.044843,-0.061707,0.012552,0.059407,0.023955,0.003617,-0.114216,-0.019451,-0.084704,0.054042,0.045610,0.098118,-0.051359,0.004144,0.009294,0.054808,0.099651,0.051359,-0.013606,0.093519,-0.025488,0.113449,0.060174,-0.025296,-0.051742,0.049442,-0.049059,-0.075505,0.083554,-0.031237,0.091219,-0.007618,-0.027787,-0.051359,0.046184,0.127247,0.040244,0.124947,0.074738,0.059791,-0.072055,0.019739,-0.061707,0.070139,-0.045993,-0.031428,0.036028,0.024338,0.030662,0.027979,-0.083170,-0.029129,-0.126480,0.016768,0.000958,-0.008863,-0.012265,-0.026254,-0.016193,-0.015235,0.050209,0.015810,0.005390,0.047909,-0.116515],
...
I found this function to load pre-trained embeddings into pytorch:
self.embeds = torch.nn.Embedding.from_pretrained(weights)
My question is: how do I load the .json file into the above function? I don't find the documentation helpful. From the docs:
CLASSMETHOD from_pretrained(
embeddings, freeze=True, padding_idx=None, max_norm=None, norm_type=2.0,
scale_grad_by_freq=False, sparse=False
)
embeddings (Tensor) – FloatTensor containing weights for the Embedding.
First dimension is being passed to Embedding as num_embeddings, second as embedding_dim.
How do I convert this json file to a "FloatTensor" in the proper format for this function?
Thanks!
Assuming in_json is the dict you get from parsing the file, you can stack the per-word lists into a single FloatTensor:

weights = torch.stack([torch.Tensor(value) for _, value in in_json.items()], dim=0)
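A fuller sketch of the whole pipeline, assuming the file is saved as embeddings.json (the file name and the word2idx helper are just illustrative, not part of the original question):

import json
import torch

# parse the JSON file into a dict: {word: [float, float, ...]}
with open("embeddings.json") as f:
    in_json = json.load(f)

# keep a word -> row-index mapping so vectors can be looked up later
# (Python 3.7+ dicts preserve insertion order, so this lines up with the rows of `weights`)
word2idx = {word: idx for idx, word in enumerate(in_json)}

# stack each per-word list into one (num_embeddings, embedding_dim) FloatTensor
weights = torch.stack([torch.Tensor(vec) for vec in in_json.values()], dim=0)

embedding = torch.nn.Embedding.from_pretrained(weights)

# look up the vector for "for"
vec = embedding(torch.tensor([word2idx["for"]]))
print(vec.shape)  # (1, embedding_dim)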
Related
Using a pretrained word2vec model
I am trying to use a pretrained word2vec model to create word embeddings, but I am getting the following error when I try to build the weight matrix from the word2vec gensim model. Code:

import gensim
import numpy as np

w2v_model = gensim.models.KeyedVectors.load_word2vec_format("/content/drive/My Drive/GoogleNews-vectors-negative300.bin.gz", binary=True)

vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

EMBEDDING_DIM = 300

# Function to create weight matrix from word2vec gensim model
def get_weight_matrix(model, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = model[word]
    return weight_matrix

embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)

I'm getting the following error (posted as a screenshot): KeyError: "word 'didnt' not in vocabulary"
As a note: it's better to paste a full error as formatted text than as an image of text. (See Why not upload images of code/errors when asking a question? for a full list of the reasons why.) But regarding your question: if you get a KeyError: "word 'didnt' not in vocabulary" error, you can trust that the word you've requested is not in the set of word-vectors you've requested it from. (In this case, the GoogleNews vectors that Google trained and released back around 2013.) You could check before looking it up – 'didnt' in w2v_model would return False – and then do something else. Or you could use a Python try: ... except: ... block to let the lookup happen, and then do something else when the KeyError is raised. But it's up to you what your code should do if the model you've provided doesn't have the word-vectors you were hoping for. (Note: the GoogleNews vectors do include a vector for "didn't", the contraction with its internal apostrophe. So in this one case, the issue may be that your tokenization strips such internal punctuation from contractions, while Google chose not to when making those vectors. But your code should be ready to handle missing words in any case, unless you're sure through other steps that that can never happen.)
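A minimal sketch of both options; the function names and the zero-vector fallback are just one possible choice, not something prescribed by the original answer, and w2v_model is assumed to be the KeyedVectors object loaded as in the question:

import numpy as np

def vector_or_zeros(model, word, dim=300):
    # Option 1: check membership before looking the word up
    if word in model:
        return model[word]
    return np.zeros(dim)  # fallback for out-of-vocabulary words

def vector_or_zeros_try(model, word, dim=300):
    # Option 2: attempt the lookup and handle the KeyError when it happens
    try:
        return model[word]
    except KeyError:
        return np.zeros(dim)

print(vector_or_zeros(w2v_model, "didnt"))   # zeros: not in the GoogleNews vocabulary
print(vector_or_zeros(w2v_model, "didn't"))  # the actual vector for the contraction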
Machine Learning: Predict New Values
I'm really new at ML. I trained a model on my dataset and then saved it with pickle. My training dataset has text and a value. I'm trying to get predictions for my new dataset, which has only text. However, when I try to predict new values with my trained model, I get an error that says ValueError: Number of features of the model must match the input. Model n_features is 17804 and input n_features is 24635. You can check my code below. What do I have to do at this point?

with open('trained.pickle', 'rb') as read_pickle:
    loaded = pickle.load(read_pickle)

dataset2 = pandas.read_csv('/root/Desktop/predict.csv', encoding='cp1252')
X2_train = dataset2['text']
train_tfIdf = vectorizer_tfidf.fit_transform(X2_train.values.astype('U'))

x = loaded.predict(train_tfIdf)
print(x)
fit_transform fits to the data and then transforms it, which you don't want to do at prediction time – that would effectively retrain the tf-idf vocabulary, so the feature count no longer matches the model. So, for the purpose of prediction, I would suggest simply using the transform method of the vectorizer that was fitted on the training data.
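A minimal sketch of the idea, assuming the TfidfVectorizer fitted at training time was pickled alongside the model (the file names are just examples):

import pickle
import pandas

# load both the trained model and the vectorizer that was fitted on the training data
with open('trained.pickle', 'rb') as f:
    loaded = pickle.load(f)
with open('vectorizer.pickle', 'rb') as f:
    vectorizer_tfidf = pickle.load(f)

dataset2 = pandas.read_csv('/root/Desktop/predict.csv', encoding='cp1252')
X2_new = dataset2['text']

# transform only: reuse the vocabulary learned at training time,
# so the number of features matches what the model expects
new_tfIdf = vectorizer_tfidf.transform(X2_new.values.astype('U'))
x = loaded.predict(new_tfIdf)
print(x)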
How to learn embeddings in PyTorch and retrieve them later
I am building a recommendation system where I predict the best item for each user given their purchase history. I have userIDs, itemIDs, and how much of each itemID was purchased by each userID. I have millions of users and thousands of products, and not all products have been purchased yet (some products have never been bought by anyone). Since the numbers of users and items are large, I don't want to use one-hot vectors. I am using PyTorch and I want to create and train the embeddings so that I can make predictions for each user-item pair. I followed this tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html. If it's an accurate assumption that the embedding layer is being trained, do I retrieve the learned weights through the model.parameters() method, or should I use embedding.weight.data?
model.parameters() returns all the parameters of your model, including the embeddings. All of these parameters are handed over to the optimizer (line below) and will be trained when calling optimizer.step() – so yes, your embeddings are trained along with all the other parameters of the network. (You can also freeze certain layers, e.g. by setting embedding.weight.requires_grad = False, but that is not the case here.)

# summing it up:
# this line specifies which parameters are trained with the optimizer
# model.parameters() just returns all parameters
# embedding class weights are also parameters and will thus be trained
optimizer = optim.SGD(model.parameters(), lr=0.001)

You can see that your embedding weights are also of type Parameter by doing so:

import torch
embedding_matrix = torch.nn.Embedding(10, 10)
print(type(embedding_matrix.weight))

This will output the type of the weights, which is Parameter:

<class 'torch.nn.parameter.Parameter'>

I'm not entirely sure what you mean by retrieve. Do you mean getting a single vector, or do you want the whole matrix so you can save it, or something else?

embedding_matrix = torch.nn.Embedding(5, 5)

# this will get you a single embedding vector
print('Getting a single vector:\n', embedding_matrix(torch.LongTensor([0])))

# of course you can do the same for a sequence
print('Getting vectors for a sequence:\n', embedding_matrix(torch.LongTensor([1, 2, 3])))

# this will give you the whole embedding matrix
print('Getting weights:\n', embedding_matrix.weight.data)

Output:

Getting a single vector:
 tensor([[-0.0144, -0.6245,  1.3611, -1.0753,  0.5020]], grad_fn=<EmbeddingBackward>)
Getting vectors for a sequence:
 tensor([[ 0.9277, -0.1879, -1.4999,  0.2895,  0.8367],
        [-0.1167, -2.2139,  1.6918, -0.3483,  0.3508],
        [ 2.3763, -1.3408, -0.9531,  2.2081, -1.5502]], grad_fn=<EmbeddingBackward>)
Getting weights:
 tensor([[-0.0144, -0.6245,  1.3611, -1.0753,  0.5020],
        [ 0.9277, -0.1879, -1.4999,  0.2895,  0.8367],
        [-0.1167, -2.2139,  1.6918, -0.3483,  0.3508],
        [ 2.3763, -1.3408, -0.9531,  2.2081, -1.5502],
        [-0.5829, -0.1918, -0.8079,  0.6922, -0.2627]])

I hope this answers your question. You can also take a look at the documentation, where you can find some useful examples as well: https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
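If "retrieve it later" also means persisting the trained embeddings to disk and loading them back in another script, a minimal sketch using state_dict (the file name and the standalone layer are just for illustration):

import torch

embedding = torch.nn.Embedding(5, 5)
# ... training happens here ...

# save only the embedding weights
torch.save(embedding.state_dict(), 'embedding.pt')

# later / elsewhere: recreate a layer with the same shape and load the weights
restored = torch.nn.Embedding(5, 5)
restored.load_state_dict(torch.load('embedding.pt'))
print(restored.weight.data)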
How to convert a word to a vector using an embedding layer in Keras
I have a word embedding file as shown below (click here to see the complete file on GitHub). I would like to know the procedure for generating word embeddings, so that I can generate word embeddings for my personal dataset:

in -0.051625 -0.063918 -0.132715 -0.122302 -0.265347
to 0.052796 0.076153 0.014475 0.096910 -0.045046
for 0.051237 -0.102637 0.049363 0.096058 -0.010658
of 0.073245 -0.061590 -0.079189 -0.095731 -0.026899
the -0.063727 -0.070157 -0.014622 -0.022271 -0.078383
on -0.035222 0.008236 -0.044824 0.075308 0.076621
and 0.038209 0.012271 0.063058 0.042883 -0.124830
a -0.060385 -0.018999 -0.034195 -0.086732 -0.025636
The 0.007047 -0.091152 -0.042944 -0.068369 -0.072737
after -0.015879 0.062852 0.015722 0.061325 -0.099242
as 0.009263 0.037517 0.028697 -0.010072 -0.013621
Google -0.028538 0.055254 -0.005006 -0.052552 -0.045671
New 0.002533 0.063183 0.070852 0.042174 0.077393
with 0.087201 -0.038249 -0.041059 0.086816 0.068579
at 0.082778 0.043505 -0.087001 0.044570 0.037580
over 0.022163 -0.033666 0.039190 0.053745 -0.035787
new 0.043216 0.015423 -0.062604 0.080569 -0.048067
I was able to convert each word in a dictionary to the above format by following these steps:

- initially, represent each word in the dictionary by a unique integer
- take each integer one by one, wrap it as np.array([[integer]]), and feed it as the input array in the code below
- the word corresponding to the integer and the respective output vector can then be stored to a JSON file (I used output_array.tolist() for storing the vector in JSON format)

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(dictionary_size_here, sizeof_embedding_vector, input_length=input_length_here))
model.compile('rmsprop', 'mse')

input_array = np.array([[integer]])  # each integer is fed one by one using a loop
output_array = model.predict(input_array)

Reference: How does Keras 'Embedding' layer work?
It is important to understand that there are multiple ways to generate an embedding for words. The popular word2vec, for example, can generate word embeddings using CBOW or skip-grams, so there are several possible "procedures" for generating word embeddings. One of the easier methods to understand (albeit with its drawbacks) is Singular Value Decomposition (SVD). The steps, briefly, are:

1. Create a term-document matrix, i.e. terms as rows and the documents they appear in as columns.
2. Perform SVD.
3. Truncate the output vector for each term to n dimensions. In your example above, n = 5.

A rough sketch of these steps is given below. For a more detailed description of generating an embedding with word2vec's skip-gram model, have a look at Word2Vec Tutorial - The Skip-Gram Model. For more information on SVD, you can look at this and this.
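A minimal sketch of the SVD approach with NumPy on a toy corpus; the toy documents are invented for illustration, and n = 5 only mirrors the dimensionality in the question's file:

import numpy as np

# toy corpus: each string is one "document"
docs = [
    "the new google office in new york",
    "google released a new model after the event",
    "the office opened after the event in new york",
    "a new office for google",
    "the event at the new office",
    "google and the new office over new york",
]
vocab = sorted({w for d in docs for w in d.split()})

# term-document matrix: rows are terms, columns are documents (raw counts)
td = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        td[vocab.index(w), j] += 1

# SVD of the term-document matrix
U, S, Vt = np.linalg.svd(td, full_matrices=False)

# truncate each term's vector to n dimensions (scaled by the singular values)
n = 5
embeddings = U[:, :n] * S[:n]
for word, vec in zip(vocab, embeddings):
    print(word, vec)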
keras: how to predict classes in order?
I'm trying to predict image classes in Keras (binary classification). The model accuracy is fine, but it seems that ImageDataGenerator shuffles the input images, so I was not able to match the predicted classes with the original images.

datagen = ImageDataGenerator(rescale=1./255)

generator = datagen.flow_from_directory(
    pred_data_dir,
    target_size=(img_width, img_height),
    batch_size=32,
    class_mode=None,
    shuffle=False,
    save_to_dir='images/aug'.format(feature))

print model.predict_generator(generator, nb_input)

For example, if I have a1.jpg, a2.jpg, ..., a9.jpg under pred_data_dir, I expect to get an array like

[class for a1.jpg, class for a2.jpg, ... class for a9.jpg]

from model.predict_generator(), but actually I got something like

[class for a3.jpg, class for a8.jpg, ... class for a2.jpg]

How can I resolve the issue?
Look at the source code of flow_from_directory. In my case, I had to rename all the images: they were named 1.jpg .. 1000.jpg, but to come out in order they had to be named 0001.jpg .. 1000.jpg. The sorting is important here: flow_from_directory uses sorted(os.listdir(directory)), and that lexicographic sorting is not always intuitive, as the short example below shows.
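A quick illustration of how Python's plain string sort orders such names unless they are zero-padded:

filenames = ['1.jpg', '2.jpg', '10.jpg', '100.jpg']
print(sorted(filenames))
# ['1.jpg', '10.jpg', '100.jpg', '2.jpg']  -> not numeric order

padded = ['0001.jpg', '0002.jpg', '0010.jpg', '0100.jpg']
print(sorted(padded))
# ['0001.jpg', '0002.jpg', '0010.jpg', '0100.jpg']  -> numeric order preserved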
The flow_from_directory() method returns a DirectoryIterator object with a filenames member that lists all the files. Since that member is used for subsequent batch generation and iteration, you should be able to use it to match your filenames to predictions. For your example, generator.filenames should give you a parallel list like ['a3.jpg', 'a8.jpg', ..., 'a2.jpg'].
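A minimal sketch of pairing the two, assuming generator, model, and nb_input are set up as in the question (with shuffle=False so the order stays stable):

predictions = model.predict_generator(generator, nb_input)

# generator.filenames is in the same order the batches were generated in,
# so zipping it with the predictions lines each class up with its file
for filename, pred in zip(generator.filenames, predictions):
    print(filename, pred)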