I am saving my pretrained Doc2Vec model with the code below:
model.train(labeled_data,total_examples=model.corpus_count, epochs=model.epochs)
print("Model Training Done")
#Saving the created model
model.save(project_name + '_doc2vec_vectorizer.npz')
vectorizer=CountVectorizer()
vectorizer.fit(df[0])
vec_file = project_name + '_doc2vec_vectorizer.npz'
pickle.dump(vectorizer, open(vec_file, 'wb'))
vdb = db['vectorizers']
Then, in another function, I load the Doc2Vec model with:
loaded_vectorizer = pickle.load(open(vectorizer, 'rb'))
and then I get the error CountVectorizer has no attribute _load_specials on the line below, i.e. model2:
model2= gensim.models.doc2vec.Doc2Vec.load(vectorizer)
The gensim version I am using is 3.8.3, since I rely on the LabeledSentence class.
The .load() method on Gensim model classes should only be used with objects of exactly that same class that were saved to file(s) using the Gensim .save() method.
Your code shows you trying to use Doc2Vec.load() with the vectorizer object itself (not a file path to the previously-saved model), so the error is to be expected.
If you actually want to pickle-save & then pickle-load the vectorizer object, be sure to:
use a different file path than you did for the model, or you'll overwrite the model file!
use pickle methods (not Gensim methods) to re-load anything that was pickle-saved; see the sketch below
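For example, a minimal sketch that keeps the two artifacts on separate paths (the file names here are just illustrative):

import pickle
from gensim.models.doc2vec import Doc2Vec

# Each object gets its own path and the save method that matches its type
model.save(project_name + '_doc2vec.model')                            # Gensim native save
pickle.dump(vectorizer, open(project_name + '_vectorizer.pkl', 'wb'))  # plain pickle

# Re-load each one with the method that saved it
model2 = Doc2Vec.load(project_name + '_doc2vec.model')
loaded_vectorizer = pickle.load(open(project_name + '_vectorizer.pkl', 'rb'))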
I have a Python script that trains and then tests a CNN model. The model weights/parameters are saved after testing with:
checkpoint = {'state_dict': model.state_dict(),'optimizer' :optimizer.state_dict()}
torch.save(checkpoint, path + filename)
After saving, I immediately re-create the model through a function:
model_load = create_model(cnn_type="vgg", numberofclasses=len(cases))
And then, I load the model weights/parameters through:
model_load.load_state_dict(torch.load(filePath+filename), strict = False)
model_load.eval()
Finally, I feed this model the same testing data I used before the model was saved.
The problem is that the testing results are not the same when I compare the results of the model before saving and after loading. My hunch is that, due to strict=False, some of the parameters are not being passed through to the model. However, when I set strict=True, I receive errors. Is there a workaround for this?
The error message is:
RuntimeError: Error(s) in loading state_dict for CNN:
Missing key(s) in state_dict: "linear.weight", "linear.bias", "linear2.weight", "linear2.bias", "linear3.weight", "linear3.bias".
Unexpected key(s) in state_dict: "state_dict", "optimizer".
You are loading a dictionary containing the state of your model as well as the optimizer's state. According to your error stack trace, the following should solve the issue:
>>> model_state = torch.load(filePath+filename)['state_dict']
>>> model_load.load_state_dict(model_state, strict=True)
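If you also want to resume training later, the optimizer state stored in the same checkpoint can be restored the same way (a sketch, assuming the checkpoint layout from the question):
>>> checkpoint = torch.load(filePath+filename)
>>> model_load.load_state_dict(checkpoint['state_dict'], strict=True)
>>> optimizer.load_state_dict(checkpoint['optimizer'])
>>> model_load.eval()  # disable dropout/batch-norm updates before testing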
This error occurs in the max-pooling stage while I train my CNN model:
Error: AttributeError: 'NoneType' object has no attribute 'current'. Please help.
model = model.add(MaxPooling2D(pool_size=(2,2),input_shape=(48,48,1)))
The question is missing some info, but I think I can see what's going on.
Assuming that model was at some point a tf.keras.models.Sequential(), I guess you did something like:
model = models.Sequential()
model = model.add(...)
model = model.add(MaxPooling2D(pool_size=(2,2),input_shape=(48,48,1)))
However, that's not quite how model.add(...) works. Instead of returning a new model, it modifies the existing model in place and returns None, so after the first reassignment model is None and the next .add() call fails with the NoneType error.
Instead you should do something like:
model = models.Sequential() # create a first model
model.add(...) # add things to the existing model
model.add(MaxPooling2D(pool_size=(2,2),input_shape=(48,48,1)))
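A quick way to see this for yourself (a minimal sketch, assuming tf.keras):

from tensorflow.keras import models, layers

model = models.Sequential()
result = model.add(layers.MaxPooling2D(pool_size=(2, 2), input_shape=(48, 48, 1)))
print(result)  # prints None: add() mutates the model and returns nothing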
I am trying to run the Hugging Face gpt2-xl model. I ran code from the quickstart page that loads the small gpt2 model and generates text with the following code:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
generated = tokenizer.encode("The Manhattan bridge")
context = torch.tensor([generated])
past = None
for i in range(100):
    print(i)
    output, past = model(context, past=past)
    token = torch.argmax(output[0, :])
    generated += [token.tolist()]
    context = token.unsqueeze(0)
sequence = tokenizer.decode(generated)
print(sequence)
This runs perfectly. Then I tried to run the gpt2-xl model. I changed the tokenizer and model loading code as follows:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained('gpt2-xl')
The tokenizer and model loaded perfectly, but I am getting an error on the following line:
output, past = model(context, past=past)
The error is:
RuntimeError: index out of range: Tried to access index 204483 out of table with 50256 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
Looking at the error, it seems that the embedding size is not correct. So I wrote the following line to fetch the config of gpt2-xl specifically:
config = GPT2Config.from_pretrained("gpt2-xl")
But here vocab_size is 50257.
So I explicitly changed the value with:
config.vocab_size=204483
Then after printing the config, I can see that the previous line took effect in the configuration. But still, I am getting the same error.
This was actually an issue I reported and they fixed it.
https://github.com/huggingface/transformers/issues/2774
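If upgrading to a release containing that fix is not an option, one workaround sketch (assuming the transformers 2.x call style from the question) is to skip the past cache and feed the whole sequence each step, taking the logits of the last position only; it is slower but sidesteps the past handling entirely:

generated = tokenizer.encode("The Manhattan bridge")
with torch.no_grad():
    for _ in range(100):
        context = torch.tensor([generated])
        output = model(context)[0]               # logits of shape (1, seq_len, vocab)
        token = torch.argmax(output[0, -1, :])   # next token from the last position
        generated.append(token.item())
print(tokenizer.decode(generated))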
Trying to run the test after I load the model:
net = net.load_state_dict(torch.load(PATH))
net.eval()
but this spits out the error:
net.eval() AttributeError: '_IncompatibleKeys' object has no attribute 'eval'
Here you don't need to assign the result of net.load_state_dict() back to net. load_state_dict() loads the weights in place and returns an _IncompatibleKeys report object, not the model, which is why the subsequent net.eval() fails. Instead of
net = net.load_state_dict(torch.load(PATH))
just use:
net.load_state_dict(torch.load(PATH))
net.eval()
For more, see Recommended approach for saving a model.
I was playing around with the save and load functions of pyspark.ml.classification models. I created an instance of a RandomForestClassifier, set values to a couple of parameters and called the save method of the classifier. It saves successfully. No issues there.
from pyspark.ml.classification import RandomForestClassifier
# save
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
Then I tried loading it back, but I noticed that its parameters don't have the values I had set before saving. Below is the code I was trying:
# load
rf2 = RandomForestClassifier()
rf2.load('rf_test')
print(rf2.getImpurity()) # returns gini
print(rf2.getPredictionCol()) # returns prediction
I guess there's a difference in my understanding of how this code should work and how it actually works.
What should I do to get back the object the way I had saved it?
EDIT
I tried the approach mentioned here, but it didn't work. This is what I tried:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
rf2 = RandomForestClassifier
rf2.load('rf_test')
print(rf2.getImpurity())
which returned the following
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: getImpurity() missing 1 required positional argument: 'self'
That's not how you should use the load method. It is a classmethod and should be called on the class object, not an instance; it returns a new object:
rf2 = RandomForestClassifier.load('rf_test')
rf2.getImpurity()
Technically speaking, calling it on an instance would work as well, but it doesn't modify the caller; it returns a new, independent object:
rf2 = RandomForestClassifier().load('rf_test')
In practice, though, such a construct should be avoided.
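To confirm the round trip end to end, a short sketch (the path is just an example):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')

rf2 = RandomForestClassifier.load('rf_test')
print(rf2.getImpurity())       # entropy
print(rf2.getPredictionCol())  # predme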