PySpark issues with loading unfit model object - apache-spark

I was playing around with the save and load functions of pyspark.ml.classification models. I created an instance of RandomForestClassifier, set a couple of its parameters, and called the classifier's save method. It saves successfully; no issues there.
from pyspark.ml.classification import RandomForestClassifier
# save
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
Then I tried loading it back, but I noticed that its parameters don't have the values I had set before saving. Below is the code I tried:
# load
rf2 = RandomForestClassifier()
rf2.load('rf_test')
print(rf2.getImpurity()) # returns gini
print(rf2.getPredictionCol()) # returns prediction
I guess there's a gap between how I think this code should work and how it actually works.
What should I do to get back the object the way I had saved it?
EDIT
I tried the approach mentioned here, but that didn't work. This is what I tried:
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier()
rf.setImpurity('entropy')
rf.setPredictionCol('predme')
rf.write().overwrite().save('rf_test')
rf2 = RandomForestClassifier
rf2.load('rf_test')
print(rf2.getImpurity())
which returned the following
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: getImpurity() missing 1 required positional argument: 'self'

That's not how you should use the load method. It is a classmethod and should be called on the class, not an instance, to return a new object:
rf2 = RandomForestClassifier.load('rf_test')
rf2.getImpurity()
Technically speaking, calling it on an instance works as well, but it doesn't modify the caller; it returns a new, independent object:
rf2 = RandomForestClassifier().load('rf_test')
In practice, though, such a construct should be avoided.
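Putting it together, a minimal round-trip sketch (assuming an active SparkSession and a writable path 'rf_test'):
from pyspark.ml.classification import RandomForestClassifier
# save an unfit estimator with non-default params
rf = RandomForestClassifier(impurity='entropy', predictionCol='predme')
rf.write().overwrite().save('rf_test')
# load it back via the classmethod; the params survive the round trip
rf2 = RandomForestClassifier.load('rf_test')
print(rf2.getImpurity())       # entropy
print(rf2.getPredictionCol())  # predme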

Related

AttributeError: 'CountVectorizer' object has no attribute '_load_specials'

I am saving my pretrained Doc2Vec model using the commands below:
model.train(labeled_data, total_examples=model.corpus_count, epochs=model.epochs)
print("Model Training Done")
#Saving the created model
model.save(project_name + '_doc2vec_vectorizer.npz')
vectorizer=CountVectorizer()
vectorizer.fit(df[0])
vec_file = project_name + '_doc2vec_vectorizer.npz'
pickle.dump(vectorizer, open(vec_file, 'wb'))
vdb = db['vectorizers']
Then, in another function, I load the Doc2Vec model using the command below:
loaded_vectorizer = pickle.load(open(vectorizer, 'rb'))
and I get the error CountVectorizer has no attribute _load_specials on the line below, i.e. model2:
model2= gensim.models.doc2vec.Doc2Vec.load(vectorizer)
The gensim version I am using is 3.8.3, since I rely on the LabeledSentence class.
The .load() method on Gensim model classes should only be used with objects of exactly that same class that were saved to file(s) using the Gensim .save() method.
Your code shows you trying to use Doc2Vec.load() with the vectorizer object itself (not a file path to the previously-saved model), so the error is to be expected.
If you actually want to pickle-save and then pickle-load the vectorizer object, be sure to:
- use a different file path than you did for the model, or you'll overwrite the model file!
- use pickle methods (not Gensim methods) to re-load anything that was pickle-saved (a sketch follows below)
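For example, a minimal sketch reusing the question's model, vectorizer, and project_name (the two file paths here are hypothetical, chosen to be distinct so the pickle dump cannot clobber the model file):
import pickle
from gensim.models.doc2vec import Doc2Vec
model_path = project_name + '_doc2vec.model'       # gensim model file
vec_path = project_name + '_count_vectorizer.pkl'  # separate pickle file
model.save(model_path)                             # Gensim save for the Gensim model
with open(vec_path, 'wb') as f:
    pickle.dump(vectorizer, f)                     # pickle for the sklearn object
model2 = Doc2Vec.load(model_path)                  # Gensim load, given the file *path*
with open(vec_path, 'rb') as f:
    loaded_vectorizer = pickle.load(f)             # pickle load for the pickled object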

Loaded PyTorch model has a different result compared to saved model

I have a python script that trains and then tests a CNN model. The model weights/parameters are saved after testing through the use of:
checkpoint = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict()}
torch.save(checkpoint, path + filename)
After saving, I immediately load the model using a function:
model_load = create_model(cnn_type="vgg", numberofclasses=len(cases))
And then, I load the model weights/parameters through:
model_load.load_state_dict(torch.load(filePath+filename), strict = False)
model_load.eval()
Finally, I feed this model the same testing data I used before the model was saved.
The problem is that the testing results are not the same when I compare the results before saving and after loading. My hunch is that, due to strict=False, some of the parameters are not being passed to the model. However, when I set strict=True, I receive errors. Is there a workaround for this?
The error message is:
RuntimeError: Error(s) in loading state_dict for CNN:
Missing key(s) in state_dict: "linear.weight", "linear.bias", "linear2.weight", "linear2.bias", "linear3.weight", "linear3.bias". Unexpected key(s) in state_dict: "state_dict", "optimizer".
You are loading a dictionary containing the state of your model as well as the optimizer's state. According to your error stack trace, the following should solve the issue:
>>> model_state = torch.load(filePath+filename)['state_dict']
>>> model_load.load_state_dict(model_state, strict=True)
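The full round trip then looks like this (a sketch reusing model, optimizer, model_load, and the paths from the question's code):
import torch
# save: bundle both states under separate keys
checkpoint = {'state_dict': model.state_dict(),
              'optimizer': optimizer.state_dict()}
torch.save(checkpoint, path + filename)
# load: unpack the nested dict before calling load_state_dict,
# so strict=True can verify that every key matches
checkpoint = torch.load(path + filename)
model_load.load_state_dict(checkpoint['state_dict'], strict=True)
model_load.eval()  # disable dropout / use running batch-norm stats for testing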

Can't understand why "frozen" random variable isn't working right from scipy.stats in Python

I've used frozen random variables (RVs) from scipy.stats in Python. For reasons I can't understand, I get different behavior between a script and an interactive session:
from scipy.stats import norm, lognorm
import math
RV = lognorm(s=.8325546, scale=math.exp(-.34657359)) # frozen RV with many attributes
print("\ntrial of lognorm: ")
print(" " + str(lnRV(2)))
fails, saying:
TypeError: 'rv_frozen' object is not callable
Oddly, I can get this to work in an interactive session, for both the normal and lognormal distributions.
Any idea what's going on here?
Stupid on my part... I define the instance as RV, but then try to call it as lnRV.
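For reference, a sketch of the working pattern: keep a single name for the frozen RV, and evaluate it through its methods (an rv_frozen object is not itself callable):
import math
from scipy.stats import lognorm
# freeze the distribution once, under one name
RV = lognorm(s=.8325546, scale=math.exp(-.34657359))
print(RV.pdf(2))  # probability density at 2
print(RV.cdf(2))  # cumulative probability at 2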

Can't use collections.defaultdict() in google-app-engine

Trying to use collections.defaultdict() to create a histogram in google-app-engine:
class myDS(ndb.Model):
    values = ndb.PickleProperty()
    hist = ndb.PickleProperty()

class Handler:
    my_ds = myDS()
    my_ds.values = {}
    my_ds.hist = defaultdict(lambda: 0)
And got the error (from log)
File "/base/alloc/tmpfs/dynamic_runtimes/python27/277b61042b697c7a_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/model.py", line 1331, in call
newvalue = method(self, value)
File "/base/alloc/tmpfs/dynamic_runtimes/python27/277b61042b697c7a_unzipped/python27_lib/versions/1/google/appengine/ext/ndb/model.py", line 1862, in _to_base_type
return pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Any way to solve this?
A PickleProperty field requires a value that is serializable using Python's pickle protocol (see the docs for more info):
PickleProperty: Value is a Python object (such as a list or a dict or a string) that is serializable using Python's pickle protocol; Cloud Datastore stores the pickle serialization as a blob. Unindexed by default. Optional keyword argument: compressed.
See also this answer from Martijn Pieters:
Pickle cannot handle lambdas; pickle only ever handles data, not code, and lambdas contain code. Functions can be pickled, but just like class definitions only if the function can be imported. A function defined at the module level can be imported. Pickle just stores a string in that case, the full 'path' of the function to be imported and referenced when unpickling again.
There are multiple options for working with default values, depending on your use case.
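For example, one workaround (a sketch, independent of App Engine) is to replace the lambda with the importable builtin int, which produces the same default of 0 but pickles cleanly:
import pickle
from collections import defaultdict
hist = defaultdict(int)  # int() == 0, same default as lambda: 0
hist['bucket'] += 1
blob = pickle.dumps(hist, pickle.HIGHEST_PROTOCOL)  # works: int is importable by name
restored = pickle.loads(blob)
print(restored['bucket'])  # 1
Alternatively, convert the defaultdict to a plain dict just before assigning it to the PickleProperty.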

Revit API & Dynamo, Creating a Family Parameter from Project Document

I'm trying to create a new family parameter by calling a family's document in a project document and using the FamilyManager method to edit the family. There have been about 10 people asking for this on the Dynamo forums, so I figured I'd give it a shot. Here's my Python script below:
import clr
clr.AddReference('ProtoGeometry')
from Autodesk.DesignScript.Geometry import *
clr.AddReference("RevitServices")
from RevitServices.Persistence import DocumentManager
from RevitServices.Transactions import TransactionManager
clr.AddReference("RevitAPI")
from Autodesk.Revit.DB import *
#The inputs to this node will be stored as a list in the IN variables.
familyInput = UnwrapElement(IN[0])
familySymbol = familyInput.Symbol.Family
doc = familySymbol.Document
par_name = IN[1]
par_type = ParameterType.Text
par_grp = BuiltInParameterGroup.PG_DATA
TransactionManager.Instance.EnsureInTransaction(doc)
familyDoc = doc.EditFamily(familySymbol)
OUT = familyDoc.FamilyManager.AddParameter(par_name, par_grp, par_type, False)
TransactionManager.Instance.TransactionTaskDone()
When I run the script, I get this error:
Warning: IronPythonEvaluator.EvaluateIronPythonScript operation failed.
Traceback (most recent call last):
File "<string>", line 26, in <module>
Exception: The document is currently modifiable! Close the transaction before calling EditFamily.
I'm assuming this error occurs because I open a family document that already exists through the script and never send the information back to the project document, or something similar. Any tips on how to get around this?
Building on our discussion from the forum:
import clr
clr.AddReference("RevitServices")
from RevitServices.Persistence import DocumentManager
from RevitServices.Transactions import TransactionManager
doc = DocumentManager.Instance.CurrentDBDocument
clr.AddReference("RevitAPI")
from Autodesk.Revit.DB import *
par_name = IN[0]
exec("par_type = ParameterType.%s" % IN[1])
exec("par_grp = BuiltInParameterGroup.%s" % IN[2])
inst_or_typ = IN[3]
families = UnwrapElement(IN[4])
# class for overwriting loaded families in the project
class FamOpt1(IFamilyLoadOptions):
    def __init__(self): pass
    def OnFamilyFound(self, familyInUse, overwriteParameterValues): return True
    def OnSharedFamilyFound(self, familyInUse, source, overwriteParameterValues): return True
trans1 = TransactionManager.Instance
trans1.ForceCloseTransaction() #just to make sure everything is closed down
# Dynamo's transaction handling is pretty poor for
# multiple documents, so we'll need to force close
# every single transaction we open
result = []
for f1 in families:
    famdoc = doc.EditFamily(f1)
    try:  # this might fail if the parameter exists or for some other reason
        trans1.EnsureInTransaction(famdoc)
        famdoc.FamilyManager.AddParameter(par_name, par_grp, par_type, inst_or_typ)
        trans1.ForceCloseTransaction()
        famdoc.LoadFamily(doc, FamOpt1())
        result.append(True)
    except:  # you might want to import traceback for a more detailed error report
        result.append(False)
    trans1.ForceCloseTransaction()
    famdoc.Close(False)
OUT = result
[image of the Dynamo graph]
The error message is already telling you exactly what the problem is: "The document is currently modifiable! Close the transaction before calling EditFamily".
I assume that TransactionManager.Instance.EnsureInTransaction opens a transaction on the given document. You cannot call EditFamily with an open transaction.
That is clearly documented in the help file:
http://thebuildingcoder.typepad.com/blog/2012/05/edit-family-requires-no-transaction.html
Close the transaction before calling EditFamily, or, in this case, don't open it at all to start with.
Oh, and then, of course, you wish to modify the family document. That will indeed require a transaction, but on the family document 'familyDoc', NOT on the project document 'doc'.
I don't know whether this will be the final solution, but it might help:
familyDoc = doc.EditFamily(familySymbol)
TransactionManager.Instance.EnsureInTransaction(familyDoc)
OUT = familyDoc.FamilyManager.AddParameter(par_name, par_grp, par_type, False)
TransactionManager.Instance.TransactionTaskDone()