Can't pickle/dill SwigPyObject when serializing dict imported by importlib - python-3.x

I am trying to serialize (with dill) a list of dill-able objects that is nested inside a dict. The dict itself is imported into my main script using importlib. Calling dill.dump() raises TypeError: can't pickle SwigPyObject objects. Here is some code that reproduces the error.
some_config.py, located at config/some_config.py:
from tensorflow.keras.optimizers import SGD
from app.feature_building import Feature

config = {
    "optimizer": SGD(lr=0.001),
    "features": [
        Feature('method', lambda v: v + 1)
    ],
}
Here is the code which imports the config and tries to dill config["features"]:
import dill
import importlib.util

from config.some_config import config

spec = importlib.util.spec_from_file_location("undillable.config", "config/some_config.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
undillable_config = module.config

# Works perfectly fine
with open("dillable_config.pkl", "wb") as f:
    dill.dump(config["features"], f)

# Raises TypeError: can't pickle SwigPyObject objects
with open("undillable_config.pkl", "wb") as f:
    dill.dump(undillable_config["features"], f)
Now the part that made me wonder: when the config dict is imported with importlib, the error is raised, and after some debugging I found that dill tries to serialize not only config["features"] but also config["optimizer"]. With a normal import, however, it works and only config["features"] is dilled.
So my question is: why does dill try to serialize the whole dict instead of only the feature list when the dict is imported via importlib, and how can this error be fixed?

After reading the answer to this question I managed to get it working by avoiding importlib and importing the config with __import__ instead.
import os
import sys

import dill

filename = "config/some_config.py"
dir_name = os.path.dirname(filename)
if dir_name not in sys.path:
    sys.path.append(dir_name)
file = os.path.splitext(os.path.basename(filename))[0]
config_module = __import__(file)

# Works perfectly fine now
with open("dillable_config.pkl", "wb") as f:
    dill.dump(config_module.config["features"], f)
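A likely explanation, plus an alternative sketch (my assumption, worth verifying against dill's internals): a module loaded through importlib.util under a name like "undillable.config" is never registered in sys.modules, so dill cannot resolve the lambda's defining module by name and falls back to serializing the function by value, dragging the module's globals, including the SGD optimizer and its SWIG-wrapped internals, along with it. Registering the module before executing it should let dill pickle by reference again:

import importlib.util
import sys

import dill

spec = importlib.util.spec_from_file_location("some_config", "config/some_config.py")
module = importlib.util.module_from_spec(spec)
# Register the module so dill can resolve the lambda's module by name
# and pickle it by reference instead of by value.
sys.modules[spec.name] = module
spec.loader.exec_module(module)

with open("dillable_config.pkl", "wb") as f:
    dill.dump(module.config["features"], f)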

Related

Gensim: Not able to load the id2word file

I am working on topic inference on a new corpus given a previously derived LDA model. I can load the model perfectly, but I cannot load the id2word file to create the corpora.Dictionary object needed to map the new corpus into numbers: the load method raises an AttributeError on a dict, and I don't know why. Below is the minimal code that replicates the situation.
Thank you in advance for your response...
import numpy as np
import os
import pandas as pd
import gensim
from gensim import corpora
import datetime
import nltk
model_name = "lda_sub_full_35"
dictionary_name = "lda_sub_full_35.id2word"
model_for_inference = gensim.models.LdaModel.load(model_name, mmap='r')
print('Successfully loaded the model')
lda_dictionary = corpora.Dictionary.load(dictionary_name, mmap='r')
I expect both the dictionary and the model to load, but when I load the dictionary I get the error below:
File "topic_inference.py", line 31, in <module>
lda_dictionary = corpora.Dictionary.load(dictionary_name, mmap='r')
File "/topic_modeling/env/lib/python3.8/site-packages/gensim/utils.py", line 487, in load
obj._load_specials(fname, mmap, compress, subname)
AttributeError: 'dict' object has no attribute '_load_specials'```
How were the contents of the lda_sub_full_35.id2word file originally saved?
Only if it was saved by a Gensim corpora.Dictionary object's .save() method should it be loaded as you've tried, with corpora.Dictionary.load().
If, by any chance, it was just a plain Python dict saved via some other method of writing a pickle()-created object, then you would need to load it in a symmetrically-matched way. That might be as simple as:
import pickle

with open(path, 'rb') as f:
    lda_dictionary = pickle.load(f)
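For contrast, a minimal sketch of the symmetric Gensim round trip, assuming you control how the id2word file is written: a dictionary saved with Dictionary.save() loads cleanly with Dictionary.load(), exactly as tried above.

from gensim import corpora

# Hypothetical round trip: a real corpora.Dictionary saved with .save() ...
dictionary = corpora.Dictionary([["some", "example", "tokens"]])
dictionary.save("lda_sub_full_35.id2word")

# ... loads with the matching classmethod.
lda_dictionary = corpora.Dictionary.load("lda_sub_full_35.id2word", mmap='r')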

TypeError: join() argument must be str or bytes, not 'TextIOWrapper'

I have features and a target variable from which I want to generate a decision tree. However, the code throws an error. Since the out_file argument did not generate an error, I figured there wouldn't be one for Source.from_file either, but there is.
import os
from graphviz import Source
from sklearn.tree import export_graphviz

f = open("C:/Users/julia/Desktop/iris_tree.dot", 'w')
export_graphviz(
    tree_clf,
    out_file=f,
    feature_names=sample2[0:2],
    class_names=sample2[5],
    rounded=True,
    filled=True
)
Source.from_file(f)
As noted in the docs, from_file accepts a string path, not a file object:
filename – Filename for loading/saving the source.
Just pass the path in:
import os
from graphviz import Source
from sklearn.tree import export_graphviz

path = "C:/Users/julia/Desktop/iris_tree.dot"
# Use a context manager so the file is flushed and closed before being read back.
with open(path, 'w') as f:
    export_graphviz(
        tree_clf,
        out_file=f,
        feature_names=sample2[0:2],
        class_names=sample2[5],
        rounded=True,
        filled=True
    )
Source.from_file(path)
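As a side note beyond the original answer, export_graphviz also accepts out_file=None and then returns the DOT source as a string, which avoids the intermediate file entirely:

from graphviz import Source
from sklearn.tree import export_graphviz

# out_file=None makes export_graphviz return the DOT source as a string.
dot_source = export_graphviz(
    tree_clf,
    out_file=None,
    feature_names=sample2[0:2],
    class_names=sample2[5],
    rounded=True,
    filled=True
)
Source(dot_source)  # build the graph directly from the string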

Python module not accessible to function inside class

The code below works as expected: it prints 5 random numbers.
import numpy as np

class test_class():
    def __init__(self):
        self.rand_nums = self.create_rand_num()

    def create_rand_num(self):
        numbers = np.random.rand(5)
        return numbers

myclass = test_class()
myclass.rand_nums
However, the following does not work; it raises NameError: name 'np' is not defined.
import numpy as np
from test.calc import create_rand_num

class test_class():
    def __init__(self):
        self.rand_nums = create_rand_num()

myclass = test_class()
myclass.rand_nums

# contents of calc.py in test folder:
def create_rand_num():
    print(np.random.rand(5))
But, this works:
from test.calc import create_rand_num

class test_class():
    def __init__(self):
        self.rand_nums = create_rand_num()

myclass = test_class()
myclass.rand_nums

# contents of calc.py in test folder:
import numpy as np

def create_rand_num():
    print(np.random.rand(5))
Why must I have 'import numpy as np' inside calc.py? I already have this import before my class definition. I am sure I am misunderstanding something here, but I was trying to follow the general rule to have all the import statements at the top of the main code.
What I find confusing is that when I say "from test.calc import create_rand_num," how does Python know whether "import numpy as np" is included at the top of calc.py or not? It must know somehow, because when I include it, the code works, but when I leave it out, the code does not work.
EDIT: After reading the response from @DeepSpace, I want to ask the following:
Suppose I have the following file.py module with contents listed as shown:
import numpy as np
import pandas as pd
import x as y

def myfunc():
    pass
So, if I have another file, file1.py, and in it I say from file import myfunc, do I get access to np, pd, and y? This is exactly what seems to be happening in my third example above.
In my third example, notice that np is NOT defined anywhere in the main file; it is only defined in calc.py, and I am not importing * from calc.py, only create_rand_num. Why do I not get the same NameError?
Python is not like C. Importing a module does not copy-paste its source; it simply binds the module object to a name in the importing module's namespace. import numpy as np in one file does not magically make np available in all other files.
You have to import numpy as np in every file where you want to use np.
Perhaps a worthwhile reading: https://docs.python.org/3.7/reference/simple_stmts.html#the-import-statement
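A small sketch, not from the original answer, that makes the rule visible: a function carries the global namespace of the module where it was defined, which is why the third example works even though np is never named in the main file.

from test.calc import create_rand_num

# A function resolves global names in its *defining* module's namespace:
# calc.py imported numpy, so 'np' lives in create_rand_num.__globals__.
print('np' in create_rand_num.__globals__)  # True in the third example
create_rand_num()  # looks up np in calc.py's globals, not the caller's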

Python3: pickle a function without side effects

I have a project with a function foo in a module my_project.my_functions. I want to pickle that function so that I can unpickle it somewhere else without having to import my_project. foo has no side effects and no dependencies outside the function.
I'm using dill to pickle foo, but dill saves it as <function my_project.my_functions.foo> and complains about the unknown my_project module when I try to unpickle it.
Any solution?
I solved it by recreating the function from its code object, giving it an empty globals dictionary.
In /my_project/module.py:

def f(n):
    return n + 1

In my_project, before pickling the function:

import dill
import types

import module

f = types.FunctionType(module.f.__code__, {})
with open("my_func.pkl", 'wb') as fs:
    dill.dump(f, fs)

Somewhere else:

import dill

with open("my_func.pkl", 'rb') as fs:
    f = dill.load(fs)
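A quick self-contained check, a sketch rather than part of the original answer, that the rebuilt function really is free of global dependencies and survives the round trip:

import types

import dill

def f(n):
    return n + 1

# An empty globals dict means any module-level name used inside f would
# now raise NameError, which is exactly the self-containment we want.
g = types.FunctionType(f.__code__, {})

payload = dill.dumps(g)
h = dill.loads(payload)
assert h(41) == 42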

Python Unittest for big arrays

I am trying to put together a unit test to check whether my function, which reads in big data files, produces the correct result in the shape of a NumPy array. However, these files and arrays are huge and cannot be typed in. I believe I need to save input and output files and test using them. This is what my test module looks like:
import numpy as np
from myFunctions import fun1
import unittest

class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        self.assertEqual(fun1(m1, m2), R)

if __name__ == '__main__':
    unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
Edit:
Also getting an attribute error now:
AttributeError: TestMyFunctions object has no attribute '_testMethodName'
Update: the AttributeError is solved. Defining def __init__() is not allowed here; I changed it to def setUp()!
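One suggestion on the comparison itself, beyond the original post: assertEqual on NumPy arrays raises ValueError ("the truth value of an array with more than one element is ambiguous"), whereas numpy.testing compares element-wise and prints a readable mismatch summary on failure. A sketch of the test method:

import numpy as np

# assert_allclose tolerates floating-point noise; use
# np.testing.assert_array_equal for an exact comparison.
def test_fun1(self):
    m1 = np.genfromtxt(self.inputFile1)
    m2 = np.genfromtxt(self.inputFile2)
    R = np.genfromtxt(self.outputFile)
    np.testing.assert_allclose(fun1(m1, m2), R)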
