Python Unittest for big arrays - python-3.x

I am trying to put together a unit test to check whether my function, which reads in big data files, produces the correct result in the shape of a numpy array. However, these files and arrays are huge and cannot be typed in by hand. I believe I need to save input and output files and test using them. This is what my test module looks like:
import numpy as np
from myFunctions import fun1
import unittest

class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        self.assertEqual(fun1(m1, m2), R)

if __name__ == '__main__':
    unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
Edit:
Also getting an attribute error now:
AttributeError: TestMyFunctions object has no attribute '_testMethodName'
Update - AttributeError solved: overriding 'def __init__()' like that is not allowed; changed it to def setUp()!
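As a side note (not from the original post): assertEqual on two multi-element numpy arrays raises ValueError because an array's truth value is ambiguous, so numpy's own testing helpers are the usual choice here. A minimal sketch of the test above using numpy.testing.assert_allclose, assuming fun1 returns a numpy array:
import unittest
import numpy as np
from myFunctions import fun1  # assumed to return a numpy array

class TestMyFunctions(unittest.TestCase):
    def test_fun1(self):
        m1 = np.genfromtxt("input1.txt")
        m2 = np.genfromtxt("input2.txt")
        expected = np.genfromtxt("output.txt")
        # Element-wise comparison with a tolerance; prints a concise mismatch
        # summary instead of dumping the whole array on failure.
        np.testing.assert_allclose(fun1(m1, m2), expected)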

Related

How do I suppress the output from log step while running jupyter notebook?

On this page, there is a log_step function which is used to record what each step in a pandas pipeline is doing. The exact function is:
from functools import wraps
import datetime as dt

def log_step(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        tic = dt.datetime.now()
        result = func(*args, **kwargs)
        time_taken = str(dt.datetime.now() - tic)
        print(f"just ran step {func.__name__} shape={result.shape} took {time_taken}s")
        return result
    return wrapper
and it is used in the following fashion:
import pandas as pd

df = pd.read_csv('https://calmcode.io/datasets/bigmac.csv')

@log_step
def start_pipeline(dataf):
    return dataf.copy()

@log_step
def set_dtypes(dataf):
    return (dataf
            .assign(date=lambda d: pd.to_datetime(d['date']))
            .sort_values(['currency_code', 'date']))
My question is: how do I keep the @log_step decorator in front of my functions and be able to use it at will, while making the output of @log_step suppressed by default when I run my Jupyter notebook? I suspect the answer comes down to something more general about using decorators, but I don't really know what to look for. Thanks!
You can indeed remove the print statement or, if you would rather not alter the decorating function, you can redirect sys.stdout so the prints are not shown, as explained here.
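As an illustration of the redirect option (a sketch, not part of the original answer): the prints from the @log_step-decorated steps can be silenced for a single pipeline run with contextlib.redirect_stdout, leaving the decorator itself untouched:
import io
from contextlib import redirect_stdout

# Anything printed by the decorated steps inside this block is discarded.
with redirect_stdout(io.StringIO()):
    clean_df = df.pipe(start_pipeline).pipe(set_dtypes)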

AttributeError: pickle.load() Seldon Deployment

I am doing a Seldon deployment. I have created custom pipelines using sklearn, and they live in MyPipelines/CustomPipelines.py. The main code, i.e. my_prediction.py, is the file that Seldon will execute by default (based on my configuration). In this file I import the custom pipelines. If I execute my_prediction.py locally (PyCharm) it runs fine, but if I deploy it using Seldon I get the error: AttributeError: Can't get attribute 'MyEncoder'
It is unable to load the classes defined in CustomPipelines.py. I tried all the solutions from "Unable to load files using pickle and multiple modules"; none of them worked.
MyPipelines/CustomPipelines.py
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class MyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X
        vars_cat = [var for var in df.columns if df[var].dtypes == 'O']
        cat_with_na = [var for var in vars_cat if df[var].isnull().sum() > 0]
        df[cat_with_na] = df[cat_with_na].fillna('Missing')
        return df
my_prediction.py
import pickle
import pandas as pd
import dill
from MyPipelines.CustomPipelines import MyEncoder
from MyPipelines.CustomPipelines import *
import MyPipelines.CustomPipelines

class my_prediction:
    def __init__(self):
        file_name = 'model.sav'
        with open(file_name, 'rb') as model_file:
            self.model = pickle.load(model_file)

    def predict(self, request):
        data = request.get('ndarray')
        columns = request.get('names')
        X = pd.DataFrame(data, columns=columns)
        predictions = self.model.predict(X)
        return predictions
Error:
File "microservice/my_prediction.py", in __init__
    self.model = pickle.load(model_file)
AttributeError: Can't get attribute 'MyEncoder' on <module '__main__' from 'opt/conda/bin/seldon-core-microservice'>
One of the constraints of the pickle module is that it expects the same classes (under the same module path) to be available in the environment where the artifact gets unpickled. In this case, your my_prediction class is trying to unpickle a MyEncoder artifact, but that class is not available in that environment.
As a quick workaround, you could try to make your MyEncoder class available in the environment where my_prediction runs (i.e. having the same folders / files present there as well). Otherwise, you could look at alternatives to pickle, like cloudpickle or dill, which can serialise your custom code as well (although these also come with their own set of caveats).
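As a hedged sketch of the cloudpickle route (assuming the fitted pipeline lives in a variable called model; the model.sav file name is taken from the question, and register_pickle_by_value requires cloudpickle >= 2.0), the custom transformer's code can be embedded in the artifact at training time:
import cloudpickle
import MyPipelines.CustomPipelines as cp

# Serialise everything defined in the custom module by value (i.e. embed its
# code in the pickle) instead of referencing it by import path.
cloudpickle.register_pickle_by_value(cp)

with open('model.sav', 'wb') as model_file:
    cloudpickle.dump(model, model_file)  # model: the fitted sklearn pipeline
The resulting file is a standard pickle stream, so my_prediction can keep calling pickle.load, as long as cloudpickle itself is installed in the serving image.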

How to write a subclass of numpy.ndarray which only takes complex values?

I would like to create a subclass of numpy.ndarray which is an array of complex numbers. To that end, I'm trying to make the constructor of my subclass return an array of (0+0j). I've been unsuccessful so far...
Here is my code so far:
import numpy as np

class ComplexArray(np.ndarray):
    def __init__(self, args):
        np.ndarray.__init__(args, dtype=complex)
        self.fill(0)

a = ComplexArray(3)
a[0] = 1j
When I run the above code, I get the error TypeError: can't convert complex to float.
I should mention that the reason I want to create such a subclass is that I want to add several methods to it afterwards.
Thank you in advance for your advice!
I have found a solution:
import numpy as np

class ComplexArray(np.ndarray):
    def __new__(cls, n):
        ret = np.zeros(n, dtype=complex)
        return ret.view(cls)
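As a quick check (my own addition, not part of the original answer), the snippet from the question now runs without the TypeError:
a = ComplexArray(3)
a[0] = 1j              # works: the underlying buffer is complex
print(type(a))         # <class '__main__.ComplexArray'>
print(a.dtype)         # complex128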

Python module not accessible to function inside class

The code below works as expected: it prints 5 random numbers.
import numpy as np

class test_class():
    def __init__(self):
        self.rand_nums = self.create_rand_num()

    def create_rand_num(self):
        numbers = np.random.rand(5)
        return numbers

myclass = test_class()
myclass.rand_nums
However, the following does not work; it raises NameError: name 'np' is not defined.
import numpy as np
from test.calc import create_rand_num

class test_class():
    def __init__(self):
        self.rand_nums = create_rand_num()

myclass = test_class()
myclass.rand_nums

# contents of calc.py in test folder:
def create_rand_num():
    print(np.random.rand(5))
But, this works:
from test.calc import create_rand_num

class test_class():
    def __init__(self):
        self.rand_nums = create_rand_num()

myclass = test_class()
myclass.rand_nums

# contents of calc.py in test folder:
import numpy as np

def create_rand_num():
    print(np.random.rand(5))
Why must I have 'import numpy as np' inside calc.py? I already have this import before my class definition. I am sure I am misunderstanding something here, but I was trying to follow the general rule of having all the import statements at the top of the main code.
What I find confusing is that when I write "from test.calc import create_rand_num", how does Python know whether "import numpy as np" is included at the top of calc.py or not? It must know somehow, because when I include it the code works, and when I leave it out the code does not work.
EDIT: After reading the response from @DeepSpace, I want to ask the following:
Suppose I have the following file.py module with contents listed as shown:
import numpy as np
import pandas as pd
import x as y

def myfunc():
    pass
So, if I have another file, file1.py, and in it I write from file import myfunc, do I get access to np, pd, and y? This is exactly what seems to be happening in my third example above.
In my third example, notice that np is NOT defined anywhere in the main file; it is only defined in calc.py, and I am not importing * from calc.py, I am only importing create_rand_num. Why do I not get the same NameError?
Python is not like C. Importing a module does not copy-paste its source; it simply binds the imported name in the importing module's namespace. import numpy as np in one file does not make np magically available in all other files.
You have to import numpy as np in every file you want to use np.
Perhaps a worthwhile reading: https://docs.python.org/3.7/reference/simple_stmts.html#the-import-statement
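To make the namespace point concrete (an illustration of the answer above, not part of it): the name np lives in calc.py's own module namespace, and functions defined there always resolve it from that namespace, no matter which file imports them:
# test/calc.py
import numpy as np

def create_rand_num():
    print(np.random.rand(5))   # 'np' is looked up in calc's module globals

# main file
from test import calc

calc.create_rand_num()   # works: the function carries its defining module along
print(calc.np)           # numpy is reachable as an attribute of the calc module
# np.random.rand(5)      # NameError here unless this file imports numpy itself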

How to distribute classes with PySpark and Jupyter

I have an annoying problem using Jupyter notebook with Spark.
I need to define a custom class inside the notebook and use it to perform some map operations:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext

conf = SparkConf().setMaster("spark://192.168.10.11:7077")\
                  .setAppName("app_jupyter/")\
                  .set("spark.cores.max", "10")
sc = SparkContext(conf=conf)

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

class demo(object):
    def __init__(self, value):
        self.test = value + 10
        pass

distData.map(lambda x: demo(x)).collect()
It gives the following error:
PicklingError: Can't pickle <class '__main__.demo'>: attribute lookup __main__.demo failed
I know what this error is about, but I couldn't figure out a solution.
I have tried:
Defining a demo.py Python file outside the notebook. It works, but it is such an ugly solution...
Creating a dynamic module like this, and then importing it afterwards... This gives the same error.
What would be a solution? I want everything to work in the same notebook.
Is it possible to change something in:
The way Spark works, maybe some pickle configuration?
Something in the code... use some static magic approach?
There is no reliable and elegant workaround here, and this behavior is not particularly related to Spark. It is all about the fundamental design of Python's pickle:
pickle can save and restore class instances transparently; however, the class definition must be importable and live in the same module as when the object was stored.
Theoretically you could define a custom cell magic which would:
Write the content of a cell to a module.
Import it.
Call SparkContext.addPyFile to distribute the module.
from IPython.core.magic import register_cell_magic
import importlib

@register_cell_magic
def spark_class(line, cell):
    module = line.strip()
    f = "{0}.py".format(module)
    with open(f, "w") as fw:
        fw.write(cell)
    globals()[module] = importlib.import_module(module)
    sc.addPyFile(f)
In [2]: %%spark_class foo
   ...: class Foo(object):
   ...:     def __init__(self, x):
   ...:         self.x = x
   ...:     def __repr__(self):
   ...:         return "Foo({0})".format(self.x)
   ...:

In [3]: sc.parallelize([1, 2, 3]).map(lambda x: foo.Foo(x)).collect()
Out[3]: [Foo(1), Foo(2), Foo(3)]
but it is a one-time deal. Once the file is marked for distribution it cannot be changed or redistributed. Moreover, there is the problem of reloading imports on remote hosts. I can think of some more elaborate schemes, but this is simply more trouble than it is worth.
The answer from zero323 is solid: there's no one "right" way to solve this problem. You could indeed use Jupyter magic, as proposed. One other way is to use Jupyter's %%writefile to keep your code inline in a Jupyter cell but write it to disk as a Python file; you can then both import the file into your Jupyter kernel session and ship it with your PySpark job (via addPyFile(), as noted in the other answer), roughly as sketched below. Note that if you make changes to the code but don't restart your PySpark session, you'll have to get the updated code to PySpark somehow.
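A minimal sketch of that %%writefile + addPyFile pattern (a hedged illustration; the module and class names are mine, not from the original answers, and each chunk is meant to run as its own notebook cell, with %%writefile on the first line of its cell):
%%writefile demo_module.py
class Demo(object):
    def __init__(self, value):
        self.test = value + 10   # mirrors the demo class from the question

from demo_module import Demo

sc.addPyFile("demo_module.py")   # make the module importable on the executors

dist_data = sc.parallelize([1, 2, 3, 4, 5])
print(dist_data.map(lambda x: Demo(x).test).collect())   # [11, 12, 13, 14, 15]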
Can we make this easier? I wrote a blog post about this topic as well as a PySpark session wrapper (oarphpy.spark.NBSpark) to help automate a lot of the tricky stuff. See the Jupyter notebook embedded in that post for a working example. The overall pattern looks like this:
import os
import sys

CUSTOM_LIB_SRC_DIR = '/tmp/'
os.chdir(CUSTOM_LIB_SRC_DIR)
!mkdir -p mymodule
!touch mymodule/__init__.py

%%writefile mymodule/foo.py
class Zebra(object):
    def __init__(self, name):
        self.name = name

sys.path.append(CUSTOM_LIB_SRC_DIR)
from mymodule.foo import Zebra

# Create Zebra() instances in the notebook
herd = [Zebra(name=str(i)) for i in range(10)]

# Now send those instances to PySpark!
from oarphpy.spark import NBSpark

NBSpark.SRC_ROOT = os.path.join(CUSTOM_LIB_SRC_DIR, 'mymodule')
spark = NBSpark.getOrCreate()
rdd = spark.sparkContext.parallelize(herd)

def get_name(z):
    return z.name

names = rdd.map(get_name).collect()
Additionally, if you make any changes to the mymodule files on disk (via %%writefile or otherwise), then NBSpark will automatically ship those changes to the active PySpark session.
