dask worker cannot import module - python-3.x

I am running a dask cluster and a worker with 16 cores using the CLI utilities.
In general it seems to work very well.
However, for some reason it will not import modules from the current working directory (cwd).
I try to run the following from my notebook instance:
def tstimp():
    import os
    return os.listdir()

c.run(tstimp)
And I get the following output:
{'tcp://192.168.1.90:35885': ['class_positions.csv',
'.gitignore',
'README.md',
'fullrun.ipynb',
'.git',
'rf.py',
'__pycache__',
'dask-worker-space',
'utils.py',
'.ipynb_checkpoints']}
Note that the module rf.py is listed here.
Thus it should be possible to import it on the worker, but when I run the following code:
def tstimp():
    import rf
    return 42

c.run(tstimp)
I get this error: ModuleNotFoundError: No module named 'rf'
Why am I getting this error?

It seems like the current directory is not added to the Python path of the workers.
You should be able to fix this by adding it to the path:
def tstimp():
    import sys
    sys.path.append('.')
    import rf
    return 42

c.run(tstimp)
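If patching sys.path inside every task feels fragile, a minimal sketch of an alternative (assuming c is the same dask.distributed Client used above) is to ship the module file to the workers once with Client.upload_file:
# Sketch: copy rf.py to every worker so a plain `import rf` works there.
c.upload_file('rf.py')

def tstimp():
    import rf  # resolvable on each worker after upload_file
    return 42

c.run(tstimp)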

Related

JupyterLab 3: how to get the list of running servers

Since JupyterLab 3.x, jupyter-server is used instead of the classic notebook server, so the following code no longer lists servers started by jupyter_server:
from notebook import notebookapp
notebookapp.list_running_servers()
None
What still works for the file/notebook name is:
from time import sleep
from IPython.display import display, Javascript
import subprocess
import os
import uuid

def get_notebook_path_and_save():
    # Write a unique marker into the notebook, save it, then grep the
    # .ipynb files in the cwd to find which one contains it.
    magic = str(uuid.uuid1()).replace('-', '')
    print(magic)
    # saves it (ctrl+S)
    # display(Javascript('IPython.notebook.save_checkpoint();'))  # Javascript Error: IPython is not defined
    nb_name = None
    while nb_name is None:
        try:
            sleep(0.1)
            nb_name = subprocess.check_output(f'grep -l {magic} *.ipynb', shell=True).decode().strip()
        except subprocess.CalledProcessError:
            # grep exits non-zero until the marker has been saved to disk
            pass
    return os.path.join(os.getcwd(), nb_name)
But it is neither Pythonic nor fast.
How can I get the currently running server instances, and from there, e.g., the current notebook file?
Migration to jupyter_server should be as easy as changing notebook to jupyter_server and notebookapp to serverapp, and updating the appropriate configuration files; the server-related codebase is largely unchanged. To list servers, simply use:
from jupyter_server import serverapp
serverapp.list_running_servers()
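As a minimal usage sketch (the exact keys in the returned dicts, such as 'url' and 'root_dir', may vary between jupyter_server versions):
from jupyter_server import serverapp

# Print the URL and root directory of each running server.
for srv in serverapp.list_running_servers():
    print(srv.get('url'), srv.get('root_dir'))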

OpenMDAO and NSGA II

I found some interesting code in openmdao\drivers\tests\test_pyoptsparse_driver.py that seems to reference NSGA-II. When I tried running the test code, I noticed that it is not available in my installation.
import sys
import copy
import unittest

sys.path.insert(0, r"[SOMEPATH Here]\GitHub\OpenMDAO")

from distutils.version import LooseVersion
import numpy as np

import openmdao.api as om
from openmdao.test_suite.components.paraboloid import Paraboloid
from openmdao.test_suite.components.expl_comp_array import TestExplCompArrayDense
from openmdao.test_suite.components.sellar import SellarDerivativesGrouped
# from openmdao.utils.assert_utils import assert_near_equal  # NOTE: THIS FUNCTION ISN'T AVAILABLE IN THE PIP INSTALL
from openmdao.utils.general_utils import set_pyoptsparse_opt, run_driver
from openmdao.utils.testing_utils import use_tempdirs
from openmdao.utils.mpi import MPI

_, local_opt = set_pyoptsparse_opt('NSGA2')
if local_opt != 'NSGA2':
    raise unittest.SkipTest("pyoptsparse is not providing NSGA2")  # CODE BASICALLY FAILS HERE
Error that I am seeing:
"pyoptsparse is not providing NSGA2"
Can I add NSGA 2 if it's not available?
When that test was written, NSGA-II was a little difficult to compile with pyoptsparse. I think there are still some challenges with it, but it mostly works now. As of OpenMDAO V3.0 we're not using NSGA-II for anything internally. But if you get it to work, feel free to send a PR with an updated test!
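If your pyoptsparse build does include NSGA2, selecting it from OpenMDAO is just a driver setting. Here is a minimal sketch, assuming OpenMDAO 3.2+ (where design variables can be declared directly on promoted inputs) and a pyoptsparse compiled with NSGA2 support; the opt_settings name below is an assumption to check against the pyoptsparse NSGA2 docs:
import openmdao.api as om
from openmdao.test_suite.components.paraboloid import Paraboloid

prob = om.Problem()
prob.model.add_subsystem('parab', Paraboloid(), promotes=['*'])
prob.model.add_design_var('x', lower=-50.0, upper=50.0)
prob.model.add_design_var('y', lower=-50.0, upper=50.0)
prob.model.add_objective('f_xy')

# Only works if pyoptsparse was built with the NSGA2 optimizer.
prob.driver = om.pyOptSparseDriver()
prob.driver.options['optimizer'] = 'NSGA2'
prob.driver.opt_settings['maxGen'] = 200  # assumed option name; see pyoptsparse docs

prob.setup()
prob.run_driver()
print(prob.get_val('f_xy'))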

Python import files from 3 layers

I have the following file structure
home/user/app.py
home/user/content/resource.py
home/user/content/call1.py
home/user/content/call2.py
I have imported resource.py in app.py as below:
import content.resource
Also, I have imported call1 and call2 in resource.py
import call1
import call2
The requirement is to run two tests individually.
run app.py
run resource.py
When I run app.py, it says it cannot find call1 and call2.
When I run resource.py, it runs without any issues. How can I run app.py so that it can import and call functions from resource.py, call1.py, and call2.py?
All four files have a main entry point (an if __name__ == '__main__': block).
In your __init__.py files, create an __all__ list for each package. For the user package's __init__.py: __all__ = ["app", "content"]
And for the content package's __init__.py: __all__ = ["resource", "call1", "call2"]
First try: export PYTHONPATH=/home/user (make sure this is the correct absolute path).
If that doesn't solve the issue, try adding content to the path as well:
export PYTHONPATH=/home/user/:/home/user/content/
This should definitely work.
You will then import like so:
import user.app
import user.content.resource
NOTE
Whatever you want to use, you must import it in every file where it is used. Don't bother importing inside __init__.py; just list the modules each package exposes with __all__ = [...]
You have to import call1 and call2 in app.py if you want to call them there.
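As a minimal sketch of how the two entry points could look under this setup (assuming PYTHONPATH=/home/user/:/home/user/content/ as suggested above; call1.run and call2.run are hypothetical functions used only for illustration):
# home/user/content/resource.py
import call1  # found via /home/user/content on PYTHONPATH
import call2

def handle():
    call1.run()  # hypothetical function in call1.py
    call2.run()  # hypothetical function in call2.py

if __name__ == '__main__':
    handle()

# home/user/app.py
import content.resource  # found via /home/user on PYTHONPATH

if __name__ == '__main__':
    content.resource.handle()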

Import of an attribute of a python module fails

I have the following directory structure:
http://localhost:8888/notebooks/translation.ipynb
http://localhost:8888/edit/Fill_temp/prepare_test_data.py
In
prepare_test_data.py
I have a function:
def to_cap (EXP_FILE, SAMPLES_FILE: str= EXP_FILE + '.cap', cap_rate=0, by_token=False):
In the notebook
translation.ipynb
I do these imports:
%load_ext autoreload
%autoreload 2
import Fill_temp
import Fill_temp.prepare_test_data
then I run
Fill_temp.prepare_test_data.to_cap("en12.json.pres", "en12.cap.0")
and I get
AttributeError: module 'Fill_temp.prepare_test_data' has no attribute 'to_cap'
How come?
I explicitly imported both the Fill_temp package and the prepare_test_data module.
Do I need to import even the lowest level functions that are defined in the module?
EDIT:
I tried to import the low level function explicitly:
%load_ext autoreload
%autoreload 2
import Fill_temp
import Fill_temp.prepare_test_data
import Fill_temp.prepare_test_data.to_cap
but I get:
ModuleNotFoundError: No module named 'Fill_temp.prepare_test_data.to_cap'; 'Fill_temp.prepare_test_data' is not a package
So what shall I do?
This is a bit bizarre. Basically, it turned out that there was a syntax error in that low-level function.
But instead of reporting it, Jupyter was saying that it doesn't see that function, which is a really counter-intuitive error message.
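For reference, the signature shown above is itself invalid Python: a parameter default cannot refer to another parameter (EXP_FILE), so the definition fails before the function is ever bound as an attribute of the module. A minimal sketch of a valid version, assuming the intent was to default SAMPLES_FILE to EXP_FILE + '.cap':
def to_cap(EXP_FILE, SAMPLES_FILE=None, cap_rate=0, by_token=False):
    # Compute the dependent default inside the body instead.
    if SAMPLES_FILE is None:
        SAMPLES_FILE = EXP_FILE + '.cap'
    ...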

Pyspark with Zeppelin: distributing files to cluster nodes versus SparkContext.addFile()

I have a library that I built that I want to make available to all nodes on a pyspark cluster (1.6.3). I run test programs on that spark cluster through Zeppelin (0.7.3).
The files I want are in a GitHub repository, so I cloned that repository onto all nodes of the cluster and made a script using pssh to update them all simultaneously. The files therefore exist at a set location on each node, and I want them accessible to each node.
I tried this
import sys
sys.path.insert(0, "/opt/repo/folder/")
from module import function
return_rdd = function(arguments)
This yielded an error stack of:
File "/usr/hdp/current/spark-client/python/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 439, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'module'
I find this error unusual since it is prompted by the pickle call. The code appears to load a dataframe and partition it, but it only fails when another function within module is called on the partitioned df converted to an rdd. I'm not certain where and why the pickle call is involved here; the module's Python script should not need to be pickled, since the modules in question should already be on sys.path on each node of the cluster.
On the other hand, I was able to get this working with:
sc.addFile("/opt/repo/folder/module.py")
import sys
from pyspark import SparkFiles
sys.path.insert(0, SparkFiles.getRootDirectory())
from module import function
return_rdd = function(arguments)
Any idea why the first approach doesn't work?
A possible solution is:
sc.addFile("/opt/repo/folder/module.py")
import sys
from pyspark import SparkFiles
sys.path.insert(0, SparkFiles.getRootDirectory())
from module import function
return_rdd = function(arguments)
This is not working in cluster mode.
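If shipping the file and patching sys.path by hand is the sticking point, a sketch of a more compact variant (not verified on Zeppelin 0.7.3 / Spark 1.6.3) is SparkContext.addPyFile, which distributes the file to the executors and puts it on their Python path in one step:
# Sketch, assuming `sc` is the SparkContext that Zeppelin provides.
sc.addPyFile("/opt/repo/folder/module.py")

from module import function
return_rdd = function(arguments)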
