Pyspark with Zeppelin: distributing files to cluster nodes versus SparkContext.addFile() - python-3.x

I have a library I built that I want to make available to all nodes of a PySpark cluster (1.6.3). I run test programs on that Spark cluster through Zeppelin (0.7.3).
The files I need are in a GitHub repository, so I cloned that repository onto every node of the cluster and wrote a pssh script to update them all simultaneously. The files therefore exist at a fixed location on each node, and I want them importable from each node.
I tried this
import sys
sys.path.insert(0, "/opt/repo/folder/")
from module import function
return_rdd = function(arguments)
This yielded an error stack of:
File "/usr/hdp/current/spark-client/python/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 439, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'module'
I find this error unusual since it is triggered by the pickle call. The code appears to load a DataFrame and partition it, but only fails when another function from the module is called on the partitioned DataFrame converted to an RDD. I'm not certain where and why pickling is involved here; the module's script should not need to be pickled, since the module in question should already be on sys.path on each node of the cluster.
On the other hand, I was able to get it working with:
sc.addFile("/opt/repo/folder/module.py")
import sys
from pyspark import SparkFiles
sys.path.insert(0, SparkFiles.getRootDirectory())
from module import function
return_rdd = function(arguments)
Any idea why the first approach doesn't work?

A possible solution is:
sc.addFile("/opt/repo/folder/module.py")
import sys
from pyspark import SparkFiles
sys.path.insert(0, SparkFiles.getRootDirectory())
from module import function
return_rdd = function(arguments)
Note that this approach does not work in cluster mode.
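If the job has to run in cluster mode, one alternative worth trying (a sketch, not part of the original answer) is SparkContext.addPyFile, which ships a .py or .zip dependency and adds it to the executors' import path, so no SparkFiles/sys.path handling is needed on the worker side:

import sys

# The repository clone exists at the same path on every node (see the question),
# so the driver can import straight from the local checkout.
sys.path.insert(0, "/opt/repo/folder/")

# addPyFile distributes module.py to the executors and adds it to their Python
# path; for a multi-file package, zip the folder and pass the .zip instead
# (as the related answer below does).
sc.addPyFile("/opt/repo/folder/module.py")

from module import function
return_rdd = function(arguments)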

Related

Pyspark module not found error even after passing module as zip file

I have a PySpark code repo which I am sending to the Spark session as a zip file through the --py-files parameter. I am doing this because a UDF defined in one of the Python files within the module is not available when we run the code, since the module is not available on the workers.
Even though all the Python files are present inside the zip file, I still get the module-not-found error.
|-Module
|----test1.py
|----test2.py
|----test3.py
When I try from Module.test2 import foo in test3.py, I get an error that Module.test2 is not found. test2 contains a PySpark UDF.
Any help would be greatly appreciated.
from pyspark.sql import SparkSession

# Initiate a Spark session
spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()

Remove the .config(...) line if it is not needed.
Make sure that Module.zip is in the same directory as main.py, then add it to the SparkContext:
spark.sparkContext.addPyFile("Module.zip")
Import the dependency only after the zip has been added:
from Module.test2 import foo
Finally, the layout is:
|-main.py - Core code to be executed.
|-Module.zip - Dependency module, zipped.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()

spark.sparkContext.addPyFile("Module.zip")

from Module.test2 import foo
IMPORTANT: make sure you zip your files correctly; if the problem persists, try other zip methods.
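A common cause of this error is zipping the files inside Module/ rather than the Module/ folder itself, so the archive has no top-level package directory. A small sketch of building the archive (assuming the layout shown above, and that Module/ contains an __init__.py so it is importable as a package):

import shutil

# Run from the directory that contains main.py and the Module/ folder.
# base_dir="Module" keeps the top-level folder inside the archive, so the zip
# contains Module/__init__.py, Module/test1.py, Module/test2.py, Module/test3.py.
shutil.make_archive("Module", "zip", root_dir=".", base_dir="Module")

Unzipping the result should show a single Module/ directory at the top level; if it shows the .py files directly, from Module.test2 import foo will fail on the workers.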

Error Launching Blob Trigger function in Azure Functions expected str, bytes or os.PathLike object, not PosixPath

My problem is: when I execute a freshly uploaded Python function in an Azure Function App service and launch it (no matter whether I use a blob trigger or an HTTP trigger), I always get the same error:
Exception while executing function: Functions.TestBlobTrigger Result: Failure
Exception: TypeError: expected str, bytes or os.PathLike object, not PosixPath
Stack: File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 284, in _handle__function_load_request
func = loader.load_function(
File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/utils/wrappers.py", line 40, in call
return func(*args, **kwargs)
File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/loader.py", line 53, in load_function
register_function_dir(dir_path.parent)
File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/loader.py", line 26, in register_function_dir
_submodule_dirs.append(fspath(path))
Why is this happening? After the function is successfully deployed, I upload a file to a blob in order to trigger the function, but I always get the same error, caused by the pathlib library (https://pypi.org/project/pathlib/). I have written a very simple function that works in my local VS Code and just prints a message.
import logging
import configparser
import azure.functions as func
from azure.storage.blob import BlockBlobService
import os
import datetime
import io
import json
import calendar
import aanalytics2 as api2
import time
import pandas as pd
import csv
from io import StringIO
def main(myblob: func.InputStream):
    logging.info("Blob trigger function launched")
    blob_bytes = myblob.read()
    blobmessage = blob_bytes.decode()
    func1 = PythonAPP.callMain()
    func1.main(blobmessage)
The PythonAPP class is:
class PythonAPP:
    def __init__(self):
        logging.info('START extractor.')
        self.parameter = "product"

    def main(self, message1):
        var1 = "--"
        try:
            var1 = "---"
            logging.info('END: ->paramet ' + str(message1))
        except Exception as inst:
            logging.error("Error PythonAPP.main: " + str(inst))
        return var1
My requirements.txt file is:
azure-storage-blob== 0.37.1
azure-functions-durable
azure-functions
pandas
xlrd
pysftp
openpyxl
configparser
PyJWT==1.7.1
pathlib
dicttoxml
requests
aanalytics2
I've created this simple function in order to check whether I can upload the simplest possible example to Azure Functions. Are there any dependencies that I am forgetting?
------------ UPDATE 1 ------------
The function is failing because of the pathlib import: the function's requirements pull in this library, and it breaks Azure Functions. Please see the requirements.txt file at the following link: https://github.com/pitchmuc/adobe_analytics_api_2.0/blob/master/requirements.txt
Can I exclude it somehow?
Well, I can't provide an answer for that, but I found a workaround. I created a copy of the library in a GitHub repository and, in that copy, removed the references to pathlib from requirements.txt and setup.py, because this library causes the failure in Azure Function Apps. Then reference that fork in your own project's requirements: take the requirements.txt I wrote above and change the aanalytics2 line to:
git+git://github.com/theURLtotherepository.git@master#egg=theprojectname
LINKS.
I've checked a lot of examples on Google, but none of them helped me:
Azure function failing after successfull deployment with OSError: [Errno 107]
https://github.com/Azure/azure-functions-host/issues/6835
https://learn.microsoft.com/en-us/answers/questions/39865/azure-functions-python-httptrigger-with-multiple-s.html
https://learn.microsoft.com/en-us/answers/questions/147627/azure-functions-modulenotfounderror-for-python-scr.html
Missing Dependencies on Python Azure Function --> no bundler flag or –build remote
https://github.com/OpenMined/PySyft/issues/3400
This is a bug in the Azure codebase itself, specifically within:
azure-functions-python-worker/azure_functions_worker/dispatcher.py
The problematic code in the dispatcher sets up the exposed functions with the metadata parameters found in function.json.
You will not encounter the issue if you are not using pathlib within your function app / web app code; if pathlib is available, the issue manifests. Rather than simple os.path strings, pathlib.Path objects are exposed, and deeper in the codebase there appears to be a conditional import and use of pathlib.
To resolve it, simply remove pathlib from your requirements.txt and redeploy. You will need to refactor any of your own code that used pathlib to use the equivalent methods in the os module.
A ticket appears to have been opened to resolve this around the time of the OP's post, but it is not resolved in the current release.
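For the refactoring step, a short sketch of common pathlib patterns and their os/os.path equivalents (the file names here are hypothetical, purely for illustration):

import os

# pathlib.Path("data") / "input.csv"   ->  os.path.join
csv_path = os.path.join("data", "input.csv")

# path.parent                          ->  os.path.dirname
parent_dir = os.path.dirname(csv_path)

# path.exists()                        ->  os.path.exists
if os.path.exists(csv_path):
    # path.read_text()                 ->  plain open()/read()
    with open(csv_path) as f:
        contents = f.read()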

Python import files from 3 layers

I have the following file structure
home/user/app.py
home/user/content/resource.py
home/user/content/call1.py
home/user/content/call2.py
I have imported resource.py in app.py as below:
import content.resource
Also, I have imported call1 and call2 in resource.py
import call1
import call2
The requirement is to run two tests individually.
run app.py
run resource.py
When I run app.py, it says it cannot find call1 and call2.
When I run resource.py, the file runs without any issues. How can I run app.py so that it can use the functions imported in resource.py, as well as call1.py and call2.py?
All four files have __init__ and main functions.
In your __init__ files, just create a list like this for each init, so for your user __init__: __all__ = ["app", "content"]
And for your content __init__: __all__ = ["resource", "call1", "call2"]
First try: export PYTHONPATH=/home/user <-- make sure this is the correct absolute path.
If that doesn't solve the issue, try adding content to the path as well.
try: export PYTHONPATH=/home/user/:/home/user/content/
This should definitely work.
You will then import like so:
import user.app
import user.content.resource
NOTE
Whatever you want to use, you must import it in every file that uses it. Don't bother importing things in __init__; just list whichever modules that __init__ includes with __all__ = [].
You have to import call1 and call2 in app.py if you want to call them there.
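Putting the answers together, a minimal sketch of what this could look like (assuming __init__.py files exist in /home/user and /home/user/content, and that PYTHONPATH includes /home/user):

# home/user/content/__init__.py
__all__ = ["resource", "call1", "call2"]

# home/user/content/resource.py
# Package-qualified imports work both when resource is imported from app.py
# and when /home/user is on PYTHONPATH.
from content import call1
from content import call2

# home/user/app.py
import content.resource   # finds the package because /home/user is on PYTHONPATH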

dask worker cannot import module

I am running a dask cluster and a worker with 16 cores, started using the CLI utilities.
In general it seems to work very well.
However, for some reason it will not import modules in the cwd.
I try to run the following from my notebook instance:
def tstimp():
    import os
    return os.listdir()

c.run(tstimp)
And I get the following output:
{'tcp://192.168.1.90:35885': ['class_positions.csv',
'.gitignore',
'README.md',
'fullrun.ipynb',
'.git',
'rf.py',
'__pycache__',
'dask-worker-space',
'utils.py',
'.ipynb_checkpoints']}
Note that the module rf.py is listed here.
Thus it should be possible to import it in the worker, but when I run the following code:
def tstimp():
    import rf
    return 42

c.run(tstimp)
I get this error: ModuleNotFoundError: No module named 'rf'
Why am I getting this error?
It seems like the current directory is not added to the python path of the workers.
You should be able to fix this by adding it to the path.
def tstimp():
    import sys
    sys.path.append('.')
    import rf
    return 42

c.run(tstimp)
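Alternatively, the distributed client can ship the module to the workers for you; a sketch, assuming c is a dask.distributed.Client and rf.py sits in the notebook's working directory:

# upload_file sends the local rf.py to the workers and makes it importable there,
# so no sys.path manipulation is needed.
c.upload_file('rf.py')

def tstimp2():
    import rf
    return 42

c.run(tstimp2)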

Running neo4j-Python code in Eclipse with Pydev under ArchLinux

So I installed neo4j on Arch Linux (AUR Link) and want to test it using Python 3.2.
I am using Python 3.2 and Eclipse with PyDev.
I tried the following code from the neo4j website, although I think it was still Python 2.7 code, and I tried to convert it to Python 3.2.
Here's the code:
import os
libpath = '/usr/share/java/neo4j'
os.environ['CLASSPATH'] = ';'.join([os.path.abspath(p) for p in os.listdir(libpath)])

from neo4j import GraphDatabase

# Create a database
db = GraphDatabase('/home/USERNAME/.db/neo4j/HelloWorld')

# All write operations happen in a transaction
with db.transaction:
    firstNode = db.node(name='Hello')
    secondNode = db.node(name='world!')
    # Create a relationship with type 'knows'
    relationship = firstNode.knows(secondNode, name='graphy')

# Read operations can happen anywhere
message = ' '.join([firstNode['name'], relationship['name'], secondNode['name']])
print(message)

# Delete the data
with db.transaction:
    firstNode.knows.single.delete()
    firstNode.delete()
    secondNode.delete()

# Always shut down your database when your application exits
db.shutdown()
But I get the following error message:
Traceback (most recent call last):
File "/home/USERNAME/PATH/TO/src/neo4j-HelloWorld.py", line 12, in <module>
from neo4j import GraphDatabase
File "/usr/lib/python3.2/site-packages/neo4j_embedded-1.6-py3.2.egg/neo4j/__init__.py", line 29, in <module>
from neo4j.core import GraphDatabase, Direction, NotFoundException, BOTH, ANY, INCOMING, OUTGOING
File "/usr/lib/python3.2/site-packages/neo4j_embedded-1.6-py3.2.egg/neo4j/core.py", line 19, in <module>
from _backend import *
ImportError: No module named _backend
I just can't figure out what's wrong!
I tried to set the CLASSPATH as described here, but it doesn't change anything.
I would really appreciate any help!
Did you run the code through 2to3?
If not, I suggest you do.
I think the problem is that the implicit relative import syntax changed in 3.x; see PEP 328 for details.
For example, the offending import in core.py should probably say from ._backend import *
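To illustrate the change (a generic sketch, not the actual neo4j sources): inside a package, Python 3 no longer resolves a bare module name against the package itself, so the sibling module has to be named explicitly:

# neo4j/core.py (illustrative)

# Python 2 allowed an implicit relative import of the sibling module:
#   from _backend import *

# Python 3 (PEP 328) requires an explicit relative import:
from ._backend import *

# or an absolute import:
#   from neo4j._backend import *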
