Importing a package in client mode in PySpark

I need to use a virtual environment in PySpark on an EMR cluster.
I am launching the application with spark-submit using the following configuration:
spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
The environment variables are set in the Python script. Importing packages inside the function run on the Spark executors works, but I want to import them outside the function. What's wrong?
import os

from pyspark import SparkConf
from pyspark import SparkContext

import pendulum  # ImportError here!

os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"

conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def some_function(x):
    import pendulum  # this import works
    dur = pendulum.duration(days=x)
    # Use the library to do work
    return dur.weeks

rdd = (sc.parallelize(range(1000))
       .map(some_function)
       .take(10))
print(rdd)

import pendulum  # ImportError again

From the looks of it, you are using the correct environment for the workers ("PYSPARK_PYTHON") but overriding the environment for the driver ("PYSPARK_DRIVER_PYTHON"), so the driver's Python does not see the package you want to import. The code in some_function is executed by the workers (never by the driver), so it can see the imported package, while the top-level imports are executed by the driver and fail. Note also that "which python" is assigned as a literal string, not as the output of the shell command.
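A minimal sketch of the fix suggested by this answer, assuming the submitting machine has a local interpreter with pendulum installed (the driver-side path below is illustrative, not from the question):

```python
import os

# Executors unpack the --archives bundle into their working directory,
# so this relative path is valid on the workers.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

# In client mode the driver runs on the submitting machine; point it at a
# local interpreter that actually has pendulum installed, instead of the
# literal string "which python".
os.environ["PYSPARK_DRIVER_PYTHON"] = "/path/to/local/environment/bin/python"
```

With both variables pointing at interpreters that have the package, the top-level import on the driver and the import inside the mapped function both succeed.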

Related

Pyspark module not found error even after passing module as zip file

I have a PySpark code repository which I am sending to the Spark session as a zip file through the --py-files parameter. I am doing this because a UDF defined in one of the Python files within the module is not available when we run the code, as the module is not available on the workers.
Even though all the Python files are present inside the zip file, I still get the module-not-found error.
|-Module
|----test1.py
|----test2.py
|----test3.py
When I try from Module.test2 import foo in test3.py, I get an error that Module.test2 is not found. test2 contains a PySpark UDF.
Any help would be greatly appreciated.
from pyspark.sql import SparkSession

# Initiate a Spark session
spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()
Remove the .config(conf=conf) line if it is not needed.
Make sure that Module.zip is in the same directory as main.py, then register it:
spark.sparkContext.addPyFile("Module.zip")
Import dependencies only after adding Module.zip to the SparkContext:
from Module.test2 import foo
Finally, the layout and full code:
|-main.py - Core code to be executed.
|-Module.zip - Dependency module zipped.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()

spark.sparkContext.addPyFile("Module.zip")

from Module.test2 import foo
IMPORTANT - Make sure you zip the module correctly: the archive must contain the Module/ directory itself, not just the files inside it. If the problem persists, try other zip methods.
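To make the zip-layout point concrete, here is a small self-contained sketch (file and package names are illustrative stand-ins, not the real Module). Python's zipimport treats the zip root like a directory on sys.path, which is the same mechanism addPyFile and --py-files rely on:

```python
import os
import shutil
import sys
import zipfile

# Build a tiny stand-in package on disk.
os.makedirs("demo_module", exist_ok=True)
with open("demo_module/__init__.py", "w") as f:
    f.write("")
with open("demo_module/test2.py", "w") as f:
    f.write("def foo():\n    return 'ok'\n")

# Zip from the parent directory so entries read "demo_module/test2.py";
# zipping only the files inside would make "from demo_module.test2
# import foo" fail with a module-not-found error.
shutil.make_archive("demo_bundle", "zip", root_dir=".", base_dir="demo_module")
print(zipfile.ZipFile("demo_bundle.zip").namelist())

# A zip laid out this way is importable directly.
sys.path.insert(0, "demo_bundle.zip")
from demo_module.test2 import foo
print(foo())  # ok
```

If the import fails here, the archive layout is wrong, and it will fail the same way on the Spark workers.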

PySpark ImportError: No module named although included in --py-files

I'm trying to run a PySpark application. The spark-submit command looks something like this:
spark-submit --py-files /some/location/data.py /path/to/the/main/file/etl.py
My main file (etl.py) imports data.py and uses functions from it; the code looks like this:
import data

def main(args_dict):
    print(args_dict)
    df1 = data.get_df1(args_dict['df1name'])
    df1 = data.get_df2(args_dict['df1name'])
    ...
    ...
    ...
I'm passing the data.py file in --py-files, but when I run spark-submit I get ImportError: No module named 'data'.
I'm trying to figure out what I'm doing wrong here. Thank you.

Pyspark with Zeppelin: distributing files to cluster nodes versus SparkContext.addFile()

I have a library that I built that I want to make available to all nodes on a PySpark cluster (1.6.3). I run test programs on that Spark cluster through Zeppelin (0.7.3).
The files I want are in a GitHub repository, so I cloned that repository onto all nodes of the cluster and made a script using pssh to update them all simultaneously. The files therefore exist at a fixed location on each node, and I want them accessible to each node.
I tried this
import sys
sys.path.insert(0, "/opt/repo/folder/")
from module import function
return_rdd = function(arguments)
This yielded an error stack of:
File "/usr/hdp/current/spark-client/python/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 439, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'module'
I find this error unusual since it is raised from the pickle call. The code appears to load a DataFrame and partition it, but only fails when another function within module is called on the partitioned DataFrame converted to an RDD. I'm not certain where and why the pickle call is involved here; the module script should not need to be pickled, since the modules in question should already be on sys.path on each node of the cluster.
On the other hand, I was able to get this working by
sc.addFile("/opt/repo/folder/module.py")
import sys
from pyspark import SparkFiles
sys.path.insert(0, SparkFiles.getRootDirectory())
from module import function
return_rdd = function(arguments)
Any idea why the first approach doesn't work?
A possible solution is:
sc.addFile("/opt/repo/folder/module.py")
import sys
from pyspark import SparkFiles
sys.path.insert(0, SparkFiles.getRootDirectory())
from module import function
return_rdd = function(arguments)
Note that this does not work in cluster mode.

Spark No module named found

I have a simple Spark program and I get the following error:
ImportError: No module named add_num
Command used to run:
./bin/spark-submit /Users/workflow/test_task.py
Code:
from __future__ import print_function
from pyspark.sql import SparkSession
from add_num import add_two_nos

def map_func(x):
    print(add_two_nos(5))
    return x*x

def main():
    spark = SparkSession\
        .builder\
        .appName("test-task")\
        .master("local[*]")\
        .getOrCreate()
    rdd = spark.sparkContext.parallelize([1,2,3,4,5])
    rdd = rdd.map(map_func)  # apply map_func to each element
    print(rdd.collect())
    spark.stop()

if __name__ == "__main__":
    main()
The function code (add_num.py):
def add_two_nos(x):
    return x*x
You can specify the .py file from which you wish to import in the code itself by adding a sc.addPyFile(path) statement. The path passed can be a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
Then use from add_num import add_two_nos.
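As a minimal local illustration of why this ordering works (a simulation, not Spark itself): addPyFile ships the file to every node and puts its location on sys.path there, after which an ordinary import resolves. The helper body is the one from the question:

```python
import os
import sys

# Write the helper from the question to disk (same body as add_num.py above).
with open("add_num.py", "w") as f:
    f.write("def add_two_nos(x):\n    return x*x\n")

# sc.addPyFile("add_num.py") effectively does this on every node: make the
# file's directory importable, so the import below succeeds.
sys.path.insert(0, os.getcwd())
from add_num import add_two_nos
print(add_two_nos(5))  # 25
```

The key point is that addPyFile must run before any task that performs the import is sent to the executors.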
You need to include a zip containing add_num.py in your spark-submit command:
./bin/spark-submit --py-files sources.zip /Users/workflow/test_task.py
When submitting a Python application to Spark, all source files imported by the main file (here test_task.py) should be packed in egg or zip format and supplied to Spark with the --py-files option. If the main file needs only one other file, you can supply it directly without zipping:
./bin/spark-submit --py-files add_num.py /Users/workflow/test_task.py
The command above should also work, since only one other Python source file is required.

Pyspark running external program using subprocess can't read files from hdfs

I'm trying to run an external program (such as bwa) within PySpark. My code looks like this:
import sys
import subprocess
from pyspark import SparkContext

def bwaRun(args):
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result

sc = SparkContext(appName = 'sub')
ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'
chunk_name = []
chunk_name.append(input)
data = sc.parallelize(chunk_name, 1)
print data.map(bwaRun).collect()
I'm running Spark on a YARN cluster with 6 slave nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. It's kind of obvious why this doesn't work: running the bwa program locally with
bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq
on the shell also fails, because bwa can't read files from HDFS.
Can anyone give me an idea of how I could solve this?
Thanks in advance!
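No answer is recorded for this question, but following the diagnosis in the question itself (bwa only understands local paths, not hdfs:// URIs), one commonly suggested workaround is to copy each input down to the worker's local disk before invoking the tool. A hedged sketch; the use of the hdfs CLI on each worker, and the localization step itself, are assumptions rather than part of the original post:

```python
import os
import subprocess
import tempfile

def fetch_to_local(hdfs_path):
    """Copy an hdfs:// file to the worker's local disk and return the path."""
    local = os.path.join(tempfile.mkdtemp(), os.path.basename(hdfs_path))
    # Assumes the `hdfs` CLI is on PATH on every worker node.
    subprocess.check_call(["hdfs", "dfs", "-get", hdfs_path, local])
    return local

def bwaRun(args):
    # Localize both the reference and the reads before calling bwa,
    # since bwa cannot open hdfs:// URIs directly.
    local_ref = fetch_to_local(ref)
    local_reads = fetch_to_local(args)
    return subprocess.check_output(
        ["/home/hd_spark/tool/bwa-0.7.13/bwa", "mem", local_ref, local_reads])
```

The same effect can also be achieved at submit time by shipping the files with --files, which makes them appear in each executor's working directory under their base names.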
