Spark No module named found - apache-spark

I have a simple spark program and I get the following error -
Error:-
ImportError: No module named add_num
Command used to run :-
./bin/spark-submit /Users/workflow/test_task.py
Code:-
from __future__ import print_function
from pyspark.sql import SparkSession
from add_num import add_two_nos
def map_func(x):
    print(add_two_nos(5))
    return x*x
def main():
    spark = SparkSession\
        .builder\
        .appName("test-task")\
        .master("local[*]")\
        .getOrCreate()
    rdd = spark.sparkContext.parallelize([1,2,3,4,5]) # distribute the list as an RDD
    rdd = rdd.map(map_func) # apply map_func to each element
    print(rdd.collect())
    spark.stop()
if __name__ == "__main__":
    main()
Function code (add_num.py):-
def add_two_nos(x):
    return x*x

You can specify the .py file from which you wish to import in the code itself by adding the statement sc.addPyFile(path), where sc is the SparkContext (spark.sparkContext in the code above). The path can be a local file, a file in HDFS (or another Hadoop-supported filesystem), or an HTTP, HTTPS or FTP URI.
Then use from add_num import add_two_nos as before.

You need to include a zip containing add_num.py in your spark-submit command.
./bin/spark-submit --py-files sources.zip /Users/workflow/test_task.py
When submitting a Python application to Spark, all the source files imported by the main file (here test_task.py) should be packed in an egg or zip file and supplied to Spark using the --py-files option. If the main file needs only one other file, you can supply it directly without zipping it.
./bin/spark-submit --py-files add_num.py /Users/workflow/test_task.py
The above command should also work, since only one other Python source file is required.
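The reason a zip works at all is that Spark places the files given to --py-files on the Python path, and Python can import modules directly from a zip archive on that path. The sketch below illustrates only that import mechanism in plain Python, without Spark. The helper build_sources_zip and the temporary directory are illustrative assumptions, and the module source is a stand-in for the add_num.py shown in the question.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_sources_zip(zip_path, module_name, module_source):
    """Write a single module at the root of a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(module_name + ".py", module_source)
    return zip_path


# Stand-in for the add_num.py file from the question.
ADD_NUM_SOURCE = "def add_two_nos(x):\n    return x * x\n"

work_dir = tempfile.mkdtemp()
zip_path = build_sources_zip(
    os.path.join(work_dir, "sources.zip"), "add_num", ADD_NUM_SOURCE
)

# Once the archive is on sys.path, the import resolves from inside the zip.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()
from add_num import add_two_nos

result = add_two_nos(5)
```

Note that add_num.py sits at the root of the archive here. If it were nested inside a folder within the zip, the plain import add_num would no longer resolve.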

Related

Importing package in client mode PYSPARK

I need to use a virtual environment in a PySpark EMR cluster.
I am launching the application with spark-submit using the following configuration.
spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
The environment variables are set in the Python script. The code works when I import packages inside a function run through the Spark context, but I want to import them outside the function. What's wrong?
import os
from pyspark import SparkConf
from pyspark import SparkContext
import pendulum # IMPORT ERROR
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"
conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)
def some_function(x):
    import pendulum # IMPORT CORRECT
    dur = pendulum.duration(days=x)
    # More properties
    # Use the libraries to do work
    return dur.weeks
rdd = (sc.parallelize(range(1000))
       .map(some_function)
       .take(10))
print(rdd)
import pendulum # IMPORT ERROR
From the looks of it, you are using the correct environment for the workers ("PYSPARK_PYTHON") but overriding the environment for the driver ("PYSPARK_DRIVER_PYTHON"), so the driver's Python does not see the package you want to import. The code in some_function is executed by the workers (never by the driver), so it can see the imported package, while the last import is executed by the driver and fails.
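One way to confirm this kind of mismatch is to ask each interpreter directly whether it can import the package. The sketch below is a plain-Python check, not a Spark API; can_import is a hypothetical helper, and json and the made-up module name are used only so the example runs anywhere. In the setup above you would pass the driver's interpreter and ./environment/bin/python along with pendulum.

```python
import subprocess
import sys


def can_import(python_executable, module_name):
    """Return True if the given interpreter can import module_name.

    Hypothetical helper: it starts the interpreter in a subprocess and
    checks the exit code of a bare import statement.
    """
    completed = subprocess.run(
        [python_executable, "-c", "import " + module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode == 0


# sys.executable is the interpreter running this script (the driver's Python
# when the script is the Spark driver program).
driver_sees_json = can_import(sys.executable, "json")
driver_sees_missing = can_import(sys.executable, "module_that_does_not_exist_xyz")
```

If the check succeeds for the workers' interpreter but fails for the driver's, the top-level import will fail exactly as described, while imports inside mapped functions keep working.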

Pyspark module not found error even after passing module as zip file

I have a PySpark code repo which I am sending to the Spark session as a zip file through the --py-files parameter. I am doing this because a UDF is defined in one of the Python files within the module, and it is not available when we run the code because the module is not available on the workers.
Even though all the Python files are present inside the zip file, I still get the module not found error.
|-Module
|----test1.py
|----test2.py
|----test3.py
When I use from Module.test2 import foo in test3.py, I get an error that Module.test2 is not found. test2 contains a PySpark UDF.
Any help would be greatly appreciated.
from pyspark.sql import SparkSession
Initiate a Spark session:
spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()
Remove the config if it is not needed.
Make sure that Module.zip is in the same directory as main.py.
spark.sparkContext.addPyFile("Module.zip")
Import the dependencies after adding the Module.zip path to the SparkSession:
from Module.test2 import foo
Finally:
|-main.py - Core code to be executed.
|-Module.zip - Dependency module zipped.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .appName(self.app_name)\
    .master(self.master)\
    .config(conf=conf)\
    .getOrCreate()
spark.sparkContext.addPyFile("Module.zip")
from Module.test2 import foo
IMPORTANT - Make sure you zip your files correctly; if the problem persists, try other zip methods.
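As for what "correctly" can mean here: for from Module.test2 import foo to resolve, the entries inside the archive need to keep the Module/ prefix rather than sitting at the archive root. The sketch below shows one way to produce such an archive with the standard library. The helper zip_package is hypothetical, the files are empty placeholders, and the __init__.py is an assumption that does not appear in the listing from the question.

```python
import os
import shutil
import tempfile
import zipfile


def zip_package(parent_dir, package_name, output_dir):
    """Zip parent_dir/package_name so entries keep the package_name/ prefix.

    Hypothetical helper built on shutil.make_archive; returns the archive path.
    """
    base_name = os.path.join(output_dir, package_name)
    return shutil.make_archive(
        base_name, "zip", root_dir=parent_dir, base_dir=package_name
    )


# Build a throwaway copy of the layout from the question.
# __init__.py is an assumption; the original listing does not show it.
parent_dir = tempfile.mkdtemp()
package_dir = os.path.join(parent_dir, "Module")
os.mkdir(package_dir)
for file_name in ("__init__.py", "test1.py", "test2.py", "test3.py"):
    with open(os.path.join(package_dir, file_name), "w") as handle:
        handle.write("")

archive_path = zip_package(parent_dir, "Module", tempfile.mkdtemp())

# Inspect the archive: every .py entry should start with "Module/".
with zipfile.ZipFile(archive_path) as archive:
    entries = sorted(name for name in archive.namelist() if name.endswith(".py"))
```

Listing the entries of an existing Module.zip the same way is a quick check: if test2.py appears without the Module/ prefix, the archive was built from inside the folder rather than from its parent.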

How to get the SparkSession to find added python files

After running pip install BigDL==0.8.0, running from bigdl.util.common import * from python completed without issue.
However, with either of the following SparkSessions:
spark = (SparkSession.builder.master('yarn')
    .appName('test')
    .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
    .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
    .getOrCreate()
)
or
spark = (SparkSession.builder.master('local')
    .appName('test')
    .config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
    .config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
    .getOrCreate()
)
I get the following error.
ImportError: ('No module named bigdl.util.common', <function subimport at 0x7fd442a36aa0>, ('bigdl.util.common',))
In addition to the 'spark.submit.pyFiles' config above, after the SparkSession successfully starts, I have tried spark.sparkContext.addPyFile("util.zip"), where "util.zip" contains all of the Python files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl/util .
I have also zipped all of the contents of the folder https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8) and pointed to that file with .config('spark.submit.pyFiles', '/path/to/bigdl.zip'), but this does not work either.
How do I get the SparkSession to see these files?
Figured it out. The only thing that worked was calling spark.sparkContext.addPyFile("bigdl.zip") after the SparkSession has started, where "bigdl.zip" contains all of the files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8).
Not sure why .config('spark.submit.pyFiles', 'bigdl.zip') would not work.

PySpark ImportError: No module named although included in --pyfiles

I'm trying to run a PySpark application. The spark-submit command looks something like this.
spark-submit --py-files /some/location/data.py /path/to/the/main/file/etl.py
My main file (etl.py) imports data.py and uses functions from it. The code looks like this.
import data
def main(args_dict):
    print(args_dict)
    df1 = data.get_df1(args_dict['df1name'])
    df1 = data.get_df2(args_dict['df1name'])
    ...
    ...
    ...
I'm passing the data.py file in --py-files, but when I run spark-submit I get ImportError: No module named 'data'
I'm trying to figure out what I'm doing wrong here. Thank you.
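A first diagnostic step, offered only as a sketch, is to check from inside the driver script whether the module name resolves at all and, if so, from which file. The helper locate_module below is hypothetical and uses only the standard library. json and the made-up module name stand in for data so that the example runs anywhere.

```python
import importlib.util


def locate_module(module_name):
    """Return the file a top-level module would be imported from.

    Hypothetical helper: returns None when the name cannot be resolved
    on the current sys.path.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


json_origin = locate_module("json")
missing_origin = locate_module("module_that_does_not_exist_xyz")
```

Calling locate_module("data") at the top of etl.py, before import data, would show whether the file passed through --py-files is visible to the driver's Python at that point.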

Pyspark running external program using subprocess can't read files from hdfs

I'm trying to run an external program (such as bwa) within PySpark. My code looks like this.
import sys
import subprocess
from pyspark import SparkContext
def bwaRun(args):
    a = ['/home/hd_spark/tool/bwa-0.7.13/bwa', 'mem', ref, args]
    result = subprocess.check_output(a)
    return result
sc = SparkContext(appName = 'sub')
ref = 'hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta'
input = 'hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq'
chunk_name = []
chunk_name.append(input)
data = sc.parallelize(chunk_name,1)
print(data.map(bwaRun).collect())
I'm running Spark on a YARN cluster with 6 worker nodes, and each node has the bwa program installed. When I run the code, the bwaRun function can't read the input files from HDFS. It is fairly obvious why this doesn't work: when I ran bwa locally from the shell with
bwa mem hdfs://Master:9000/user/hd_spark/spark/ref/human_g1k_v37_chr13_26577411_30674729.fasta hdfs://Master:9000/user/hd_spark/spark/chunk_interleaved.fastq
it also failed, because bwa can't read files from HDFS.
Can anyone give me an idea of how I could solve this?
Thanks in advance!

Resources