How to get the SparkSession to find added python files - apache-spark

After running pip install BigDL==0.8.0, running from bigdl.util.common import * from python completed without issue.
However, with either of the following SparkSessions:
spark = (SparkSession.builder.master('yarn')
.appName('test')
.config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
.config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
.getOrCreate()
)
or
spark = (SparkSession.builder.master('local')
.appName('test')
.config("spark.jars", "/BigDL/spark/dl/target/bigdl-0.8.0-jar-with-dependencies-and-spark.jar")
.config('spark.submit.pyFiles', '/BigDL/pyspark/bigdl/util.zip')
.getOrCreate()
)
I get the following error.
ImportError: ('No module named bigdl.util.common', <function subimport at 0x7fd442a36aa0>, ('bigdl.util.common',))
In addition of the 'spark.submit.pyFiles' config above, after the SparkSession successfully starts, I have tried spark.sparkContext.addPyFile("util.zip") where "util.zip" contains all of the python files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl/util .
I have also zipped all of the contents in this folder https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8) and pointed to that file in the .config('spark.submit.pyFiles', '/path/to/bigdl.zip'), but this also does not work.
How do I get the SparkSession to see these files?

Figured it out. The only thing that worked was spark.sparkContext.addPyFile("bigdl.zip") after the SparkSesssion has started. Where "bigdl.zip" contained all of the files in https://github.com/intel-analytics/BigDL/tree/master/pyspark/bigdl (branch-0.8).
Not sure why .config('spark.submit.pyFiles', 'bigdl.zip') would not work.

Related

Importing package in client mode PYSPARK

I need use virtual environment in pyspark EMR cluster.
I am launching application with spark-sumbit using the following configuration.
spark-submit --deploy-mode client --archives path_to/environment.tar.gz#environment --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
Environment variables are setting in python script. The code is working importing packages inside spark context function, but i wannt import outside the function. What's wrong?
from pyspark import SparkConf
from pyspark import SparkContext
import pendulum #IMPORT ERROR!!!!!!!!!!!!!!!!!!!!!!!
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "which python"
conf = SparkConf()
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)
def some_function(x):
import pendulum #IMPORT CORRECT
dur = pendulum.duration(days=x)
# More properties
# Use the libraries to do work
return dur.weeks
rdd = (sc.parallelize(range(1000))
.map(some_function)
.take(10))
print(rdd)
import pendulum #IMPORT ERROR
from the looks of it, seems you are using the correct environment for the workers ("PYSPARK_PYTHON") but override the environment for the driver ("PYSPARK_DRIVER_PYTHON") so that the driver's python does not see the package you wanted to import. the code in the some_function gets executed by the workers ( never by the driver), and thus it can see the imported package, while the latest import is executed by the driver, and fails.

Pyspark module not found error even after passing module as zip file

I have a Pyspark code repo which i am sending to the spark session as a zip file through --pyFile parameter. I am doing this because there is a UDF defined in one of the python files within the module which is not available when we run the code as the module is not available in the workers.
Even though all the python files are present inside the zip file i still get the module not found error.
|-Module
|----test1.py
|----test2.py
|----test3.py
When i try to from Module.test2 import foo to import in test3.py i get an error that module.test2 is not found. test2 contains an pyspark UDF.
Any help would be greatly appreciated
from pyspark.sql import SparkSession
Initiate a spark session
spark = SparkSession.builder\
.appName(self.app_name)\
.master(self.master)\
.config(conf=conf)\
.getOrCreate()
Remove the config - if not needed.
Make sure that the Module.zip is in the same directory as main.py
spark.sparkContext.addPyFile("Module.zip")
Dependency imports after adding the package.zip path to SparkSession.
from Module.test2 import foo
Finally
|-main.py - Core code to be executed.
|-Module.zip - Dependency module zipped.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.appName(self.app_name)\
.master(self.master)\
.config(conf=conf)\
.getOrCreate()
spark.sparkContext.addPyFile("Module.zip")
from Module.test2 import foo
IMPORTANT - Make sure you correctly zip you file - if the problem persists try other zip methods.

PySpark ImportError: No module named although included in --pyfiles

I'm trying to run a PySpark application. the spark submit command looks something like this.
spark-submit --py-files /some/location/data.py /path/to/the/main/file/etl.py
My main file(etl.py) imports the data.py and uses the functions from data.py file, the code looks like this.
import data
def main(args_dict):
print(args_dict)
df1 = data.get_df1(args_dict['df1name'])
df1 = data.get_df2(args_dict['df1name'])
...
...
...
I'm passing the data.py file in the --py-files, but when I run the spark-submit I'm getting ImportError: No module named 'data'
I'm trying to figure out what it is that I'm doing wrong here. Thank you.

Spark No module named found

I have a simple spark program and I get the following error -
Error:-
ImportError: No module named add_num
Command used to run :-
./bin/spark-submit /Users/workflow/test_task.py
Code:-
from __future__ import print_function
from pyspark.sql import SparkSession
from add_num import add_two_nos
def map_func(x):
print(add_two_nos(5))
return x*x
def main():
spark = SparkSession\
.builder\
.appName("test-task")\
.master("local[*]")\
.getOrCreate()
rdd = spark.sparkContext.parallelize([1,2,3,4,5]) # parallelize into 2
rdd = rdd.map(map_func) # call the image_chunk_func
print(rdd.collect())
spark.stop()
if __name__ == "__main__":
main()
function code:-
def add_two_nos(x):
return x*x
You can specify the .py file form which you wish to import in the code itself by adding a statement sc.addPyFile(Path). The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
Then use from add_num import add_two_nos
You need to include a zip containing add_num.py in your spark-submit command.
./bin/spark-submit --py-files sources.zip /Users/workflow/test_task.py
When submitting a python application to spark, all the source files imported by the main function/file(here test_task.py) should be packed in a egg or zip format and supplied to spark using --py-files option. If the main function needs only one other file, you can supply it directly without zipping it.
./bin/spark-submit --py-files add_num.py /Users/workflow/test_task.py
Above command should also work since there is only one other python source file required.

Pyspark - FileInputDStream: Error finding new files

Hi I'm new to Python Spark and I'm trying out this example from Spark github in order to Counts words in new text files created in the given directory :
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: hdfs_wordcount.py <directory>", file=sys.stderr)
exit(-1)
sc = SparkContext(appName="PythonStreamingHDFSWordCount")
ssc = StreamingContext(sc, 1)
lines = ssc.textFileStream("hdfs:///home/my-logs/")
counts = lines.flatMap(lambda line: line.split(" "))\
.map(lambda x: (x, 1))\
.reduceByKey(lambda a, b: a+b)
counts.pprint()
ssc.start()
ssc.awaitTermination()
And this is what I get :
a warning saying : WARN FileInputDStream: Error finding new files
a warning message saying : WARN FileInputDStream: Error finding new files.
and I got empty results even i'm adding files in this dir :/
Any suggested solution for this ?
thanks.
The issue is spark streaming will not read old files from directory..since all logs files exist before your streaming job started
so what you need to do once you started your streaming job then put/copy input files in hdfs directory either manually or by an script.
I think you are referring to this example. Are you able to run it without modifying as I see you are setting directory to "hdfs:///" in program? You can run the example like below.
For example Spark is at /opt/spark-2.0.2-bin-hadoop2.7. You can run hdfs_wordcount.py available in example directory like below. We are using /tmp as directory to pass as argument to program.
user1#user1:/opt/spark-2.0.2-bin-hadoop2.7$ bin/spark-submit examples/src/main/python/streaming/hdfs_wordcount.py /tmp
Now while this program is running, open another terminal and copy some file to /tmp folder
user1#user1:~$ cp test.txt /tmp
You will see the word count in first terminal.
Solved!
The issue is the build, i use to build like that using maven depending on their readme file from github :
build/mvn -DskipTests clean package
I've build that way depending on their documentation :
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Someone know what those params are ?

Resources