How to set python library constant in Spark/databricks cluster?

I have a python databricks notebook doing batch inference with fastai v2 via a pandas UDF. This throws PIL.Image.DecompressionBombError on some large images.
Setting
Image.MAX_IMAGE_PIXELS = None
sets it only on the driver. How does one set it globally across the cluster to avoid PIL.Image.DecompressionBombError on the workers?
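One common workaround (not from the original post, just a sketch) is to set the constant inside the pandas UDF itself, so the assignment runs in each worker's Python process where the images are actually decoded. The UDF name, return type, and the inference step here are hypothetical placeholders:
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def predict_udf(paths: pd.Series) -> pd.Series:
    # Runs in the worker's Python process, so the constant takes effect
    # where the images are actually opened.
    from PIL import Image
    Image.MAX_IMAGE_PIXELS = None
    # ... load the fastai learner and run inference on each path here ...
    return paths.apply(lambda p: "prediction")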

Related

Can the memory/compute capacity of the python instance setup by spark executors in PySpark be controlled

According to this answer, python instances are set up when foreachPartition, mapPartitions functions are used in the nodes where the executors run. How are the memory / compute capacities of these instances set? Do we get to control it through some configuration?
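As a sketch (the values are hypothetical), executor-side memory can be influenced with settings such as spark.executor.memory and, on Spark 2.4+, spark.executor.pyspark.memory, which caps the memory of the Python worker processes launched by each executor:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")           # JVM heap per executor
         .config("spark.executor.cores", "2")             # cores (and hence Python workers) per executor
         .config("spark.executor.pyspark.memory", "2g")   # cap for the executor's Python processes (Spark 2.4+)
         .getOrCreate())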

Spark exception 5063 in TensorFlow extended example code on GPU

I am trying to run the TensorFlow Extended example code at https://www.tensorflow.org/tfx/tutorials/transform/census on a Databricks GPU cluster.
My env:
7.1 ML Spark 3.0.0 Scala 2.12 GPU
python 3.7
tensorflow: Version: 2.1.1
tensorflow-transform==0.22.0
apache_beam==2.21.0
When I run
transform_data(train, test, temp)
I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
It seems that this is a known issue with RDDs in Spark.
https://issues.apache.org/jira/browse/SPARK-5063
I tried searching for solutions here, but none of them worked for me.
how to deal with error SPARK-5063 in spark
In the example code, I do not see where SparkContext is explicitly accessed from a worker.
Is it called from Apache Beam?
Thanks
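For reference, here is a minimal illustration of the pattern that SPARK-5063 forbids (a generic sketch, not taken from the TFX example): any function shipped to workers that closes over the SparkContext, an RDD, or anything holding them.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(10))

# Fails with SPARK-5063: the lambda closes over `sc`, which would have to be
# serialized and used on the workers.
# bad = rdd.map(lambda x: sc.parallelize([x]).count()).collect()

# Functions that do not touch sc, spark, or other RDDs are fine on workers.
ok = rdd.map(lambda x: x * 2).collect()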

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a tf-idf matrix and predicting 9 categorical columns with it) in a Jupyter notebook. It takes about 5 minutes when I manually execute all the cells. When running the same script with spark-submit, it takes about 45 minutes. What is happening?
The same excess time also occurs if I run the code with python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, as you mentioned: a notebook, the pyspark shell, and spark-submit.
Regarding the Jupyter notebook or pyspark shell:
While you are running your code in a Jupyter notebook or the pyspark shell, default values may already be set for executor memory, driver memory, executor cores, etc.
Regarding spark-submit:
However, when you use spark-submit these defaults can be different. So the best way is to pass these values as flags while submitting the PySpark application with the spark-submit utility.
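For example (hypothetical script name, values mirroring the configuration in the question), the same settings can be passed as spark-submit flags:
spark-submit \
  --driver-memory 80G \
  --executor-memory 45G \
  --conf spark.driver.maxResultSize=20G \
  your_script.py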
The configuration object you have created can be passed while creating the SparkContext (sc):
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" means Spark will do the operations on the local machine, using X cores. So you have to tune X to the number of cores available on your machine.
To use it with a YARN cluster, you have to put "yarn" as the master.
There are many other possibilities, listed here: https://spark.apache.org/docs/latest/submitting-applications.html
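As a sketch, the same builder without a hard-coded local master (or pointing explicitly at YARN) would look like this; the app name is just a placeholder:
from pyspark.sql import SparkSession

# Let spark-submit decide the master (preferred), or set it explicitly:
spark = SparkSession.builder.appName("Test").getOrCreate()
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()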

Pyspark reading caffe models from HDFS

I am using the caffe library for image detection with the PySpark framework. I am able to run the Spark program in local mode, where the model is present on the local file system.
But when I want to deploy it in cluster mode, I don't know the correct way to do it. I have tried the following approach:
Adding the files to HDFS, and using addFile or --files when submitting jobs:
sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")
Reading the model in each worker node using
model_weight =SparkFiles.get('test.caffemodel')
net = caffe.Net(model_define, model_weight, caffe.TEST)
SparkFiles.get() returns the local file location on the worker node (not the HDFS one), so I can reconstruct my model using the path it returns. This approach also works fine in local mode; however, in distributed mode it results in the following error:
ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
It seems like the data is too large to shuffle, as discussed in Apache Spark: network errors between executors. However, the model is only around 1 MB in size.
Updated:
I found that if the path in sc.addFile(path) is on HDFS, then the error does not appear. However, when the path is on the local file system, the error does appear.
My questions are:
1. Is there any other possibility that could cause the above exception, other than the size of the file? (Spark is running on YARN, and I use the default shuffle service, not the external shuffle service.)
2. If I do not add the file at submit time, how do I read the model file from HDFS using PySpark, so that I can reconstruct the model with the caffe API? Or is there any way to get the path other than SparkFiles.get()?
Any suggestions will be appreciated!!
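One possible way to approach the second question (a sketch, not from the post: the paths and prototxt location are placeholders, and it assumes pyarrow with HDFS support is available on the workers) is to pull the model bytes from HDFS inside mapPartitions and hand caffe a local temporary copy:
import os
import tempfile
from pyarrow import fs

model_define = "/local/path/deploy.prototxt"  # placeholder for the network definition

def detect_partition(rows):
    import caffe
    # Read the weights straight from HDFS on the worker; "default" is assumed
    # to resolve via the worker's Hadoop configuration.
    hdfs = fs.HadoopFileSystem("default")
    with hdfs.open_input_stream("/caffe-public/dataset/test.caffemodel") as src:
        weights = src.read()
    # caffe.Net expects a local path, so stage the bytes in a temp file.
    local_path = os.path.join(tempfile.gettempdir(), "test.caffemodel")
    with open(local_path, "wb") as out:
        out.write(weights)
    net = caffe.Net(model_define, local_path, caffe.TEST)
    for row in rows:
        yield row  # run detection with net here

# e.g. predictions = image_rdd.mapPartitions(detect_partition)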

How to have Apache Spark running on GPU?

I want to integrate Apache Spark with GPUs, but Spark runs on the JVM while GPUs use CUDA/OpenCL, so how do we combine them?
It depends on what you want to do. If you want to distribute your computation to GPUs using Spark, you don't necessarily have to use Java. You could use Python (PySpark) with numba, which has a cuda module.
For example, you can use the following code if you want your worker nodes to run an operation (here gpu_function) on every partition of your RDD.
rdd = rdd.mapPartitions(gpu_function)
with:
def gpu_function(x):
    ...
    input = f(x)        # f: placeholder that turns the partition into a float32 array
    output = ...        # placeholder: allocate an output array of the same size
    gpu_cuda[grid_size, block_size](input, output)
    return output
and:
from numba import cuda

@cuda.jit("(float32[:],float32[:])")
def gpu_cuda(input, output):
    i = cuda.grid(1)              # one thread per element
    if i < input.size:
        output[i] = g(input[i])   # g: placeholder element-wise device operation
I advise you to take a look at this slideshare: https://fr.slideshare.net/continuumio/gpu-computing-with-apache-spark-and-python , specifically slide 34.
You only need numba and the CUDA driver installed on every worker node.
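A more complete, runnable sketch (the kernel name and sample data are made up, and it assumes numba plus the CUDA driver are installed on the workers): it adds 1.0 to every element of an RDD of floats, with the kernel launched per partition.
import numpy as np
from numba import cuda
from pyspark.sql import SparkSession

@cuda.jit("(float32[:], float32[:])")
def add_one(inp, out):
    i = cuda.grid(1)
    if i < inp.size:
        out[i] = inp[i] + 1.0

def gpu_partition(rows):
    data = np.fromiter(rows, dtype=np.float32)
    if data.size == 0:
        return iter([])
    out = np.zeros_like(data)
    threads = 128
    blocks = (data.size + threads - 1) // threads
    add_one[blocks, threads](data, out)  # numba copies the host arrays to the GPU and back
    return iter(out.tolist())

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([float(i) for i in range(1000)])
print(rdd.mapPartitions(gpu_partition).take(5))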
There are a few libraries that help with this dilemma.
Databricks is working on a solution for Spark with TensorFlow that will allow you to use the GPUs of your cluster, or of your machine.
If you want to learn more, there is a presentation from Spark Summit Europe 2016 that shows a bit of how TensorFrames works.
There is also a post about TensorFrames on the Databricks blog.
And for more code, see the TensorFrames Git repository.
