Spark exception 5063 in TensorFlow extended example code on GPU - apache-spark

I am trying to run the TensorFlow Extended example code at https://www.tensorflow.org/tfx/tutorials/transform/census on a Databricks GPU cluster.
My environment:
7.1 ML Spark 3.0.0 Scala 2.12 GPU
python 3.7
tensorflow==2.1.1
tensorflow-transform==0.22.0
apache_beam==2.21.0
When I run
transform_data(train, test, temp)
I get the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
It seems this is a known issue with RDDs on Spark:
https://issues.apache.org/jira/browse/SPARK-5063
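For reference, the classic pattern that triggers SPARK-5063 is referencing sc inside a transformation or closure; a minimal, made-up illustration (not the TFX code) would be:
rdd = sc.parallelize(range(10))
# Referencing sc inside the lambda ships the SparkContext to the workers,
# which raises the SPARK-5063 exception when the closure is serialized.
rdd.map(lambda x: sc.parallelize([x]).count()).collect()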
I tried searching for solutions here, but none of them work for me:
how to deal with error SPARK-5063 in spark
In the example code, I do not see where SparkContext is explicitly accessed from a worker.
Is it called from Apache Beam?
Thanks

Related

How to access java runtime variables like java.lang.Runtime.getRuntime().maxMemory() for pyspark executors?

The question is all there is. I want a way to check the Java runtime variables for the executor JVM created, but I am working with pyspark. How can I access java.lang.Runtime.getRuntime().maxMemory() if I am working with pyspark?
Based on the comment, I have tried to run the following code, but both approaches are unsuccessful.
# create an RDD
l = sc.range(100)
Now, I have to run func = sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory() on each executor, so I do the following:
l.map(lambda x: sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()).collect()
Which results in
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The spark context can only be used on the driver
I also tried
func = sc._gateway.jvm.java.lang.Runtime.getRuntime()
l.map(lambda x:func.maxMemory()).collect()
which results in the following error
TypeError: cannot pickle '_thread.RLock' object
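A sketch of the constraint these two errors point at (hypothetical code, with the caveat in the comments): the py4j gateway only exists on the driver, so the JVM call has to happen there, and only the resulting plain, picklable value can be captured by the closure.
max_mem = sc._gateway.jvm.java.lang.Runtime.getRuntime().maxMemory()  # driver-side JVM call
l = sc.range(100)
# The closure now captures a plain int, which pickles fine; however this is the
# driver JVM's maxMemory, not each executor's, so it only illustrates the pattern.
l.map(lambda x: max_mem).distinct().collect()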

Notebook to write Java jobs for Spark

I am writing my first Spark job using the Java API.
I want to run it from a notebook.
I am looking into Zeppelin and Jupyter.
In the Zeppelin documentation I see support for Scala, IPySpark and SparkR. It is not clear to me whether using the two interpreters %spark.sql and %java will allow me to work with the Java API of Spark SQL.
Jupyter has an "IJava" kernel, but I see no support for Spark with Java.
Are there other options?
@Victoriia Zeppelin 0.9.0 has a %java interpreter, with an example here:
zeppelin.apache.org
I tried to start with it on Google Cloud, but had some problems...
Use the magic command %jars path/to/spark.jar in an IJava cell, according to IJava's author,
then try import org.apache.spark.sql.* for example.

Use case of spark.executor.allowSparkContext

I'm looking into spark-core and I found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find details in the official Spark documentation.
In the code, there is a short description of this config:
If set to true, SparkContext can be created in executors.
But I wonder: how can a SparkContext be created in executors? As far as I know, the SparkContext is created on the driver, and executors are assigned by the resource manager, so the SparkContext is always created before the executors.
What is the use case of this config?
From the Spark Core migration guide, 3.0 to 3.1:
In Spark 3.0 and below, SparkContext can be created in executors. Since Spark 3.1, an exception will be thrown when creating SparkContext in executors. You can allow it by setting the configuration spark.executor.allowSparkContext when creating SparkContext in executors.
As per SPARK-32160, since version 3.1 there is a check added when creating a SparkContext (see pyspark/context.py for the PySpark side) which prevents executors from creating a SparkContext:
if (conf is None or
        conf.get("spark.executor.allowSparkContext", "false").lower() != "true"):
    # In order to prevent SparkContext from being created in executors.
    SparkContext._assert_on_driver()
# ...

@staticmethod
def _assert_on_driver():
    """
    Called to ensure that SparkContext is created only on the Driver.
    Throws an exception if a SparkContext is about to be created in executors.
    """
    if TaskContext.get() is not None:
        raise Exception("SparkContext should only be created and accessed on the driver.")
I suggest this is an error in the docs and/or the implementation.
The whole concept makes no sense if you (as you do) understand the Spark architecture, and no announcement has been made otherwise about this.
From the other answer and the plentiful documented errors on this aspect, it is clear something went awry.

How to have Apache Spark running on GPU?

I want to integrate Apache Spark with GPUs, but Spark works on Java while GPUs use CUDA/OpenCL, so how do we merge them?
It depends on what you want to do. If you want to distribute your computation across GPUs using Spark, you don't necessarily have to use Java. You could use Python (PySpark) with Numba, which has a CUDA module.
For example, you can apply this pattern if you want your worker nodes to run an operation (here gpu_function) on every partition of your RDD:
rdd = rdd.mapPartitions(gpu_function)
with:
def gpu_function(partition):
    ...
    input = f(partition)        # turn the partition's records into a float32 array
    output = ...                # allocate an output array of the same size
    gpu_cuda[grid_size, block_size](input, output)
    return output
and:
from numba import cuda

@cuda.jit("(float32[:],float32[:])")
def gpu_cuda(input, output):
    i = cuda.grid(1)            # one thread per element
    if i < input.size:
        output[i] = g(input[i]) # per-thread element-wise work
I advise you to take a look at this SlideShare: https://fr.slideshare.net/continuumio/gpu-computing-with-apache-spark-and-python , specifically slide 34.
You only need Numba and the CUDA driver installed on every worker node.
There are a few libraries that help with this dilemma.
Databricks is working on a solution for Spark with TensorFlow that will allow you to use the GPUs of your cluster, or of your machine.
If you want to find out more, there is a presentation from Spark Summit Europe 2016 that shows a little bit of how TensorFrames works.
There is also a post about TensorFrames on the Databricks blog.
And for more code, see the TensorFrames Git repository.

stop all existing spark contexts

I am trying to create a new Spark context using pyspark, and I get the following:
WARN SparkContext: Another SparkContext is being constructed (or threw
an exception in its constructor). This may indicate an error, since
only one SparkContext may be running in this JVM (see SPARK-2243). The
other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
I do not have any other context active (in theory), but maybe it did not finish correctly and is still there. How can I find out if there is another one, or kill all the current ones? I am using Spark 1.5.1.
When you run the pyspark shell and execute a python script inside it (e.g., using execfile()), a SparkContext is already available as sc and a HiveContext as sqlContext. To run a python script without any pre-created contexts, just use ./bin/spark-submit 'your_python_file'.
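One way to clean up, sketched under the assumption that your PySpark version exposes SparkContext.getOrCreate (otherwise call stop() on the sc the shell already created):
from pyspark import SparkContext

# Reuse whatever context is already active in this JVM (or create one),
# stop it, and only then construct the new context you actually want.
old_sc = SparkContext.getOrCreate()
old_sc.stop()

sc = SparkContext(appName="fresh-context")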
