How to have Apache Spark running on GPU?

I want to integrate Apache Spark with GPUs, but Spark runs on Java while GPUs use CUDA/OpenCL. How do we combine them?

It depends on what you want to do. If you want to distribute your computation across GPUs using Spark, you don't necessarily have to use Java. You could use Python (PySpark) with Numba, which has a CUDA module.
For example, you can apply this pattern if you want your worker nodes to run an operation (here gpu_function) on every partition of your RDD:
rdd = rdd.mapPartitions(gpu_function)
with:

def gpu_function(partition):
    # Turn the partition's records into arrays, launch the kernel,
    # and return the computed results to Spark
    input = f(partition)
    output = ...
    gpu_cuda[grid_size, block_size](input, output)
    return output

and:

from numba import cuda

@cuda.jit("(float32[:],float32[:])")
def gpu_cuda(input, output):
    # A CUDA kernel cannot return a value; each thread writes its
    # result into the output array in place
    i = cuda.grid(1)
    if i < input.size:
        output[i] = g(input[i])
I advise you to take a look at this SlideShare deck: https://fr.slideshare.net/continuumio/gpu-computing-with-apache-spark-and-python, specifically slide 34.
You only need Numba and the CUDA driver installed on every worker node.
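To make the pattern concrete, here is a minimal self-contained sketch, assuming Numba and a CUDA driver are available on every worker; the kernel (double_kernel), block size, and sample data are illustrative assumptions, not from the original answer.

import numpy as np
from numba import cuda
from pyspark import SparkContext

@cuda.jit
def double_kernel(arr, out):
    # Each thread doubles one element of the input array
    i = cuda.grid(1)
    if i < arr.size:
        out[i] = arr[i] * 2.0

def gpu_partition(partition):
    # Materialize the partition as a contiguous float32 array
    data = np.fromiter(partition, dtype=np.float32)
    if data.size == 0:
        return []
    out = np.zeros_like(data)
    threads = 128
    blocks = (data.size + threads - 1) // threads
    double_kernel[blocks, threads](data, out)  # Numba copies to/from the GPU
    return out.tolist()

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(np.arange(1000, dtype=np.float32).tolist(), 4)
result = rdd.mapPartitions(gpu_partition).collect()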

There are a few libraries that help with this dilemma.
Databricks is working on a solution for Spark with TensorFlow, TensorFrames, that will allow you to use the GPUs of your cluster or your machine.
If you want to learn more, there is a presentation from Spark Summit Europe 2016 that shows a little of how TensorFrames works.
There is also a post about TensorFrames on the Databricks blog.
And for more code, see the TensorFrames Git repository.
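For a sense of the API, this is approximately the basic example from the TensorFrames README (quoted from memory, so treat the details as indicative): a TensorFlow graph is mapped over the blocks of a Spark DataFrame.

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

df = spark.createDataFrame([Row(x=float(x)) for x in range(10)])

with tf.Graph().as_default():
    # The TensorFlow placeholder that corresponds to column "x"
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name="z")
    # Runs the graph over every block and appends column "z"
    df2 = tfs.map_blocks(z, df)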

Related

Spark exception 5063 in TensorFlow extended example code on GPU

I am trying to run the TensorFlow Extended example code at https://www.tensorflow.org/tfx/tutorials/transform/census on a Databricks GPU cluster.
My env:
Databricks Runtime 7.1 ML (Spark 3.0.0, Scala 2.12, GPU)
Python 3.7
tensorflow==2.1.1
tensorflow-transform==0.22.0
apache_beam==2.21.0
When I run
transform_data(train, test, temp)
I get this error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
It seems that this is a known issue with RDDs on Spark:
https://issues.apache.org/jira/browse/SPARK-5063
I tried to search for solutions here, but none of them work for me:
how to deal with error SPARK-5063 in spark
In the example code, I do not see anywhere that SparkContext is accessed from a worker explicitly. Is it called from Apache Beam?
Thanks
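(For context, SPARK-5063 is raised whenever a function shipped to the executors captures the SparkContext. A minimal illustration of the trigger, unrelated to the TFX tutorial itself:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(10))

# This closure captures sc; Spark raises SPARK-5063 because the
# SparkContext cannot be serialized and used on a worker.
bad = rdd.map(lambda x: sc.parallelize([x]).count())
bad.collect()
)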

How can I utilize the driver node GPU with Horovod on an Azure Databricks cluster?

When I create a cluster with one driver + two workers, each with one GPU, and try to launch training on each GPU, I write:
from sparkdl import HorovodRunner
hr = HorovodRunner(np=3)
hr.run(train_hvd)
But receive the following error message:
HorovodRunner was called with np=3, which is greater than the maximum processes that can be placed
on this cluster. This cluster can place at most 2 processes on 2 executors. Training won't start
until there are enough workers on this cluster. You can increase the cluster size or cancel the
current run and retry with a smaller np.
Apparently HorovodRunner does not consider the GPU on the driver node (correct?). When I use the options np=-1 (driver GPU only), np=2 (2 GPUs on the workers), or np=-2 (driver only, with 2 processes), everything works fine, i.e. there is nothing functionally wrong with my code; I just cannot get it to utilize all 3 available GPUs.
(a) Is there a way to make Horovod include the GPUs on the driver node in distributed learning?
(b) Alternatively: is there a way to create a cluster with GPU workers but a non-GPU driver in Databricks?
To run HorovodRunner on the driver only with n subprocesses, use hr = HorovodRunner(np=-n). For example, if there are 4 GPUs on the driver node, you can choose n up to 4.
Parameters: np –
number of parallel processes to use for the Horovod job. This argument only takes effect on Databricks Runtime 5.0 ML and above. It is ignored in the open-source version. On Databricks, each process will take an available task slot, which maps to a GPU on a GPU cluster or a CPU core on a CPU cluster. Accepted values are:
If <0, this will spawn -np subprocesses on the driver node to run Horovod locally. Training stdout and stderr messages go to the notebook cell output, and are also available in driver logs in case the cell output is truncated. This is useful for debugging and we recommend testing your code under this mode first. However, be careful of heavy use of the Spark driver on a shared Databricks cluster. Note that np < -1 is only supported on Databricks Runtime 5.5 ML and above.
If >0, this will launch a Spark job with np tasks starting all together and run the Horovod job on the task nodes. It will wait until np task slots are available to launch the job. If np is greater than the total number of task slots on the cluster, the job will fail. As of Databricks Runtime 5.4 ML, training stdout and stderr messages go to the notebook cell output. In the event that the cell output is truncated, full logs are available in stderr stream of task 0 under the 2nd spark job started by HorovodRunner, which you can find in the Spark UI.
If 0, this will use all task slots on the cluster to launch the job.
You can find details about the parameter np in the HorovodRunner API documentation and "HorovodRunner: Distributed Deep Learning with Horovod".
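Concretely, using the train_hvd function from the question, the accepted values map to calls like these:

from sparkdl import HorovodRunner

# np > 0: run on worker task slots (here, the 2 worker GPUs)
HorovodRunner(np=2).run(train_hvd)

# np < 0: run -np subprocesses locally on the driver (good for debugging)
HorovodRunner(np=-1).run(train_hvd)

# np = 0: use every task slot on the cluster
HorovodRunner(np=0).run(train_hvd)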
Hope this helps.

Does python multiprocessing work with Hadoop streaming?

In Hadoop Streaming, where the Mapper and Reducer are written in Python, does it help to make the Mapper process use the multiprocessing module? Or does the scheduler prevent the Mapper scripts from running on multiple threads on the compute nodes?
In classic MapReduce there is nothing that stops you from having multiple threads in a mapper or a reducer. The same is true for Hadoop Streaming: you can very well have multiple threads per mapper or reducer. This can be useful if you have a CPU-heavy job and want to speed it up.
If you're doing Hadoop Streaming with Python, you can use the multiprocessing module to speed up your mapper phase.
Note that depending on how your Hadoop cluster is configured (how many JVM mappers/reducers per node), you may have to adjust the maximum number of processes you use.
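A minimal sketch of such a streaming mapper, where the heavy function stands in for whatever CPU-intensive transformation you actually run:

#!/usr/bin/env python
import sys
from multiprocessing import Pool

def heavy(line):
    # Placeholder for the CPU-intensive work; emits tab-separated
    # key/value pairs as Hadoop Streaming expects
    key, value = line.rstrip("\n").split("\t", 1)
    return "%s\t%s" % (key, value.upper())

if __name__ == "__main__":
    # Tune the pool size to the cores available per mapper slot
    with Pool(processes=4) as pool:
        for out in pool.imap(heavy, sys.stdin, chunksize=100):
            print(out)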

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a TF-IDF and predicting 9 categorical columns with it) in a Jupyter notebook. It takes some 5 minutes when manually executing all cells, but some 45 minutes when running the same script with spark-submit. What is happening?
The same thing (the excess time) also happens if I run the code using python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, as you mentioned: a notebook, the PySpark shell, and spark-submit.
Regarding a Jupyter notebook or the pyspark shell:
While you are running your code in a Jupyter notebook or the pyspark shell, it may have set some default values for executor memory, driver memory, executor cores, etc.
Regarding spark-submit:
When you use spark-submit, the default values can be different. So the best way is to pass these values as flags when submitting the PySpark application with the spark-submit utility.
The configuration object you have created can be passed while creating the SparkContext (sc):
sc = SparkContext(conf=conf)
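For example, the equivalent spark-submit invocation for the configuration in the question would look like the following (my_script.py is a placeholder for your script):

spark-submit \
  --executor-memory 45G \
  --driver-memory 80G \
  --conf spark.driver.maxResultSize=20G \
  my_script.py

Note that --driver-memory in particular must be passed on the command line (or in spark-defaults.conf): by the time your script sets it in SparkConf, the driver JVM has already started.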
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to do the operations on the local machine, on X cores. So you have to match X to the number of cores available on your machine.
To use it with a YARN cluster, you have to put "yarn".
Many other possibilities are listed here: https://spark.apache.org/docs/latest/submitting-applications.html
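For instance ("local[*]" asks Spark to use all available local cores):

from pyspark.sql import SparkSession

# Use every available local core instead of a single one
spark = SparkSession.builder.master("local[*]").appName("Test").getOrCreate()

# Or, when submitting to a YARN cluster:
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()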

PySpark: pull data to driver and then upload to dataframe

I am trying to create a pyspark dataframe from data stored in an external database. I use the pyodbc module to connect to the database and pull the required data, after which I use spark.createDataFrame to send my data to the cluster for analysis.
I run the script using --deploy-mode client, so the driver runs on the master node, but the executors can be distributed to other machines. The problem is that pyodbc is not installed on any of the worker nodes (this is fine, since I don't want them all querying the database anyway), so when I try to import this module in my script, I get an import error (unless all the executors happen to be on the master node).
My question is how can I specify that I want a certain portion of my code (in this case, importing pyodbc and querying the database) to run on the driver only? I am thinking something along the lines of
if __name__ == '__driver__':
<do stuff>
else:
<wait until stuff is done>
Your imports in your Python driver DO only run on the master. The only time you will see errors on your executors about missing imports is if you are referencing some object or function from one of those imports inside a function that Spark ships to the executors. I would look carefully at any Python code you pass into RDD/DataFrame calls for unintended references. If you post your code, we can give you more specific guidance.
Also, routing data through your driver is usually not a great idea because it will not scale well. If you have lots of data, you are forcing it all through a single point, which defeats the purpose of distributed processing!
Depending on what database you are using, there is probably a Spark connector implemented to load it directly into a DataFrame. Since you are using ODBC, maybe you are using SQL Server? In that case you should be able to use the JDBC drivers, as in this post:
https://stephanefrechette.com/connect-sql-server-using-apache-spark/#.Wy1S7WNKjmE
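A sketch of what reading through Spark's built-in JDBC data source looks like; the URL, table, and credentials are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# The executors read from the database directly; no pyodbc required,
# only the JDBC driver jar on the cluster's classpath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")
      .option("user", "username")
      .option("password", "password")
      .load())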
This is not how Spark is supposed to work. Spark collections (RDDs or DataFrames) are inherently distributed. What you're describing is creating a dataset locally by reading the whole dataset into the driver's memory and then sending it over to the executors by creating an RDD or DataFrame out of it. That does not make much sense.
If you want to make sure that there is only one connection from Spark to your database, then set the parallelism to 1. You can then increase the parallelism in further transformation steps.
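For example (a sketch reusing the placeholder JDBC options above; the JDBC source reads through a single connection when no partitioning options are given):

# One connection to the database: a single partition is used when
# no partitionColumn/lowerBound/upperBound options are supplied.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.my_table")
      .load())

# Then fan out for the actual processing
df = df.repartition(32)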
