Install spark-nlp with GPU - apache-spark

I'm a newbie in PySpark and Spark NLP, and I want to use spark-nlp in a Docker container with GPU support on WSL 2 (Windows 10).
After installing spark-nlp I can use pretrained models and pipelines, but there is no speed difference between CPU and GPU. nvidia-smi shows that the model is loaded into GPU memory.
Can you please tell me which library versions I have to install, or what kind of problem this is?
Thanks

You have two options for enabling GPU in Spark NLP, depending on how you start the session:
import sparknlp
spark = sparknlp.start(gpu=True)
or by passing the GPU package explicitly when you build the session yourself:
spark = SparkSession.builder \
.appName("Spark NLP")\
.master("local[*]")\
.config("spark.driver.memory","16G")\
.config("spark.driver.maxResultSize", "0") \
.config("spark.kryoserializer.buffer.max", "2000M")\
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2")\
.getOrCreate()
Also, how much the GPU helps will depend on which model you're using and on the dataset size, so don't expect an automatic speed-up.
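If you want to check whether the GPU build is actually paying off, a rough timing comparison like the sketch below can help (the pipeline name and batch size are only illustrative; reuse whatever pretrained pipeline you already load):
import time
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# run once with gpu=True and once with gpu=False to get a CPU baseline
spark = sparknlp.start(gpu=True)
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# use a batch large enough that model inference, not Spark overhead, dominates
rows = [("Spark NLP is an open-source text processing library.",)] * 5000
data = spark.createDataFrame(rows, ["text"])

start = time.time()
pipeline.transform(data).count()   # count() forces the whole pipeline to run
print("elapsed: %.1f s" % (time.time() - start))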

Related

Spark exception 5063 in TensorFlow extended example code on GPU

I am trying to run the TensorFlow Extended example code at https://www.tensorflow.org/tfx/tutorials/transform/census
on a Databricks GPU cluster.
My environment:
7.1 ML Spark 3.0.0 Scala 2.12 GPU
Python 3.7
tensorflow==2.1.1
tensorflow-transform==0.22.0
apache_beam==2.21.0
When I run
transform_data(train, test, temp)
I get the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
It seems that this is a known issue of RDD on Spark.
https://issues.apache.org/jira/browse/SPARK-5063
I tried to search for solutions here, but none of them work for me:
how to deal with error SPARK-5063 in spark
In the example code, I do not see where SparkContext is accessed from a worker explicitly. Is it called from Apache Beam?
Thanks
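For reference, a minimal sketch of the pattern that SPARK-5063 forbids (this is illustrative only and not taken from the TFX example; with the Beam/TFX code the offending reference may sit inside the runner rather than in your own code):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-5063-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

def bad_map(x):
    # referencing the SparkContext (or another RDD / a broadcast variable) inside
    # a function that is shipped to the workers is exactly what SPARK-5063 forbids
    return sc.parallelize([x]).count()

# rdd.map(bad_map).collect()   # raises the SPARK-5063 exception quoted above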

Notebook vs spark-submit

I'm very new to PySpark.
I am running a script (mainly creating a TF-IDF and predicting 9 categorical columns with it) in a Jupyter Notebook. It takes about 5 minutes when I execute all cells manually, but when I run the same script from spark-submit it takes about 45 minutes. What is happening?
The same excess time also occurs if I run the code with python from the terminal.
I am also setting the configuration in the script as
conf = SparkConf().set('spark.executor.memory', '45G').set('spark.driver.memory', '80G').set('spark.driver.maxResultSize', '20G')
Any help is appreciated. Thanks in advance.
There are various ways to run your Spark code, and you have mentioned a few of them: notebook, pyspark shell and spark-submit.
Regarding the Jupyter notebook or the pyspark shell:
While you are running your code in a Jupyter notebook or the pyspark shell, it may have set some default values for executor memory, driver memory, executor cores, etc.
Regarding spark-submit:
However, when you use spark-submit, these defaults can be different. So the best way is to pass these values as flags while submitting the PySpark application with the spark-submit utility, for example as shown below.
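For example, mirroring the values from your SparkConf (the script name is only a placeholder):
spark-submit \
  --driver-memory 80G \
  --executor-memory 45G \
  --conf spark.driver.maxResultSize=20G \
  your_tfidf_script.py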
The configuration object which you have created can also be passed while creating the SparkContext (sc):
sc = SparkContext(conf=conf)
Hope this helps.
Regards,
Neeraj
I had the same problem, but to initialize my spark variable I was using this line:
spark = SparkSession.builder.master("local[1]").appName("Test").getOrCreate()
The problem is that "local[X]" tells Spark to do the operations on the local machine, using X cores, so you have to match X to the number of cores available on your machine.
To use it with a YARN cluster, you have to put "yarn" instead, as in the sketch below.
There are many other possibilities listed here: https://spark.apache.org/docs/latest/submitting-applications.html
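A sketch of the two masters mentioned above ("Test" is just the app name from the snippet above):
from pyspark.sql import SparkSession

# use every core available on the local machine:
spark = SparkSession.builder.master("local[*]").appName("Test").getOrCreate()

# or, to run on a YARN cluster instead (pick one master per application):
# spark = SparkSession.builder.master("yarn").appName("Test").getOrCreate()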

How to have Apache Spark running on GPU?

I want to integrate Apache Spark with GPUs, but Spark works on the JVM while GPUs use CUDA/OpenCL, so how do we combine them?
It depends on what you want to do. If you want to distribute your computation with GPUs using Spark, you don't necessarily have to use Java. You could use Python (PySpark) with Numba, which has a CUDA module.
For example, you can use the pattern below if you want your worker nodes to run an operation (here gpu_function) on every partition of your RDD.
rdd = rdd.mapPartitions(gpu_function)
with:
def gpu_function(x):
    # x is an iterator over the elements of one RDD partition
    input = f(x)                                    # build the kernel input array
    output = ...                                    # allocate the output array
    gpu_cuda[grid_size, block_size](input, output)  # launch the CUDA kernel
    return output
and:
from numba import cuda

@cuda.jit("(float32[:],float32[:])")
def gpu_cuda(input, output):
    i = cuda.grid(1)               # absolute index of this GPU thread
    if i < input.size:
        output[i] = g(input[i])    # g is whatever per-element computation you need
I advise you to take a look at this SlideShare deck: https://fr.slideshare.net/continuumio/gpu-computing-with-apache-spark-and-python, specifically slide 34.
You only need Numba and the CUDA driver installed on every worker node.
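Putting the pieces together, here is a small self-contained sketch along these lines (the kernel just doubles every value; all names are illustrative, and it assumes Numba plus the CUDA driver on each worker and that sc is your SparkContext):
import numpy as np
from numba import cuda

@cuda.jit
def double_kernel(inp, out):
    i = cuda.grid(1)              # absolute index of this GPU thread
    if i < inp.size:
        out[i] = 2.0 * inp[i]

def gpu_partition(iterator):
    # gather one RDD partition into a NumPy array, run the kernel, emit the results
    data = np.fromiter(iterator, dtype=np.float32)
    if data.size == 0:
        return iter([])
    out = np.empty_like(data)
    threads = 128
    blocks = (data.size + threads - 1) // threads
    double_kernel[blocks, threads](data, out)   # host arrays are copied to/from the GPU
    return iter(out.tolist())

result = sc.parallelize(range(1000000), 8).mapPartitions(gpu_partition).take(5)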
There are a few libraries that help with this dilemma.
Databricks is working on a solution for Spark with TensorFlow that will allow you to use the GPUs of your cluster, or of your machine.
If you want to find out more, there is a presentation from Spark Summit Europe 2016 that shows a bit of how TensorFrames works.
There is also a post about TensorFrames on the Databricks blog.
And for more code, see the TensorFrames GitHub repository.

Why do I see only 200 tasks in stages?

I have a Spark cluster with 8 machines, 256 cores and 180 GB RAM per machine. I have started 32 executors, with 32 cores and 40 GB RAM each.
I am trying to optimize a complex application and I notice that a lot of the stages have 200 tasks. This seems sub-optimal in my case. I have tried setting the parameter spark.default.parallelism to 1024 but it appears to have no effect.
I run Spark 2.0.1 in standalone mode; my driver is hosted on a workstation, running inside a PyCharm debug session. I have set spark.default.parallelism in:
spark-defaults.conf on the workstation
spark-defaults.conf in the cluster's spark/conf directory
the call that builds the SparkSession on my driver
This is that call:
spark = SparkSession \
.builder \
.master("spark://stcpgrnlp06p.options-it.com:7087") \
.appName(__SPARK_APP_NAME__) \
.config("spark.default.parallelism",numOfCores) \
.getOrCreate()
I have restarted the executors since making these changes.
If I understood this correctly, having only 200 tasks in a stage means that my cluster is not being fully utilized?
When I watch the machines using htop I can see that I'm not getting full CPU usage. Maybe on one machine at one time, but not on all of them.
Do I need to call .rdd.repartition(1024) on my dataframes? Seems like a burden to do that everywhere.
The 200 tasks come from spark.sql.shuffle.partitions, which defaults to 200 for DataFrame shuffles (spark.default.parallelism only applies to RDD operations). Try setting this configuration:
set("spark.sql.shuffle.partitions", "8")
where 8 is the number of shuffle partitions that you want, or on the SparkSession builder:
.config("spark.sql.shuffle.partitions", "2")

How to train word2vec model efficiently in the spark cluster environment?

I want to train a word2vec model on about 10 GB of news corpus on my Spark cluster.
The following is the configuration of my Spark cluster:
one master and 4 workers,
each with 80 GB memory and 24 cores
However, I find that training Word2Vec using Spark MLlib doesn't take full advantage of the cluster's resources.
For example:
(screenshot of the top command in Ubuntu)
As the picture shows, only 100% CPU is used on one worker; the other three workers are not in use (so I haven't pasted their pictures). I just trained a word2vec model on about 2 GB of news corpus and it took about 6 hours, so I want to know how to train the model more efficiently. Thanks everyone in advance :)
UPDATE 1: this is the command I use to start the spark-shell:
spark-shell \
--master spark://ip:7077 \
--executor-memory 70G \
--driver-memory 70G \
--conf spark.akka.frameSize=2000 \
--conf spark.driver.maxResultSize=0 \
--conf spark.default.parallelism=180
and the following is the code I use to train the word2vec model in the spark-shell:
//import related packages
import org.apache.spark._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
//read about 10G newsdata corpus
val newsdata = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*",600).map(line => line.split(" ").toSeq)
//Configure word2vec parameters
val word2vec = new Word2Vec()
word2vec.setMinCount(10)
word2vec.setNumIterations(10)
word2vec.setVectorSize(200)
//train the model
val model = word2vec.fit(newsdata)
UPDATE 2:
I have been training the model for about 24 hours and it still hasn't completed. The cluster looks the same as before: only one worker shows 100% CPU and the other three workers are idle.
I experienced a similar problem in Python when training a Word2Vec model. Looking at the PySpark docs for word2vec here, it reads:
setNumIterations(numIterations) Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions. New in version 1.2.0.
setNumPartitions(numPartitions) Sets number of partitions (default: 1). Use a small number for accuracy. New in version 1.2.0.
My word2vec model stopped hanging, and Spark stopped running out of memory, when I increased the number of partitions used by the model so that numIterations <= numPartitions.
I suggest you set word2vec.setNumIterations(1) or word2vec.setNumPartitions(10), for instance as in the sketch below.
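In PySpark that suggestion looks roughly like this (a sketch only; the HDFS path and the other parameter values are taken from the question, and sc is assumed to be your SparkContext):
from pyspark.mllib.feature import Word2Vec

corpus = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*", 600) \
    .map(lambda line: line.split(" "))

word2vec = Word2Vec() \
    .setMinCount(10) \
    .setVectorSize(200) \
    .setNumPartitions(10) \
    .setNumIterations(10)   # keep numIterations <= numPartitions

model = word2vec.fit(corpus)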
Since your model is taking too long to train, I think you should first try to understand how Spark actually speeds up the model training. As per this paper:
Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty
Spark MLlib removes this performance penalty by caching the data in memory during the first iteration, so subsequent iterations are much quicker than the first and model training time drops significantly. In your case, I think the executor memory might be insufficient to hold a partition of the data in memory. Its contents would then be spilled to disk and fetched from disk again in every iteration, killing any performance benefit from Spark. To make sure this is actually the case, look at the executor logs, which would contain lines like "Unable to store rdd_x_y in memory".
If this is indeed the case, you'll need to adjust --num-executors, --executor-memory and numPartitions to find values that can load the entire data into memory. You can experiment with a small data set, a single executor and a small executor memory on your local machine, and analyze the logs while incrementally increasing executor memory, to see at which configuration the data is fully cached in memory. Once you have the configs for the small data set, you can do the math to figure out how many executors with how much memory are required, and what the number of partitions should be for the required partition size.
I had faced a similar problem and managed to bring down model training time from around 4 hours to 20 minutes by following the above steps.
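A quick way to check the caching hypothesis (shown in PySpark for illustration; what MLlib caches internally can differ by version, so treat this as a diagnostic sketch):
from pyspark import StorageLevel

corpus = sc.textFile("hdfs://ip:9000/user/bd/newsdata/*", 600) \
    .map(lambda line: line.split(" ")) \
    .persist(StorageLevel.MEMORY_ONLY)

corpus.count()   # force one full pass so the cache is populated

# Now check the "Storage" tab of the Spark UI: if "Fraction Cached" is below 100%,
# the executors cannot hold the corpus in memory and every iteration re-reads from
# disk, so increase executor memory or the number of partitions.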
