Scenario :
I have setup a spark cluster on my kubernetes environment :
Livy Pod for submission of jobs
Spark Master Pod
Spark Worker Pod for execution
What I want to achieve is as follows:
I have a jupyter notebook with a Pyspark kernel as a pod in the same environment wherein on the execution of cells a spark session is created and using livy post request /statements all my code gets executed. I was able to achieve the above scenario
Note : There is no YARN, HDFS, Hadoop in my env. I have made use of kubernetes, spark standalone and jupyter only.
Issue :
Now what I wanted, is when I run my pyspark code and it gets executed over in the spark worker, I would like to send the following over in that execution environment :
environment variables which I have used in the notebook
pip packages which I have used in the notebook
or a custom virtualenv in which i could provide all the packages
used together
I am unable to do the same.
Things that I have tried out so far :
Since I made use of spark magic, have tried to set environment variables using the following ways I could find in the documentations and other answers.
%%configure {
"conf": {
spark.executorEnv.TESTVAR
spark.appMasterEnv.TESTVAR
spark.driver.TESTVAR
spark.driverenv.TESTVAR
spark.kubernetes.driverenv.TESTVAR
spark.kubernetes.driver.TESTVAR
spark.yarn.executorEnv.TESTVAR
spark.yarn.appMasterEnv.TESTVAR
spark.workerenv.TESTVAR
}
}
Bunching up for reference, I have tried the above options individually.
I have tried directly hitting the livy pod's service name like a normal post request but still no luck.
But the variables are still not getting detected
After this I tried directly setting the same manually in spark-defaults.conf in the spark cluster but did not work.
Would appreciate any help on the matter. Also is my first SO question so please let know incase of issues.
Related
I'm trying to use the following Helm Chart for Spark on Kubernetes
https://github.com/bitnami/charts/tree/main/bitnami/spark
The documentation is of course spotty but I've muddled along. So I have it installed with custom values that assign things like resource limits etc. I'm accessing the master through a NodePort and the WebUI through a port forward. I am NOT using spark-submit, I'm writing Python code to drive the Spark Cluster as follows:
import pyspark
sc = pyspark.SparkContext(appName="Testy", master="spark://<IP>:<PORT>")
This Python code is running locally on my Windows laptop, the Kubernetes cluster is on a separate set of servers. It connects and I can see the app appear in the WebUI but the second it tries to do something I get the following:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The master seems to be in a cycle of removing and launching executors and the 3 workers each just fail to run a launch command. Interestingly the command has the hostname of my laptop in here:
"--driver-url" "spark://CoarseGrainedScheduler#<laptop hostname>:60557"
Got to imagine that's not right. So in this setup where should I be actually running the python code? On the kubernetes cluster? Can I run it locally on my laptop? These details are of course missing from the docs. I'm new to Spark so just looking for the absolute basics. My preferred workflow would be to develop code locally on my laptop then run it on the Kubernetes cluster I have access to.
I am looking for help in running hop pipelines on Spark cluster, running on kubernetes.
I have spark master deployed with 3 worker nodes on kubernetes
I am using hop-run.sh command to run pipeline on spark running on kubernetes.
Facing Below exception
-java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.services.s3.AmazonS3ClientBuilder
Looks like fat.jar is not getting associated with the spark when running hop-run.sh command.
I tried running same with spark-submit command too but not sure how to pass references of pipelines and workflows to Spark running on kubernetes, though I am able to add fat jar to the classpath (can be seen in logs)
Any kind of help is appreciated.
Thanks
like
Could it be that you are using version 1.0?
We had a missing jar for S3 VFS which has been resolved in 1.1
https://issues.apache.org/jira/browse/HOP-3327
For more information on how to use spark-submit you can take a look at the following documentation:
https://hop.apache.org/manual/latest/pipeline/pipeline-run-configurations/beam-spark-pipeline-engine.html#_running_with_spark_submit
The location to the fat-jar the pipeline and the required metadata-export can all be VFS locations so no need to place those on the cluster itself.
I'm trying to figure out which is the best way to work with Airflow and Spark/Hadoop.
I already have a Spark/Hadoop cluster and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to Spark/Hadoop cluster.
Any advice about it? Looks like it's a little complicated to deploy spark remotely from another cluster and that will create some file configuration duplication.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver being managed by Airflow isn't a bad idea)
Once an Application Master is deployed within YARN, then Spark is running locally to the Hadoop cluster.
If you really want, you could add a hdfs-site.xml and hive-site.xml to be submitted as well from Airflow (if that's possible), but otherwise at least hdfs-site.xml files should be picked up from the YARN container classpath (not all NodeManagers could have a Hive client installed on them)
I prefer submitting Spark Jobs using SSHOperator and running spark-submit command which would save you from copy/pasting yarn-site.xml. Also, I would not create a cluster for Airflow if the only task that I perform is running Spark jobs, a single VM with LocalExecutor should be fine.
There are a variety of options for remotely performing spark-submit via Airflow.
Emr-Step
Apache-Livy (see this for hint)
SSH
Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.
I just created a Google Cloud cluster (1 master and 6 workers) and by default Spark is configured.
I have a pure python code that uses NLTK to build the dependency tree for each line from a text file. When I run this code on the master spark-submit run.py I get the same execution time when I run it using my machine.
How to make sure that the master is using the workers in order to reduce the execution time ?
You can check the spark UI. If its running on top of yarn, please open the yarn UI and click on your application id which will open the spark UI. Check under the executors tab it will have the node ip address also.
could you please share your spark submit config.
Your command 'spark-submit run.py' doesn't seem to send your job to YARN. To do such thing, you need to add the --master parameter. For example, a valid command to execute a job in YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, this execution will be straightforward. Anyway, check this link for another parameter that spark-submit accept.
For a Dataproc cluster (Hadoop Google cluster) you have two options to check the job history including the ones that are running:
By command line from the master: yarn application -list, this option sometimes needs additional configuration. If you have troubles, this link will be useful.
By UI. Dataproc enables you to access the Spark Web UI, it improves monitoring tasks. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use socks proxy.
Hope the information above help you.
I have an external custom jar that I would like to use with Azure HDInsight Jupyter notebooks; the Jupyter notebooks in HDI use Spark Magic and Livy.
Within the first cell of the notebook, I'm trying to use the jars configuration:
%%configure -f
{"jars": ["wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar"]}
But the error message I receive is:
Starting Spark application
The code failed because of a fatal error:
Status 'shutting_down' not supported by session..
Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context. For instructions on how to assign resources see http://go.microsoft.com/fwlink/?LinkId=717038
b) Contact your cluster administrator to make sure the Spark magics library is configured correctly.
Current session configs: {u'jars': [u'wasb://$container$#$account#.blob.core.windows.net/folder/my-custom-jar.jar'], u'kind': 'spark'}
An error was encountered:
Status 'shutting_down' not supported by session.
I'm wondering if I'm just not understanding how Livy works in this case as I was able to successfully include a spark-package (GraphFrames) on the same cluster:
%%configure -f
{ "conf": {"spark.jars.packages": "graphframes:graphframes:0.3.0-spark2.0-s_2.11" }}
Some additional references that may be handy (just in case I missed something):
Jupyter notebooks kernels with Apache Spark clusters in HDInsight
Livy Documentation
Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
Oh, I was able to figure it out and forgot to update my question. This can work if you put the jar in the default storage account of your HDI cluster.
HTH!
in case people come here for adding jars on EMR.
%%configure -f
{"name": "sparkTest", "conf": {"spark.jars": "s3://somebucket/artifacts/jars/spark-avro_2.11-2.4.4.jar"}}
contrary to the document, use jars directly won't work.