Running a Spark application on YARN, without spark-submit - apache-spark

I know that Spark applications can be executed on YARN using spark-submit --master yarn.
The question is:
Is it possible to run a Spark application on YARN using the yarn command?
If so, the YARN REST API could be used as an interface for running Spark and MapReduce applications in a uniform way.

I see this question is a year old, but for anyone else who stumbles across it: this should be possible now. I've been trying to do something similar and have been following the "Starting Spark jobs directly via YARN REST API" tutorial from Hortonworks.
Essentially what you need to do is upload your jar to HDFS, create a Spark job JSON file per the YARN REST API documentation, and then use a curl command to start the application. An example of that command is:
curl -s -i -X POST -H "Content-Type: application/json" ${HADOOP_RM}/ws/v1/cluster/apps \
--data-binary @spark-yarn.json
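For reference, here is a rough sketch of the same submission done from Python. It assumes spark-yarn.json follows the Cluster Applications API format (an application-id obtained from the new-application endpoint, an am-container-spec with the launch command, and so on); the exact payload needed to launch a Spark ApplicationMaster is involved, so treat this as illustrative only, and note that HADOOP_RM and the file name are placeholders.

# Illustrative sketch: submit a prepared spark-yarn.json payload to the YARN
# ResourceManager REST API. Assumes HADOOP_RM points at the RM web address
# (e.g. http://rm-host:8088) and that spark-yarn.json is a valid submission body.
import json
import os
import requests

rm = os.environ["HADOOP_RM"]  # e.g. "http://rm-host:8088"

# Ask the ResourceManager for a fresh application id.
new_app = requests.post(rm + "/ws/v1/cluster/apps/new-application").json()
print("Allocated application id:", new_app["application-id"])

# Load the prepared submission payload and patch in the new application id.
with open("spark-yarn.json") as f:
    payload = json.load(f)
payload["application-id"] = new_app["application-id"]

# Submit the application; a 202 Accepted response means YARN accepted it.
resp = requests.post(
    rm + "/ws/v1/cluster/apps",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
print(resp.status_code, resp.text)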

Just like all YARN applications, Spark implements a Client and an ApplicationMaster when deploying on YARN. If you look at the implementation in the Spark repository, you'll get a clue as to how to create your own Client/ApplicationMaster:
https://github.com/apache/spark/tree/master/yarn/src/main/scala/org/apache/spark/deploy/yarn . But out of the box it does not seem possible.

I have not seen the latest package, but a few months back such a thing was not possible "out of the box" (this is info straight from Cloudera support). I know it's not what you were hoping for, but that's what I know.

Thanks for the question.
As suggested above, writing your own ApplicationMaster is a good route to submit an application without invoking spark-submit.
The community has built up around the spark-submit command for YARN, adding flags that ease the inclusion of the jars and/or configs etc. that are needed to get the application to execute successfully; see Submitting Applications.
An alternative you could try: run the Spark job as an action in an Oozie workflow; see the Oozie Spark Extension.
Depending on what you wish to achieve, either route looks good.
Hope it helps.

Related

Installing Apache Spark Packages to run Locally

I am looking for a clear guide or steps for installing Spark packages (specifically spark-avro) to run locally and use them correctly with the spark-submit command.
I've spent a lot of time reading many posts and guides, but I am still not able to get spark-submit to use the locally deployed spark-avro package. Hence, if someone has already accomplished this with spark-avro or another package, please share your wisdom :)
All the existing documentation I found is a bit unclear.
Clear steps and examples would be much appreciated! P.S. I know Python/PySpark/SQL, but not much Java (yet) ...
Michael
You can pass the Avro package details in the spark-submit command itself (make sure the spark-avro and Spark versions are compatible):
spark-submit --packages org.apache.spark:spark-avro_<scala_version>:<spark_version>
For example:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0
You can pass the same --packages option to the spark-shell command as well to work with Avro files.
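Once the package resolves, using it from PySpark is just a matter of the avro format. A minimal sketch, assuming the Apache spark-avro module for Spark 2.4 (where the format name is simply "avro") and placeholder paths:

# Minimal PySpark sketch: read and write Avro after launching with
#   spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 this_script.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-example").getOrCreate()

df = spark.read.format("avro").load("/tmp/input.avro")  # read Avro files
df.printSchema()

df.write.format("avro").mode("overwrite").save("/tmp/output_avro")  # write Avro back out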

Spark/k8s: How do I install Spark 2.4 on an existing kubernetes cluster, in client mode?

I want to install Apache Spark v2.4 on my Kubernetes cluster, but there does not seem to be a stable helm chart for this version. An older/stable chart (for v1.5.1) exists at
https://github.com/helm/charts/tree/master/stable/spark
How can I create/find a v2.4 chart?
Then: The reason for needing v2.4 is to enable client-mode, because I would like to be able to submit (PySpark/Jupyter notebook) jobs to the cluster from my laptop's dev environment. What extra steps are required to enable client-mode (including exposing the service)?
The closest attempt I have found so far (but for Spark v2.0.0), which I haven't yet got working, is at
https://github.com/Uninett/kubernetes-apps/tree/master/spark
At https://github.com/phatak-dev/kubernetes-spark (also two years old), there is nothing about jupyter deployment.
Pangeo-specific: https://discourse.jupyter.org/t/spark-integration-documentation/243
Related GitHub issue: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1030
I have searched for up-to-date resources on this but have found nothing that has everything in one place. I will update this question with other relevant links if and when people are able to point them out to me. Hopefully it will be possible to cobble together an answer.
As ever, huge thanks in advance.
Update:
https://github.com/SnappyDataInc/spark-on-k8s for v2.2 is extremely easy to deploy - looks promising...
See https://hub.helm.sh/charts/microsoft/spark. This is based off https://github.com/helm/charts/tree/master/stable/spark and uses Spark 2.4.6 with Hadoop 3.1. You can check the source for this chart at https://github.com/dbanda/charts. The Livy service makes it easy to submit Spark jobs via a REST API. You can also submit jobs using Zeppelin. We made this chart as an alternative way to run Spark on K8s without using the spark-submit k8s mode. I hope it helps.
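As an illustration of the Livy route, submitting a batch job is a single REST call. A rough sketch, assuming the chart exposes the Livy service at a hypothetical livy-host:8998 and that the application file is already reachable by the cluster:

# Rough sketch: submit a PySpark batch to Livy's REST API and poll its state.
# The Livy URL and file path are placeholders.
import time
import requests

livy = "http://livy-host:8998"

batch = requests.post(
    livy + "/batches",
    json={
        "file": "local:///opt/spark/examples/src/main/python/pi.py",  # app to run
        "args": ["100"],
        "conf": {"spark.executor.instances": "2"},
    },
).json()

# Poll until the batch reaches a terminal state.
while True:
    state = requests.get(livy + "/batches/{}/state".format(batch["id"])).json()["state"]
    print("batch state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)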

Is there a way to submit spark job on different server running master

We have a requirement to schedule Spark jobs. Since we are familiar with Apache Airflow, we want to go ahead with it to create different workflows. I searched the web but did not find a step-by-step guide to schedule Spark jobs on Airflow, or an option to run them on a different server from the one running the master.
Answer to this will be highly appreciated.
Thanks in advance.
There are three ways you can remotely submit Spark jobs using Apache Airflow:
(1) Using SparkSubmitOperator: This operator expects you to have a spark-submit binary and YARN client config set up on your Airflow server. It invokes the spark-submit command with the given options, blocks until the job finishes and returns the final status. The good thing is, it also streams the logs from the spark-submit command's stdout and stderr.
You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work.
Once an ApplicationMaster is deployed within YARN, Spark runs local to the Hadoop cluster.
If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted from Airflow as well (if that's possible), but otherwise at least the hdfs-site.xml should be picked up from the YARN container classpath.
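A minimal DAG sketch for this option. The import path varies by Airflow version (airflow.contrib.* in 1.x, airflow.providers.apache.spark.* in 2.x); the connection id and application path here are assumptions:

# Minimal Airflow DAG sketch using SparkSubmitOperator (Airflow 1.x import path).
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG("spark_submit_example", start_date=datetime(2019, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_job",
        application="/path/to/stream.py",  # placeholder application path
        conn_id="spark_default",           # Spark connection whose master is set to yarn
        name="airflow-spark-job",
        verbose=True,
    )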
(2) Using SSHOperator: Use this operator to run bash commands such as spark-submit on a remote server (using the SSH protocol via the Paramiko library). The benefit of this approach is that you don't need to copy the hdfs-site.xml or maintain any files.
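A sketch of the SSH route, assuming an 'ssh_spark_edge' connection that points at an edge node with spark-submit and the YARN client configs installed (Airflow 1.x import path):

# Minimal Airflow DAG sketch using SSHOperator to run spark-submit on an edge node.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator

with DAG("spark_over_ssh", start_date=datetime(2019, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    submit_over_ssh = SSHOperator(
        task_id="submit_over_ssh",
        ssh_conn_id="ssh_spark_edge",  # hypothetical SSH connection to the edge node
        command="spark-submit --master yarn --deploy-mode cluster /path/to/stream.py",
    )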
(3) Using SimpleHttpOperator with Livy: Livy is an open source REST interface for interacting with Apache Spark from anywhere. You just need to make REST calls.
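A sketch of the Livy option via SimpleHttpOperator, assuming a 'livy_http' HTTP connection that points at the Livy server and a payload following Livy's POST /batches API:

# Minimal Airflow DAG sketch posting a batch job to Livy through SimpleHttpOperator.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

with DAG("spark_via_livy", start_date=datetime(2019, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    submit_batch = SimpleHttpOperator(
        task_id="submit_batch",
        http_conn_id="livy_http",  # hypothetical connection to http://livy-host:8998
        endpoint="batches",
        method="POST",
        headers={"Content-Type": "application/json"},
        data=json.dumps({"file": "hdfs:///apps/stream.py",
                         "conf": {"spark.executor.instances": "2"}}),
    )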
I personally prefer SSHOperator :)

How to make sure Spark master node is using the worker nodes? (Google cluster)

I just created a Google Cloud cluster (1 master and 6 workers), and Spark is configured by default.
I have pure Python code that uses NLTK to build the dependency tree for each line of a text file. When I run this code on the master with spark-submit run.py, I get the same execution time as when I run it on my own machine.
How can I make sure that the master is using the workers in order to reduce the execution time?
You can check the Spark UI. If it's running on top of YARN, open the YARN UI and click on your application id, which will open the Spark UI. Check under the Executors tab; it will also show the node IP addresses.
Could you please share your spark-submit config?
Your command spark-submit run.py doesn't seem to send your job to YARN. To do that, you need to add the --master parameter. For example, a valid command to execute a job on YARN is:
./bin/spark-submit --master yarn python/pi.py 1000
If you execute your job from the master, the execution will be straightforward. Anyway, check this link for the other parameters that spark-submit accepts.
For a Dataproc cluster (Google's managed Hadoop cluster) you have two options to check the job history, including jobs that are currently running:
By command line from the master: yarn application -list. This option sometimes needs additional configuration; if you have trouble, this link will be useful.
By UI. Dataproc enables you to access the Spark web UI, which improves monitoring tasks. Check this link to learn how to access the Spark UI and other Dataproc UIs. In summary, you have to create a tunnel and configure your browser to use a SOCKS proxy.
Hope the information above helps you.

Running spark streaming forever on production

I am developing a Spark Streaming application which basically reads data off Kafka and saves it periodically to HDFS.
I am running PySpark on YARN.
My question is more for production purpose. Right now, I run my application like this:
spark-submit stream.py
Imagine you are going to deliver this Spark Streaming application (in Python) to a client: what would you do in order to keep it running forever? You wouldn't just hand over the file and say "run this in the terminal"; that's too unprofessional.
What I want to do is submit the job to the cluster (or to local processors) and never have to watch logs on the console, or resort to a workaround like Linux screen to keep it running in the background (because that seems too unprofessional).
What is the most professional and efficient way to permanently submit a spark-streaming job to the cluster ?
I hope I was unambiguous. Thanks!
You could use spark-jobserver, which provides a REST interface for uploading your jar and running it. You can find the documentation here: spark-jobserver.
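A rough sketch of what that looks like, assuming a spark-jobserver instance at a hypothetical localhost:8090 and the classic /jars and /jobs endpoints; check the project documentation for the exact API of your version:

# Rough sketch: upload a job jar to spark-jobserver and trigger a run via its REST API.
# The host, app name, jar path and classPath are placeholders.
import requests

jobserver = "http://localhost:8090"

# 1) Upload the application jar under an app name.
with open("my-streaming-job.jar", "rb") as f:
    requests.post(jobserver + "/jars/my-app", data=f.read())

# 2) Start a job from a class inside that jar (asynchronous by default; the
#    response contains a job id you can poll under /jobs/<id>).
resp = requests.post(
    jobserver + "/jobs",
    params={"appName": "my-app", "classPath": "com.example.MyStreamingJob"},
    data="input.config = value",  # job configuration passed as the request body
)
print(resp.json())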
