HA of spark streaming application on yarn - apache-spark

We are running a Spark Streaming application in yarn-cluster mode on a cluster that was set up with Cloudera.
We defined one of the nodes to be the Spark gateway, and we run the spark-submit command from that node.
We want to test the HA of our cluster, so we test what happens when different nodes crash (we stop them).
We saw that when we stop the driver node, the application still appears to be running but it doesn't do anything, and when looking at "yarn application -list" it still shows the stopped node as the driver node. When we bring the node back up the application resumes and the driver changes to another node, but this happens only once the node is back up. Shouldn't YARN move the driver to another node as soon as the driver node dies?
Another thing we saw is that if we kill the Spark gateway node, the application stops.
How can we run the application so that it doesn't have any single point of failure?
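For reference, the kind of command we run from the Spark gateway node looks roughly like this (the class name, jar name, and resource values are placeholders, not our exact settings):
# Submitted from the Spark gateway node. In yarn-cluster mode the driver runs
# inside the YARN application master, on whichever node YARN chooses.
# Class, jar, and resource values below are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.StreamingApp \
  --num-executors 4 \
  --executor-memory 2g \
  streaming-app.jar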

Related

Can I specify a certain machine to be the driver in spark on yarn?

The question is exactly what is specified in the title.
I want to start my driver program on 192.168.1.1, but when I submit my Spark application to YARN, YARN will choose a random machine to be the driver of my application.
Can I choose the driver manually in yarn cluster mode?
The supposed duplicate question's answer won't work on YARN.
As Yaron replied earlier, with YARN as the master you have two options:
client
cluster
If you select cluster mode, you let YARN decide where the driver is spawned, based on resource availability in YARN. If you select client mode, the driver is spawned in the client process, on the server where you ran spark-submit.
So, a solution for your problem would be to run the command
spark-submit --master yarn --deploy-mode client ...
on the machine you want the driver to run on (a fuller sketch follows the checklist below).
Make sure that:
the machine has the resources to host the driver,
the resources you want to give to the driver are not also committed to YARN,
there is a Spark Gateway role (for Cloudera Manager) on that machine.
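For example, a client-mode submission along these lines (the memory value, class name, and jar name are placeholders to adjust for your own application):
# Run this on the machine that should host the driver; in client mode the
# driver stays inside this spark-submit process, outside of YARN's control.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 4g \
  --class com.example.MyApp \
  my-app.jar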
If you want to use a specific machine as the driver, you should use YARN Client mode
Spark docs - Launching Spark on YARN:
There are two deploy modes that can be used to launch Spark
applications on YARN. In cluster mode, the Spark driver runs inside an
application master process which is managed by YARN on the cluster,
and the client can go away after initiating the application. In client
mode, the driver runs in the client process, and the application
master is only used for requesting resources from YARN.
In YARN Client mode - the driver runs in the client process (you can choose the driver machine; it is the machine which executes the spark-submit command).
In YARN Cluster mode - the Spark driver runs inside an application master process which is managed by YARN on the cluster.
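As an illustration, the two submissions differ only in the --deploy-mode flag (the class and jar names are placeholders):
# Cluster mode: the driver runs inside the application master,
# on a node chosen by YARN; you do not control which one.
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

# Client mode: the driver runs in this spark-submit process,
# i.e. on the machine where you type the command.
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar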

Control which worker node gets the driver program in deploy mode cluster

I am running multiple spark-submit applications using --deploy-mode cluster. However, it seems that only one node in the cluster is getting the driver program for each application. As a result, this node's memory gets filled up pretty fast.
Is there a way to specify which node in the cluster is assigned the driver program for each application launched using spark-submit?

Setting Driver manually in Spark Submit over Yarn Cluster

I noticed that when I start a job with spark-submit on YARN, the driver and executor nodes get chosen randomly. Is it possible to set this manually, so that when I collect the data and write it to a file, it is written on the same node every single time?
As of right now, the parameters I have tried playing around with are:
spark.yarn.am.port <driver-ip-address>
and
spark.driver.hostname <driver-ip-address>
Thanks!
If you submit to Yarn with --master yarn --deploy-mode client, the driver will be located on the node you are submitting from.
Also, you can configure node labels for executors using the property spark.yarn.executor.nodeLabelExpression:
A YARN node label expression that restricts the set of nodes executors will be scheduled on. Only versions of YARN greater than or equal to 2.6 support node label expressions, so when running against earlier versions, this property will be ignored.
Docs - Running Spark on YARN - Latest Documentation
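For instance, assuming a YARN node label named spark_nodes has already been configured on the cluster (the label, class, and jar names here are placeholders):
# Driver stays on the submitting machine (client mode); executors are
# restricted to nodes carrying the YARN node label "spark_nodes".
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.executor.nodeLabelExpression="spark_nodes" \
  --class com.example.MyApp \
  my-app.jar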
A Spark application on YARN can run in either yarn-cluster or yarn-client mode.
In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client machine can go away after initiating the application.
In yarn-client mode, the driver runs in the client
process, and the application master is only used for requesting resources from YARN.
So, as you can see, where Spark places the Application Master depends on the deploy mode; nothing is random up to this stage. However, the worker nodes on which the Application Master asks the resource manager to run tasks are picked based on worker-node availability, so their selection is effectively random from your perspective.

Spark: How to specify the IP for the driver program to run

I am having an issue configuring a specific Spark node as the driver in my cluster. I have a standalone-mode cluster. Every time the master restarts, I see that one of the nodes in the cluster is randomly picked to run the driver program. Because of this, I am forced to deploy my JAR on all the nodes in my cluster.
If I could specify the IP on which the driver program runs, I would need to deploy the JAR on only one node.
I'd appreciate any help.
If you want to run the driver from a particular node you can use:
--deploy-mode client
With this option, the driver program will always run on the machine from which you run spark-submit.
For more information:
http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
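For example, for a standalone cluster (the master URL, class, and jar names below are placeholders):
# Run this on the node that should host the driver; with --deploy-mode client
# the driver lives inside this spark-submit process, so the JAR only needs to
# be present on this one machine.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar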

Spark streaming fails to launch worker on worker failure

I'm trying to set up a Spark cluster and I've come across an annoying bug...
When I submit a Spark application, it runs fine on the workers until I kill one (for example by using stop-slave.sh on the worker node).
When the worker is killed, Spark then tries to relaunch an executor on an available worker node, but it fails every time (I know because the web UI displays either FAILED or LAUNCHING for the executor; it never succeeds).
I can't seem to find any help, even in the documentation, so can someone assure me that Spark can and will try to relaunch a worker on an available node if one is killed (on the same node where the worker previously ran, or on another available node if the node where it previously ran is unreachable)?
Here's the output from the worker node:
Spark worker error
Thank you for your help !
