Spark application behind a NAT using YARN cluster mode

In client deploy mode a Spark driver needs to be able to receive incoming TCP connections from Spark executors. However, if the Spark driver is behind a NAT, it cannot receive incoming connections. Will running the Spark driver in YARN cluster deploy mode overcome this limitation of being behind a NAT, because the Spark driver is then apparently executed on the Spark master?

Yes, it will. Another possible approach is to configure
spark.driver.port
spark.driver.bindAddress
and create an SSH tunnel to one of the nodes.
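A minimal sketch of that tunnel approach (the gateway host, port 35000, and app.jar are placeholders, not from the original answer): bind the driver on a fixed port, advertise a cluster-reachable host, and reverse-forward that port over SSH:

# Reverse-forward the driver port from a cluster-reachable gateway back to this machine;
# the gateway's sshd needs GatewayPorts enabled for other hosts to use the forward
ssh -N -R 35000:localhost:35000 user@gateway.example.com

spark-submit \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=35000 \
  --conf spark.driver.host=gateway.example.com \
  app.jar

Executors also fetch data from the driver's block manager, so a complete setup would tunnel spark.driver.blockManager.port the same way.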

Related

How to run Apache Spark on all available nodes to test network connectivity

I have a requirement to validate network reachability from all nodes in a Cloudera CDH 6.3 cluster, to ensure that hosts on other networks can be reached from every node in the cluster.
Is it possible to make Spark run on all nodes, via some spark-submit configuration, so that I can open TCP connections from every node to another network's host and port?
The number of executors would control the fan-out, i.e. how many of the cluster's nodes run the job:
spark-submit --num-executors=N
But I would use Ansible for a much simpler mass telnet/nc port check, as in the sketch below.
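A hedged illustration of that simpler check (the inventory group cdh_nodes and the target host/port are placeholders): Ansible's wait_for module makes every targeted node try to open the TCP connection itself:

# From the Ansible control host: every node in the cdh_nodes group tries to reach target:443
ansible cdh_nodes -m wait_for -a "host=target.example.com port=443 timeout=5"

The command succeeds per node once the port accepts a connection, so a failing node shows up directly in the output.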

How to run Spark driver in HA mode?

I have a Spark driver submitted to a Mesos cluster (with highly-available Mesos masters) in client mode (see this for client deploy mode).
I'd like to run the Spark driver in HA mode, too. How?
I could implement this myself, but for now I'm looking for anything already available.
tl;dr Use cluster deploy mode with --supervise, e.g. spark-submit --deploy-mode cluster --supervise
Making the Spark driver highly available in client mode is not possible, as described in the cited document:
In client mode, a Spark Mesos framework is launched directly on the client machine and waits for the driver output.
You'd have to monitor the process on the client machine yourself, perhaps checking its exit code.
A much safer solution is to let Mesos do its job: use cluster deploy mode, in which Mesos itself makes sure the driver runs (and gets restarted when it goes down). See the section Cluster mode:
Spark on Mesos also supports cluster mode, where the driver is launched in the cluster and the client can find the results of the driver from the Mesos Web UI.
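A fuller sketch of that submission (the dispatcher address, class name, and jar URL are placeholders): in Mesos cluster mode you submit through the MesosClusterDispatcher, and --supervise makes it restart the driver when it fails:

spark-submit \
  --master mesos://dispatcher.example.com:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.Main \
  http://repo.example.com/my-app.jar

Note that in this mode the jar must sit at a URL the cluster itself can fetch (e.g. http:// or hdfs://), since the driver is launched on a cluster node rather than on your machine.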

Can driver process run outside of the Spark cluster?

I read this in an answer to "What conditions should cluster deploy mode be used instead of client?":
(In client mode) You could run spark-submit on your laptop, and the Driver Program would run on your laptop.
Also, the Spark docs say:
In client mode, the driver is launched in the same process as the client that submits the application.
Does it mean that I can submit Spark tasks from any machine, as long as it is reachable from the master and has a Spark environment?
Or in other words, can driver process run outside of the Spark cluster?
Yes, the driver can run on your laptop. Keep in mind though:
The Spark driver will need the Hadoop configuration to be able to talk to YARN and HDFS. You could copy it from the cluster and point to it via HADOOP_CONF_DIR.
The Spark driver will listen on a lot of ports and expect the executors to be able to connect to it. It will advertise the hostname of your laptop. Make sure it can be resolved and that all the ports can be accessed from the cluster environment.
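A minimal sketch of such a laptop-side submission (the path, host name, ports, and jar are assumptions for illustration): pinning the advertised host and ports makes it feasible to open just those in a firewall:

# Hadoop/YARN client configuration copied from the cluster (placeholder path)
export HADOOP_CONF_DIR=$HOME/cluster-conf

spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.host=laptop.example.com \
  --conf spark.driver.port=35001 \
  --conf spark.driver.blockManager.port=35002 \
  app.jar

spark.driver.host must resolve to the laptop from the cluster nodes, not just locally.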
Yes, I'm running spark-submit jobs over the LAN using the --deploy-mode cluster option. I'm currently running into this issue, however: the server response (a JSON object) isn't very descriptive.

In Spark's client mode, does the driver need network access to remote executors?

When using Spark in client mode (e.g. yarn-client), does the local machine that runs the driver communicate directly with the cluster worker nodes that run the remote executors?
If yes, does that mean the machine that runs the driver needs network access to the worker nodes? So the master node requests resources from the cluster and returns the IP addresses/ports of the worker nodes to the driver, so that the driver can initiate communication with the worker nodes?
If not, how does client mode actually work?
And if yes, does it mean that client mode won't work if the cluster is configured so that the worker nodes are not visible outside the cluster, and one has to use cluster mode instead?
Thanks!
The driver connects to the Spark master and requests a context, and the Spark master then passes the Spark workers the driver's details so they can communicate with it and get instructions on what to do.
This means that the driver node must be reachable on the network by the workers, and its IP must be one that's visible to them (i.e. if the driver is behind a NAT while the workers are in a different network, it won't work, and you'll see errors on the workers saying they fail to connect to the driver).
When you run Spark in client mode, the driver process runs locally.
In cluster mode, it runs remotely on an ApplicationMaster.
In other words, you will need all the nodes to be able to see each other. The Spark driver definitely needs to communicate with all the worker nodes. If this is a problem, try the yarn-cluster mode; the driver will then run inside your cluster on one of the nodes.
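A minimal sketch of that fallback (app.jar is a placeholder); --master yarn --deploy-mode cluster is the current spelling of the deprecated yarn-cluster master, and the driver then runs inside the YARN ApplicationMaster on a cluster node:

spark-submit --master yarn --deploy-mode cluster app.jar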

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python), the driver becomes the Mesos framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back; in other words, no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, set both LIBPROCESS_IP and SPARK_LOCAL_IP to the IP of the host machine that cluster nodes can use to connect back to the driver.
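A minimal sketch of that setup (the VPN address, master URL, and script name are placeholders): export both variables before submitting, so the driver registers with an address the cluster can route back to:

# Address of the driver host on the VPN, reachable from the cluster nodes (placeholder)
export LIBPROCESS_IP=10.8.0.5
export SPARK_LOCAL_IP=10.8.0.5

spark-submit --master mesos://mesos-master.example.com:5050 my_job.py

LIBPROCESS_IP controls the address the Mesos framework binds and advertises, while SPARK_LOCAL_IP does the same for Spark's own services.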
