Spark Yarn-client mode across network via VPN - apache-spark

I have been trying to get Spark yarn-client mode working through a VPN. More specifically, the Spark driver is launched locally from my laptop, while the YARN cluster sits in its own private network, reachable through a non-bridged VPN.
The first challenge was to make the Spark driver service reachable from the YARN cluster, since the VPN is one-way and my laptop is not routable from the cluster.
I managed to get this working by adding an entry in /etc/hosts to point a public domain name to my local network IP, something like
192.168.0.6 spark.driver.mydomain
Then I set spark.driver.host=spark.driver.mydomain.
Now the Spark driver can successfully bind to spark.driver.mydomain and tell the YARN application master to connect to spark.driver.mydomain. I also needed to point spark.driver.mydomain at my public IP by modifying my domain's DNS records, and to configure my firewall to make the service publicly reachable.
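For reference, the submit invocation then looks roughly like this (the class and jar names are placeholders; on newer Spark versions the equivalent is --master yarn --deploy-mode client):
./bin/spark-submit --master yarn-client \
  --conf spark.driver.host=spark.driver.mydomain \
  --class com.example.MyApp my-app.jar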
Now I can run Spark from my laptop to drive the cluster, so I am almost there. However, the SparkUI doesn't work. There is no way to connect to the SparkUI, even though the log message says it started successfully at spark.driver.mydomain:4040. I opened all the ports through my local network's firewall using DMZ, and I also tried the local network IP address. I can see the request being redirected to the YARN resource manager's link, http://resourcemanager/proxy/application_id, but it just times out eventually, and I haven't figured out how the proxy mechanism works.
The spark session also occasionally spits out warning messages like
WARN ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkExecutor@executor:port] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
The basic Spark actions all work despite the warning messages.
There are still quite a few concerns and questions:
Does the communication between the Spark driver and the YARN cluster carry unencrypted data in this scenario? Are there any data security concerns (assuming the VPN itself is secure)?
SparkUI is not accessible, which is intolerable.
Warning messages
Is it really good practice to run the driver from a remote network in yarn-client mode? There are certainly benefits to doing so, but is the framework designed for this?
Finally, here is a JIRA issue that may lead to more general solutions. https://issues.apache.org/jira/browse/SPARK-5113

Related

How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
I have the feeling I should change SPARK_LOCAL_IP to the host IP, but if I do so, the SparkUI is unable to start, which blocks the process one step earlier.
Thanks in advance for any ideas / advice!
Good question! First of all, as you know, Apache Zeppelin runs interpreters in separate processes.
In your case, the Spark interpreter JVM process hosts the SparkContext and serves as the Spark driver for the yarn-client deploy mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate back and forth with the YARN ApplicationMaster and all the Spark worker machines of the cluster.
This implies that you have to have a number of ports open and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get the job done.
Another approach is to run Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug).
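A rough sketch of the host-networking variant on Linux (the image name is a hypothetical placeholder; with --net=host the container shares the host's network stack, so the driver binds to an address the cluster can actually reach and no per-port -p forwarding is needed):
docker run --net=host \
  -e SPARK_LOCAL_IP=$(hostname -i) \
  my-zeppelin-image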

Most common Spark error

I have a Spark standalone cluster which is doing nothing. It has the following properties:
spark.executor.memory 5g
spark.driver.memory 5g
spark.cores.max 10
spark.deploy.defaultCores 5
And I have an app which creates a SparkContext (pointing to my cluster) and then applies some action on an RDD. It fails after the first action with this extremely popular error:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
OK. As I understand it, this error appears when you ask for more cores/memory than the cluster can provide. That would make sense, except I do not request any resources in my app (I specify neither --executor-memory nor --total-executor-cores). So what can it be?
PS: The cluster seems to be fine, because I can submit a jar through ./bin/spark-submit and it works. But this app does not even appear in the "Running Applications" section of the master's web interface.
You can check your firewall settings.
The host firewall on the host where I ran my PySpark shell rejected the connection attempts back from the worker nodes.
After allowing all traffic between all nodes involved, the problem was resolved!
The driver host was another VM in the same OpenStack project,
so allowing all traffic between the VMs in the same project was OK to do security-wise.
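If allowing all traffic is not an option, an alternative (just a sketch; the port numbers are arbitrary) is to pin the otherwise random driver-side ports in spark-defaults.conf and open only those, plus the UI port, on the driver host's firewall:
spark.driver.port        40000
spark.blockManager.port  40001
spark.ui.port            4040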
Spark – How to fix “WARN TaskSchedulerImpl: Initial job has not accepted any resources”

Where is the Spark UI on Google Dataproc?

What port should I use to access the Spark UI on Google Dataproc?
I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln.
Firewall is properly configured.
Dataproc runs Spark on top of YARN, so you won't find the typical "Spark standalone" ports; instead, when running a Spark job, you can visit port 8088 which will show you the YARN ResourceManager's main page. Any running Spark jobs will be accessible through the Application Master link on that page. The Spark Application Master's page looks the same as the familiar Spark-standalone landing page that you would normally find on port 8080 for default Spark setups.
Since workers check in over the internal network, YARN's links will be using cluster-internal hostnames (the hostnames should include your Dataproc cluster name as a prefix), but this means if you're accessing from the outside network, the links may not work at first; you have to replace the hostname with the external IP address if you're using the firewall-based approach.
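One way to do that hostname substitution (a sketch; the IPs and instance names below are placeholders) is to map the cluster-internal hostnames to the instances' external IP addresses in your local /etc/hosts:
203.0.113.10 my-cluster-m
203.0.113.11 my-cluster-w-0
203.0.113.12 my-cluster-w-1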
An easier experience will be to use the SOCKS proxy approach as explained here: https://cloud.google.com/dataproc/cluster-web-interfaces
In that case, simply using gcloud compute ssh to run a lightweight local socks proxy and then opening a browser pointed at that will let you click all the YARN links as normal.
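For example (the cluster name, zone, and browser invocation are placeholders, following the linked guide):
gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N
google-chrome --proxy-server="socks5://localhost:1080" \
  --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
  --user-data-dir=/tmp/my-cluster-m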
When following the instructions in Dennis's answer, I found that I could not connect to ports 8080 or 8088 with Dataproc image v1.0.
The open ports on the master node suggested using 18080, which I did, following the documentation for port 18080, and voilà: access to the web UI.
Since I had public addresses on my Dataproc cluster, I created a firewall rule in the Cloud Console allowing my corporate subnet to reach the Dataproc instances on ports 8088 (YARN ResourceManager) and 8042 (NodeManager web app).
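The equivalent with the gcloud CLI would look roughly like this (the rule name and source range are placeholders):
gcloud compute firewall-rules create allow-dataproc-webui \
  --allow=tcp:8088,tcp:8042 \
  --source-ranges=203.0.113.0/24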

How to run PySpark (possibly in client mode) on Mesosphere cluster?

I am trying to run a PySpark job on a Mesosphere cluster but I cannot seem to get it to run. I understand that Mesos does not support cluster deploy mode for PySpark applications and that it needs to be run in client mode. I believe this is where the problem lies.
When I try submitting a PySpark job I am getting the output below.
... socket.hpp:107] Shutdown failed on fd=48: Transport endpoint is not connected [107]
I believe that a Spark job running in client mode needs to connect to the nodes directly, and this is being blocked?
What configuration would I need to change to be able to run a PySpark job in client mode?
When running PySpark in client mode (meaning the driver is running where you invoke Python) the driver becomes the Mesos Framework. When this happens, the host the framework is running on needs to be able to connect to all nodes in the cluster, and they need to be able to connect back, meaning no NAT.
If this is indeed the cause of your problems, there are two environment variables that might be useful. If you can get a VPN in place, you can set LIBPROCESS_IP and SPARK_LOCAL_IP both to the IP of the host machine that cluster nodes can use to connect back to the driver.
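A minimal sketch of what that could look like, assuming a hypothetical VPN address of 10.8.0.6 that the cluster nodes can route back to and a placeholder Mesos master URL:
export LIBPROCESS_IP=10.8.0.6
export SPARK_LOCAL_IP=10.8.0.6
./bin/spark-submit --master mesos://mesos-master:5050 my_job.py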

How to use Zookeeper with Azure HDInsight Linux cluster?

Obviously I need to start a zookeeper server on one of the cluster machines, then I need other client machines to connect to this server.
The way I did it: I used SSH to connect to the head node and found a ZooKeeper server running on port 2181. So I used ifconfig to get the machine's IP address (for example 10.0.0.8) and then had my worker nodes connect to:
10.0.0.8:2181.
However, my MR job now completes but runs slowly, and the output is not correct. I suspect that I'm doing something wrong with ZooKeeper, especially since I didn't follow a tutorial and improvised my steps.
HDInsight has multiple zookeeper servers. Not sure if specifying one might be the cause of the problem you are seeing.
I wrote an example a while back that uses Storm to write to HBase (both servers on the same Azure Virtual Network), and as part of the configuration I had to specify the three ZooKeeper servers for the component that writes to HBase. (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-sensor-data-analysis/ is the article.)
From the cluster head node, you can probably ping zookeeper0, zookeeper1, and zookeeper2 to find the IP address of each.
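Assuming those names resolve, a client connection string listing all three would look roughly like this (rather than the single 10.0.0.8:2181 address):
zookeeper0:2181,zookeeper1:2181,zookeeper2:2181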
