Where is the Spark UI on Google Dataproc? - apache-spark

What port should I use to access the Spark UI on Google Dataproc?
I tried ports 4040 and 7077, as well as a bunch of other ports I found using netstat -pln.
The firewall is properly configured.

Dataproc runs Spark on top of YARN, so you won't find the typical "Spark standalone" ports; instead, when running a Spark job, you can visit port 8088 on the master, which shows the YARN ResourceManager's main page. Any running Spark jobs will be accessible through the Application Master link on that page. The Spark Application Master's page looks the same as the familiar Spark standalone landing page that you would normally find on port 8080 for default Spark setups.
Since workers check in over the internal network, YARN's links use cluster-internal hostnames (the hostnames should include your Dataproc cluster name as a prefix). This means that if you're accessing the UI from an outside network, the links may not work at first; with the firewall-based approach you have to replace the hostname with the node's external IP address.
An easier experience is to use the SOCKS proxy approach explained here: https://cloud.google.com/dataproc/cluster-web-interfaces
In that case, simply use gcloud compute ssh to run a lightweight local SOCKS proxy, then open a browser pointed at that proxy and you can click through all the YARN links as normal.
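For reference, the linked page boils down to something like the following two commands; the cluster name, zone, proxy port, and browser choice are placeholders for your own values:

# Start a lightweight SOCKS proxy over SSH to the Dataproc master
# ("my-cluster-m" and the zone are placeholders).
gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N

# In a second terminal, launch a browser that sends all traffic through
# the proxy so the cluster-internal YARN hostnames resolve correctly.
google-chrome --proxy-server="socks5://localhost:1080" --user-data-dir=/tmp/my-cluster-m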

When following the instructions in Dennis's answer, I found that I could not connect to ports 8080 or 8088 for Dataproc image v1.0.
The open ports on the master node suggested trying 18080 (the Spark history server's default port), and following the documentation for that port did the trick: voilà, access to the web UI.
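In case it helps others, listing the listening ports on the master made the right one easy to spot (the cluster name and zone are placeholders):

gcloud compute ssh my-cluster-m --zone=us-central1-a
# then, on the master:
sudo netstat -plnt | grep -E '8088|18080'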

Since the instances in my Dataproc cluster had public addresses, I created a firewall rule in the Cloud Console allowing my corporate subnet to reach the Dataproc instances on ports 8088 (YARN ResourceManager) and 8042 (NodeManager web UI).
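The same rule can also be created from the command line; the rule name, network, and source range below stand in for my actual values:

# Allow the corporate subnet to reach the YARN RM and NM web UIs.
gcloud compute firewall-rules create allow-dataproc-web-uis \
    --network=default \
    --source-ranges=203.0.113.0/24 \
    --allow=tcp:8088,tcp:8042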

Related

How to expose Spark Driver behind dockerized Apache Zeppelin?

I am currently building a custom docker container from a plain distribution with Apache Zeppelin + Spark 2.x inside.
My Spark jobs will run in a remote cluster and I am using yarn-client as master.
When I run a notebook and try to print sc.version, the program gets stuck. If I go to the remote resource manager, an application has been created and accepted but in the logs I can read:
INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable
My understanding of the situation is that the cluster is unable to talk to the driver in the container but I don't know how to solve this issue.
I am currently using the following configuration:
spark.driver.port set to PORT1 and option -p PORT1:PORT1 passed to the container
spark.driver.host set to 172.17.0.2 (ip of the container)
SPARK_LOCAL_IP set to 172.17.0.2 (ip of the container)
spark.ui.port set to PORT2 and option -p PORT2:PORT2 passed to the container
I have the feeling I should change SPARK_LOCAL_IP to the host IP, but if I do so, the Spark UI is unable to start, which blocks the process one step earlier.
Thanks in advance for any ideas or advice!
Good question! First of all, as you know, Apache Zeppelin runs interpreters as separate processes.
In your case, the Spark interpreter JVM process hosts a SparkContext and serves as the Spark driver for the yarn-client deployment mode. According to the Apache Spark documentation, this process inside the container needs to be able to communicate back and forth with the YARN ApplicationMaster and all of the cluster's worker machines.
This implies that you have to have a number of ports open and manually forwarded between the container and the host machine. Here is an example of a project at ZEPL doing a similar job, where it took us 7 ports to get it working.
Another approach would be to run Docker networking in host mode (though it apparently does not work on OS X, due to a recent bug).
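A rough sketch of the port-forwarding approach; the image name and port numbers are made up (8080 for the Zeppelin UI, 4040 for spark.ui.port, 30000/30001 for the driver and block-manager ports in this example), and the exact set of ports depends on your Spark version and settings:

# Publish the same ports the driver is configured to use, so that the YARN
# ApplicationMaster and executors can reach back into the container.
docker run -d \
    -p 8080:8080 \
    -p 4040:4040 \
    -p 30000:30000 \
    -p 30001:30001 \
    my-zeppelin-spark-image

# The host-networking alternative mentioned above (Linux hosts only):
docker run -d --net=host my-zeppelin-spark-image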

Addressing issues with Apache Spark application run in Client mode from Docker container

I'm trying to connect to Standalone Apache Spark cluster from a dockerized Apache Spark application using Client mode.
The driver gives the Spark master and the workers its address. When run inside a Docker container, it will use some_docker_container_ip. That address is not visible from outside, so the application won't work.
Spark has a spark.driver.host property. This property is passed to the master and the workers. My initial instinct was to pass the host machine's address there so the cluster would address the visible machine instead.
Unfortunately, spark.driver.host is also used by the driver to set up its server. Passing a host machine address there causes server startup errors, because a Docker container cannot bind to the host machine's address.
It seems like a lose-lose situation: I can use neither the host machine address nor the Docker container address.
Ideally I would like to have two properties. The spark.driver.host-to-bind-to used to set up the driver server and the spark.driver.host-for-master which would be used by Master and Workers. Unfortunately it seems like I'm stuck with one property only.
Another approach would be to use --net=host when running a docker container. This approach has many disadvantages (e.g. other docker containers cannot get linked to a container with the --net=host on and must be exposed outside of the docker network) and I would like to avoid it.
Is there any way I could solve the driver-addressing problem without exposing the docker containers?
This problem is fixed in https://github.com/apache/spark/pull/15120
It will be part of the Apache Spark 2.1 release.
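If I read the pull request correctly, it adds a separate spark.driver.bindAddress setting, so on 2.1+ a client-mode launch from inside a container should look roughly like this (all addresses, ports, class, and jar names below are placeholders):

# Advertise the host machine's address to the cluster while binding inside
# the container; the published container ports must match driver.port and
# blockManager.port.
spark-submit \
    --master spark://spark-master:7077 \
    --deploy-mode client \
    --conf spark.driver.host=203.0.113.10 \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.port=35000 \
    --conf spark.blockManager.port=35001 \
    --class com.example.MyApp \
    my-application.jar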

Accessing Spark Web UI from another place than where the job actually ran

I have a Spark cluster with 1 master and 9 nodes, running in standalone mode. I do not have access to a web browser from any of the nodes in the cluster (I am connecting to the nodes through ssh -- it is a Grid'5000 cluster).
I was wondering, is there any possibility to access the Spark web UI in this case? I tried copying the logs from SPARK_PATH/work on my cluster to my local machine (to give the impression that the jobs that ran on the cluster had been run locally). This idea came after reading this part of the documentation:
If an application has logged events over the course of its lifetime, then the Standalone master’s web UI will automatically re-render the application’s UI after the application has finished.
But it did not work. What I can see in the UI is:
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Thank you!
You don't need to copy anything, just access port 8080 on the master machine or port 4040 on the application machine (while the application is running). If the machines are not externally accessible you have to tunnel through SSH.
Tunneling through SSH is a popular topic; just search for it. I personally use ssh -D 9999 and then set up localhost:9999 as a proxy using the FoxyProxy plugin, which exists for both Firefox and Chrome.
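For example (the login host, username, and master hostname are placeholders):

# Dynamic SOCKS forwarding through a machine that can reach the cluster.
ssh -N -D 9999 user@frontend.example.org

# Then configure the browser to use localhost:9999 as a SOCKS proxy
# (e.g. via FoxyProxy) and open http://spark-master:8080 or, while an
# application is running, http://<application-node>:4040.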

How to use Zookeeper with Azure HDInsight Linux cluster?

Obviously I need to start a zookeeper server on one of the cluster machines, then I need other client machines to connect to this server.
The way I did it: I used ssh to connect to the headnode and found a ZooKeeper server running on port 2181. So I used ifconfig to get the machine's IP address (for example 10.0.0.8) and then had my worker nodes connect to:
10.0.0.8:2181.
However, my MR job now completes but it runs slowly and the output is not correct. I suspect that I'm doing something wrong with ZooKeeper, especially since I didn't follow a tutorial and improvised my steps.
HDInsight has multiple ZooKeeper servers. I'm not sure whether specifying only one is the cause of the problem you are seeing.
I wrote an example a while back that uses Storm to write to HBase (both servers on the same Azure Virtual Network), and as part of the configuration I had to specify the three ZooKeeper servers for the component that writes to HBase. (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-storm-sensor-data-analysis/ is the article.)
From the cluster head node, you can probably ping zookeeper0, zookeeper1, and zookeeper2 to find the IP address of each.
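For example, from the head node (the hostnames may be named differently on your cluster):

# Resolve each quorum member's IP address.
for zk in zookeeper0 zookeeper1 zookeeper2; do
    ping -c 1 "$zk"
done

# Clients would then use the full quorum connect string, e.g.
#   zookeeper0:2181,zookeeper1:2181,zookeeper2:2181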

Spark Yarn-client mode across network via VPN

I have been trying to get Spark yarn-client mode working through a VPN. More specifically, the Spark driver is launched locally on my laptop, while the YARN cluster is in its own private network, reachable through a non-bridged VPN.
The first challenge was to make the Spark driver service reachable from the YARN cluster, since the VPN is one-way and my laptop is not routable from the cluster.
I managed to get this working by adding an entry in /etc/hosts to point a public domain name to my local network IP, something like
192.168.0.6 spark.driver.mydomain
Then I set spark.driver.host=spark.driver.mydomain.
Now the Spark driver can successfully bind to spark.driver.mydomain and tell the YARN application master to connect to spark.driver.mydomain. I also need to point spark.driver.mydomain to my public IP by modifying my domain's DNS, and configure the firewall to make the service publicly available.
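For reference, the driver-side launch looks roughly like this (the driver port is arbitrary, and the class and jar names are placeholders):

# Launch from the laptop in client mode, advertising the VPN-reachable name
# configured above (use --master yarn --deploy-mode client on newer Spark).
spark-submit \
    --master yarn-client \
    --conf spark.driver.host=spark.driver.mydomain \
    --conf spark.driver.port=35000 \
    --class com.example.MyApp \
    my-application.jar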
Now I can run Spark from my laptop to drive the cluster; almost there. However, the Spark UI doesn't work: there is no way to connect to it, despite the message saying it was successfully started at spark.driver.mydomain:4040. I opened all the ports through my local network's firewall using DMZ, and I also tried the local network IP address. I can see that it is being redirected to the YARN resource manager's proxy link, http://resourcemanager/proxy/application_id, but it just times out eventually, and I haven't figured out how the proxy works.
The spark session also occasionally spits out warning messages like
WARN ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkExecutor#executor:port] has failed, address is
now gated for [5000] ms. Reason is: [Disassociated].
The basic Spark actions all work despite the warning messages.
There are still quite a few concerns and questions
Does the communication between the Spark driver and the YARN cluster contain unencrypted data in this scenario? Are there any data security concerns (assuming the VPN is secure)?
The Spark UI is not accessible, which is intolerable.
Warning messages
Is it really good practice to run the driver from a remote network in yarn-client mode? There are certainly benefits to doing so, but is the framework designed for this?
Finally, here is a JIRA issue that may lead to more general solutions. https://issues.apache.org/jira/browse/SPARK-5113
